My first question in this forum: so kindly excuse me if I have missed some protocols:

I am struggling for some time now to pass column names inside my custom function while using dplyr - mutate_at. I have a dataset "dt" with thousands of columns and I would like to perform mutate for some of these columns, but in a way which is dependant on the column name

I have this piece of code

Option 1:
relevantcols = c("A", "B", "C")
myfunc <- function(colname, x) {
   #write different logic per column name
}
dt%>%
  mutate_at(relevantcols, funs(myfunc(<what should i give?>,.)))

I tried approaching the problem in another way, i.e by iterating over relevantcols and applying mutate_at for each of the elements of the vector as follows

Option 2:
for (i in 1:length(relevantcols)){
  dt%>%
  mutate_at(relevantcols[i], funs(myfunc(relevantcols[i], .))
}

I get the colnames in Option 2, but it is 10 times slower than Option 1 (only if I could get the column names somehow in Option 1). Hence it would be great if someone can suggest how I should approach if I stick to Option 1. Thanks

Adding an example for more clarity

df = data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))
df
 employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1        1     5.234383     6.857227     4.480943     7.233947
2        2     7.441399     3.777524     2.134075     6.310293
3        3     7.686558     8.598688     9.814882     9.192952
4        4     6.033345     5.658716     5.167388     3.018563
5        5     5.595006     7.582548     9.302917     6.071108
relevantcols = c("Mon_channelA", "Mon_channelB")
myfunc <- function(colname, x) {
#based on the channel and weekday, compare the data from corresponding column with  the same channel but different weekday and return T if higher else F
}
# required output
employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1        1     T     F     4.480943     7.233947
2        2     T     F     2.134075     6.310293
3        3     F     F     9.814882     9.192952
4        4     T     T     5.167388     3.018563
5        5     F     T     9.302917     6.071108
New contributor
Tanmoy Chatterjee is a new contributor to this site. Take care in asking for clarification, commenting, and answering. Check out our Code of Conduct.
  • Could you give a concrete example of the "different logic per column name", so we have an example function to work with? It will also help to have an example dataset along with your desired output. You can see ideas for how to make a reproducible example here. – aosmith 1 hour ago
  • Done, hope it helps in understanding the situation – Tanmoy Chatterjee 58 mins ago
  • I'm unclear on part of the logic of the output you want. For Monday you want true/false values, but for Tuesday you want the values corresponding to Tuesdays? Why not all true/false or all numeric? – camille 27 mins ago
  • @camille you are right. In the end all values will be boolean or numeric. so tuesday's values will have to be compared with wednesdays with some exceptions on weekends and so on....i just wanted to explain the point of having the requirement of passing the column name into myfunc...note that in the original dataset, I have more than 5000 columns with various combinations of channels and weekdays. – Tanmoy Chatterjee 15 mins ago

You can do things like :

L <- c("A","B")
df <- data.frame(A=rep(1:3,2),B=1:6,C=7:12)
df
#  A B  C
#1 1 1  7
#2 2 2  8
#3 3 3  9
#4 1 4 10
#5 2 5 11
#6 3 6 12

f <- function(x,y) x^y

df %>% mutate_at(L,funs(f(.,2)))
#  A  B  C
#1 1  1  7
#2 4  4  8
#3 9  9  9
#4 1 16 10
#5 4 25 11
#6 9 36 12
  • Thanks but still I don't get how f can return different values for A and B (which is what I want) – Tanmoy Chatterjee 56 mins ago

I left a comment about data types, but assuming that that is what you're looking for, here's the approach I take to these sorts of problems. I do this in a seemingly convoluted process of reshaping a few times, but it lets you set up the variables that you're trying to compare without hard-coding much. I'll break it into pieces.

library(tidyverse)

set.seed(928)
df <- data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))

First, I'd reshape it into a long shape and break the "Mon_channelA", etc apart into a day and a channel. This lets you use the channel designation to match values for comparison.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  head()
#>   employee day  channel    value
#> 1        1 Mon channelA 2.039619
#> 2        2 Mon channelA 8.153684
#> 3        3 Mon channelA 9.027932
#> 4        4 Mon channelA 1.161967
#> 5        5 Mon channelA 3.583353
#> 6        1 Mon channelB 7.102797

Then, bring it back into a wide format based on the days. Now you have a column for each day for each combination of employee and channel.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  head()
#>   employee  channel      Mon      Tue
#> 1        1 channelA 2.039619 9.826677
#> 2        1 channelB 7.102797 7.388568
#> 3        2 channelA 8.153684 5.848375
#> 4        2 channelB 6.299178 9.452274
#> 5        3 channelA 9.027932 5.458906
#> 6        3 channelB 7.029408 7.087011

Then do your comparison, and take the data long again. Note that because the value column has numeric values, everything becomes numeric and the logical values are converted to 1 or 0.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  head()
#>   employee  channel day value
#> 1        1 channelA Mon     0
#> 2        1 channelB Mon     0
#> 3        2 channelA Mon     1
#> 4        2 channelB Mon     0
#> 5        3 channelA Mon     1
#> 6        3 channelB Mon     0

Last few steps are to stick the day and channel back together to make the labels as you had them, spread back to a wide format, and turn all the columns starting with "Mon" back into logicals.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  unite("variable", day, channel) %>%
  spread(key = variable, value = value) %>%
  mutate_at(vars(starts_with("Mon")), as.logical)
#>   employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1        1        FALSE        FALSE     9.826677     7.388568
#> 2        2         TRUE        FALSE     5.848375     9.452274
#> 3        3         TRUE        FALSE     5.458906     7.087011
#> 4        4        FALSE        FALSE     8.854263     8.946458
#> 5        5        FALSE        FALSE     6.933054     8.450741

Created on 2018-09-28 by the reprex package (v0.2.1)

Your Answer

Tanmoy Chatterjee is a new contributor. Be nice, and check out our Code of Conduct.
 

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Not the answer you're looking for? Browse other questions tagged or ask your own question.