User-defined functions with multiple variables? / Function to operate on a range of named columns (instead of indexed by number)?

Question

I frequently want to perform functions on a set or range of columns in a dataframe. Most commonly, I want to take the mean of a range of columns that share a common prefix (in this toy example, VarA., VarB., and VarC.:

ID<-c(1:300) #participant ID column, N=300
VarA.1<-sample(x = 0:50,size = 300, replace = TRUE)
VarA.2<-sample(x = 0:50,size = 300, replace = TRUE)
VarA.3<-sample(x = 0:50,size = 300, replace = TRUE)
VarB.1<-sample(x = 0:30,size = 300, replace = TRUE)
VarB.2<-sample(x = 0:30,size = 300, replace = TRUE)
VarB.3<-sample(x = 0:30,size = 300, replace = TRUE)
VarC.1<-sample(x = 0:10,size = 300, replace = TRUE)
VarC.2<-sample(x = 0:10,size = 300, replace = TRUE)
VarC.3<-sample(x = 0:10,size = 300, replace = TRUE)


df<-data.frame(ID,VarA.1,VarA.2,VarA.3,
               VarB.1,VarB.2,VarB.3,
               VarC.1,VarC.2,VarC.3) 
rm(ID,VarA.1,VarA.2,VarA.3,
   VarB.1,VarB.2,VarB.3,
   VarC.1,VarC.2,VarC.3)

I usually have a ton of variables, so I can't memorize the column numbers. Let's say I want to take the average of all columns starting with VarA. and put it in a new column called VarA. Here is my usual approach:

x<-which(colnames(df)=="VarA.1")
y<-which(colnames(df)=="VarA.3")
df$VarA<-rowMeans(df[, c(x:y)])

Maybe I'm being too picky, but given that I have to do this (or something very similar) upwards of 20 times in some scripts, it just looks really messy and clunky, and it's hard to remember: I have to dig up a previous file and then copy and paste and carefully change all the values to fit my current dataset. I'd really like to make this into a function, but I'm not very familiar with user-defined functions and I'm having trouble figuring out how to deal with multiple variables.

The approach I tried was:

colmeans <- function(x,y,df,meancol) {
  first<-which(colnames(df)==x)
  last<-which(colnames(df)==y)
  df$meancol<-rowMeans(df[, c(first:last)])
}
colmeans("VarA.1","VarA.3",df,"VarA")

I could have sworn it worked at one point but I lost it and I can't remember what I changed. What am I missing?

I'm also open to other ideas about how to make this process more efficient.

akrun · Answer 1 · 2021-07-13T20:42:42.920

We can use split.default

lst1 <- lapply(split.default(df[-1], sub("\\.\\d+$", "", names(df)[-1])),
         rowMeans, na.rm = TRUE)
df[paste0(names(lst1), "_Mean")] <- lst1

-output

head(df, 3)
  ID VarA.1 VarA.2 VarA.3 VarB.1 VarB.2 VarB.3 VarC.1 VarC.2 VarC.3 VarA_Mean VarB_Mean VarC_Mean
1  1     25     40     27      6      9      2      3      1      0  30.66667  5.666667  1.333333
2  2     43     16     26     27      7      5      2     10      5  28.33333 13.000000  5.666667
3  3     14     34     38      9     10      7      3      9      9  28.66667  8.666667  7.000000

Or using tidyverse

library(dplyr)
library(tidyr)
df %>%
    pivot_longer(cols = -ID, names_to = ".value",
        names_pattern = "^([^.]+)\\.\\d+") %>% 
    group_by(ID) %>%
    summarise(across(everything(), mean, na.rm = TRUE, 
          .names = "{.col}_Mean"), .groups = 'drop') %>%
  left_join(df)

score 2 · Answer 2 · answered Jul 13 '21 at 20:59

Are you looking for such a solution?

library(dplyr)
df %>% 
  mutate(across(starts_with("VarA"), mean, .names = "mean_{.col}"))

Output:

+   head()
  ID VarA.1 VarA.2 VarA.3 VarB.1 VarB.2 VarB.3 VarC.1 VarC.2 VarC.3 mean_VarA.1 mean_VarA.2
1  1     41     37      7      0     14      1     10      7      0    23.97667    24.73667
2  2     50     36     20     16     29      7      0     10      9    23.97667    24.73667
3  3      2      5     43     20     24      9      5      8      8    23.97667    24.73667
4  4      9     39     41     15     21      5      9      6      3    23.97667    24.73667
5  5     38     25     37     20     19     24      6      5      4    23.97667    24.73667
6  6     12     27     47     28     14     14     10      5      2    23.97667    24.73667
  mean_VarA.3
1    26.16667
2    26.16667
3    26.16667
4    26.16667
5    26.16667
6    26.16667

User-defined functions with multiple variables? / Function to operate on a range of named columns (instead of indexed by number)?

2 Answers2