How to improve efficiency when conditionally mutating multiple columns?

Question

Recently I posted a similar question. Nevertheless, while the solution kindly provided by @akun successfully delivers the desired output, I am facing problems concerning computation time when I apply it to my real data, which is considerably large with over 100000*500 datapoints.

I would like to know if there is any alternative approach for large data. Below I present my attempt to deal with the problem. It is based on parallel processing but so far it has been unsuccessful. I am still trying but any help would be much appreciated.

My data

df<-as.data.frame(structure(list(low_account = c(1, 1, 0.5, 0.5, 0.5, 0.5), high_account = c(16, 
16, 56, 56, 56, 56), mid_account_0 = c(8.5, 8.5, 28.25, 28.25, 
28.25, 28.25), mean_account_0 = c(31.174, 30.1922101449275, 30.1922101449275, 
33.3055555555556, 31.174, 33.3055555555556), median_account_0 = c(2.1, 
3.8, 24.2, 24.2, 24.2, 24.2), low_account.1 = c(1, 1, 0.5, 0.5, 0.5, 
0.5), high_account.1 = c(16, 16, 56, 56, 56, 56), row.names = c("A001", "A002", "A003", "A004", "A005", "A006"))))

df
  low_account high_account mid_account_0 mean_account_0 median_account_0 low_account.1 high_account.1 row.names
1         1.0           16          8.50       31.17400              2.1           1.0             16      A001
2         1.0           16          8.50       30.19221              3.8           1.0             16      A002
3         0.5           56         28.25       30.19221             24.2           0.5             56      A003
4         0.5           56         28.25       33.30556             24.2           0.5             56      A004
5         0.5           56         28.25       31.17400             24.2           0.5             56      A005
6         0.5           56         28.25       33.30556             24.2           0.5             56      A006

My attempt

library(tidyverse)
df %>% 
   parallel::mcmapply(as.matrix(mutate_at(vars(matches("(mean|median|midrange)account")), ~ replace(., .<= low_account | .>= high_account, NA))), df)

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'FUN' of mode 'function' was not found

Expected Output

df
    low_account high_account mid_account_0 mean_account_0 median_account_0 low_account.1 high_account.1 row.names
    1         1.0           16          8.50       NA                    2.1           1.0             16      A001
    2         1.0           16          8.50       NA                    3.8           1.0             16      A002
    3         0.5           56         28.25       30.19221             24.2           0.5             56      A003
    4         0.5           56         28.25       33.30556             24.2           0.5             56      A004
    5         0.5           56         28.25       31.17400             24.2           0.5             56      A005
    6         0.5           56         28.25       33.30556             24.2           0.5             56      A006

score 3 · Accepted Answer · answered Jul 17 '19 at 15:03

You could try a base R solution by first pulling out the columns we want to apply the condition to:

df_matches <-stringr::str_detect(names(df),'(mid|mean|median)_account')
df_matches <- names(df)[df_matches]

And then finding the subset that meets our condition and replacing those with NAs:

df[df_matches][df[df_matches] <= df$low_account | df[df_matches] >= df$high_account] <- NA

#   low_account high_account mid_account_0 mean_account_0 median_account_0 low_account.1
# 1         1.0           16          8.50             NA              2.1           1.0
# 2         1.0           16          8.50             NA              3.8           1.0
# 3         0.5           56         28.25       30.19221             24.2           0.5
# 4         0.5           56         28.25       33.30556             24.2           0.5
# 5         0.5           56         28.25       31.17400             24.2           0.5
# 6         0.5           56         28.25       33.30556             24.2           0.5
#   high_account.1 row.names
# 1             16      A001
# 2             16      A002
# 3             56      A003
# 4             56      A004
# 5             56      A005
# 6             56      A006

This is around 7x faster than your given solution with the data your provided:

library(microbenchmark)

microbenchmark(
  {
    df %>% 
      mutate_at(vars(matches("(mid|mean|median)_account")),
                ~ replace(., .<= low_account | .>= high_account, NA))


  },
  {

    df[df_matches][df[df_matches] <= df$low_account | df[df_matches] >= df$high_account] <- NA
  }


)


 # min       lq       mean     median    uq        max      neval
 # 2183.264 2295.653 2750.3255 2420.034 3003.7330 6188.024   100
 # 310.392  340.145  453.5984  410.258  449.3935  2005.300   100

score 1 · Answer 2 · answered Jul 18 '19 at 02:20

If OP dont mind using data.table package, here are some approaches that should be faster on 50 million rows:

library(data.table)
DT <- setDT(copy(df))
DT1 <- copy(DT)
DT2 <- copy(DT)
cols <- grep('(mid|mean|median)_account', names(DT), value=TRUE)

#approach 1
for (j in cols)
    set(DT, DT[!between(get(j), low_account, high_account, FALSE), which=TRUE], j, NA_real_)

#approach 2
DT1[, (cols) := {
        m <- copy(.SD); 
        m[.SD <= low_account | .SD >= high_account] <- NA; 
        m
    }, .SDcols=cols]

#approach 3
DT2[, lapply(.SD, function(x) replace(x, !between(x, low_account, high_account, FALSE), NA_real_)), 
    .SDcols=cols]

How to improve efficiency when conditionally mutating multiple columns?

2 Answers2