I have seen examples of using .SD with lapply in data.table with a simple function, as below:

DT[ , c("b","d","e") := lapply(.SD, tan), .SDcols = c("b","d","e")]

But I'm unsure how to use column-specific arguments in a multiple-argument function. For instance, I have a winsorize function that I want to apply to a subset of columns in a data.table, using column-specific percentiles, e.g.

library(DescTools)
wlevel <- list(b=list(lower=0.01,upper=0.99), c=list(lower=0.02,upper=0.95))
DT[ , c("b","c") := lapply(.SD, function(x)
{winsorize(x,wlevel$zzz$lower,wlevel$zzz$upper)}), .SDcols = c("b","c")]

Here zzz stands for the column currently being iterated over. I have also seen threads on passing changing arguments to lapply, but not in the context of data.table with .SDcols.

Is this possible to do?

This is a toy example; I'm looking to generalize to an arbitrarily large number of columns. Looping is always an option, but I'm trying to see if there's a more elegant/efficient solution...

smci
jsilva99
  • There is more than one `winsorize` in R packages. To make your code example reproducible you need `library(DescTools)`. – smci Aug 28 '18 at 17:21
  • Also, better to use [list notation](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.html) for multiple columns: `DT[ , .(b,d,e) := lapply(.SD, tan), .SDcols = .(b,d,e)]` – smci Aug 28 '18 at 17:24

1 Answer

How to use column-specific arguments in a multiple argument function?

Use mapply(FUN, dat, params1, params2, ...) where each of params1, params2, ... can be a list or vector; mapply iterates over each of dat, params1, params2, ... in parallel.

Note that unlike the rest of the apply/lapply/sapply family, with mapply the function argument comes first, then the data and parameter(s).
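
To make the parallel iteration concrete, here is a minimal base-R sketch. The data and the parameter names (dat, shift, scale) are made up for illustration and are not from the question; the point is only that mapply calls the function once per position, pairing each element of the data with the matching element of each parameter list.

```r
# Hypothetical toy data: two "columns" and one parameter per column.
dat   <- list(b = c(1, 2, 3), c = c(10, 20, 30))
shift <- list(b = 100, c = 1000)
scale <- list(b = 2,   c = 3)

# mapply calls FUN(dat[[i]], shift[[i]], scale[[i]]) for i = 1, 2.
# SIMPLIFY = FALSE keeps the result as a list with one element per column;
# the default (SIMPLIFY = TRUE) collapses equal-length results into a matrix.
res <- mapply(function(x, s, k) (x + s) * k, dat, shift, scale,
              SIMPLIFY = FALSE)

# Map() is shorthand for mapply(..., SIMPLIFY = FALSE).
res_map <- Map(function(x, s, k) (x + s) * k, dat, shift, scale)
```

Keeping the result as a list matters when it is assigned back to several columns, since each list element then maps to one column.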

In your case, instead of your nested list wlevel <- list(b=list(lower=0.01,upper=0.99), c=list(lower=0.02,upper=0.95)), it is probably easier to unpack it into two parallel lists and do something like this (pseudo-code, you may need to tweak it to get it to run):

w_lower <- list(b=0.01, c=0.02)
w_upper <- list(b=0.99, c=0.95) 

# SIMPLIFY = FALSE keeps the result as a list of columns rather than a matrix
DT[ , c('b','c') := mapply(function(x, w_lower_col, w_upper_col) { winsorize(x, w_lower_col, w_upper_col) },
  .SD, w_lower, w_upper, SIMPLIFY = FALSE), .SDcols = c('b', 'c')]

We shouldn't need to use column names (your zzz) to index into the lists; mapply() iterates over them positionally, so w_lower and w_upper just need to be in the same order as .SDcols.
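
For a version that can be run end to end, here is a sketch under some assumptions. winsorize_q is a hypothetical stand-in for your winsorize (it clamps to quantiles using base R and is not the API of any particular package), and DT is made-up data. It also indexes the parameter lists by the column vector, which is one way to keep them aligned with .SD even if the lists were written in a different order.

```r
library(data.table)

# Hypothetical stand-in for the asker's winsorize(): clamp x to its
# lower/upper quantiles. Not the API of any particular package.
winsorize_q <- function(x, lower, upper) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Made-up data for illustration.
set.seed(1)
DT <- data.table(a = 1:100, b = rnorm(100), c = rexp(100))

cols    <- c("b", "c")
w_lower <- list(c = 0.02, b = 0.01)   # deliberately out of order
w_upper <- list(b = 0.99, c = 0.95)

# Index the parameter lists by `cols` so they line up with .SD positionally,
# and use SIMPLIFY = FALSE so := receives a list of columns, not a matrix.
DT[, (cols) := mapply(winsorize_q, .SD, w_lower[cols], w_upper[cols],
                      SIMPLIFY = FALSE),
   .SDcols = cols]
```

Because everything is driven by cols, the same call should extend to many columns as long as every name in cols has an entry in both parameter lists.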

smci
  • Thanks for your answer @smci but is not working for me. I get warning messages: column matrix RHS of := will be treated as vector. Supplied 105866 items to be assigned to 52933 items of column 'b' (52933 unused). Any thoughts? Not sure why I am supplying 2x nrows – jsilva99 Aug 26 '18 at 18:44
  • After further exploration, setting SIMPLIFY=F in mapply fixes the problem. – jsilva99 Aug 27 '18 at 00:05