data.table: apply a user-defined function over rows

Question

I have a huge data.table dt (almost 1.5 million rows). Say I want to apply a user-defined function growth.ls to its rows, where scols (some columns in dt) supply the arguments:

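growth.ls <- function(values) {
  # Only defined for finite, strictly positive series
  if (any(!is.finite(values)) || any(values <= 0)) return(NA_real_)
  # Slope of log(values) regressed on the time index, then transformed
  exp(lm(log(values) ~ seq_along(values))$coefficients[[2]] - 1) * 100
}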
dt[, `:=`(var = growth.ls(as.numeric(.SD))), .SDcols = scols, by = 1:nrow(dt)]

This process takes a very long time. I do not know whether the problem is growth.ls itself or my use of by = 1:nrow(dt).

Yeah, that is not a good way to use a data.frame or data.table -- you are splitting up into a large number of rows, then coercing each row from a data.table to a numeric vector. Try using `melt` to put your data in long form instead maybe — Frank, Feb 19 '18 at 19:39
you can probably speed things up a lot by using `lm.fit`; pre-allocating your model matrix (`X <- cbind(1,1:ncol(dt))`); and possibly even computing the regression slope [directly](http://www.statisticshowto.com/how-to-find-a-linear-regression-slope/) - you can precompute everything except sum(y) and sum(x*y) ... — Ben Bolker, Feb 19 '18 at 20:05
not right now. maybe someone else will come along and provide one. — Ben Bolker, Feb 19 '18 at 20:49
relevant: https://stackoverflow.com/questions/29803993/fast-linear-regression-by-group/29806540#29806540 and a rolling version: https://codereview.stackexchange.com/questions/125509/rolling-regressions-in-r — chinsoon12, Feb 20 '18 at 01:20
Example data:
x <- data.table(y = letters[1:4], x1990 = c(1, 1, 1, 2), x1991 = c(2, 1, 1, 1), x1992 = c(3, 3, 3, 0.5), x1993 = c(5, 2, 2, 4), x1994 = c(7, 3, 5, NA_real_), x1995 = c(9, 8, 10, 1))
If growth.ls is applied to the year columns (x1990:x1995):
apply(x[, paste0("x", 1990:1995), with = FALSE], 1, growth.ls)
# [1] 56.88514 53.77536 58.00203 NA
I want to use this concept to produce a column in x called "growth" using data.table. — Mohamed, Feb 20 '18 at 04:38
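
Ben Bolker's suggestion above of computing the regression slope directly can be sketched in base R roughly as follows. This is only a sketch, not code from the thread: `growth_ls_rows` is a hypothetical helper name, and it assumes the selected columns are equally spaced periods in time order, so the time index is 1, 2, ..., n exactly as `seq_along(values)` gives in `growth.ls`. It keeps the same NA rule and the same final transformation as `growth.ls` in the question.

```r
# Hypothetical helper (not from the thread): row-wise growth.ls without lm().
# `m` is a numeric matrix, one series per row, one period per column.
growth_ls_rows <- function(m) {
  m <- as.matrix(m)
  # Same rule as growth.ls: any non-finite or non-positive value gives NA
  bad <- rowSums(!is.finite(m) | m <= 0) > 0
  m[bad, ] <- 1  # placeholder so log() is defined; these rows are set to NA below
  # Centred time index: slope = sum((t - mean(t)) * y) / sum((t - mean(t))^2)
  t_c <- seq_len(ncol(m)) - (ncol(m) + 1) / 2
  slope <- drop(log(m) %*% t_c) / sum(t_c^2)
  out <- exp(slope - 1) * 100  # same transformation as growth.ls
  out[bad] <- NA_real_
  out
}

# Year columns x1990..x1995 of the example table from the comment above
m <- rbind(c(1, 2, 3,   5, 7,  9),
           c(1, 1, 3,   2, 3,  8),
           c(1, 1, 3,   2, 5, 10),
           c(2, 1, 0.5, 4, NA, 1))
print(growth_ls_rows(m))
```

With the example table from the comment above, something like `x[, growth := growth_ls_rows(as.matrix(.SD)), .SDcols = paste0("x", 1990:1995)]` should add the column in one vectorised call, and the values should agree with the `apply(..., 1, growth.ls)` output quoted there.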

score 1 · Accepted Answer · answered Feb 21 '18 at 12:37

What about this (using multiple cores with data.table):

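library(parallel)

cl = makeCluster(detectCores())              # one worker per core
choose_cols = startsWith(colnames(dt), 'x')  # logical mask of the year columns

# parApply() splits the rows across the workers and applies growth.ls to each
dt[, growth := unlist(parApply(cl, .SD, 1, growth.ls)), .SDcols = choose_cols]

stopCluster(cl)                              # release the workers when done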
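
For anyone who wants to try the parallel approach on a small input first, here is a minimal sketch using the example rows from the comments. The worker count of 2 and the `tryCatch(..., finally = )` cleanup are assumptions added here, not part of the original answer; `growth.ls` is copied from the question.

```r
library(parallel)

# growth.ls as defined in the question
growth.ls <- function(values) {
  if (any(!is.finite(values)) || any(values <= 0)) return(NA_real_)
  exp(lm(log(values) ~ seq_along(values))$coefficients[[2]] - 1) * 100
}

# Year columns x1990..x1995 of the example table from the comments
m <- rbind(c(1, 2, 3,   5, 7,  9),
           c(1, 1, 3,   2, 3,  8),
           c(1, 1, 3,   2, 5, 10),
           c(2, 1, 0.5, 4, NA, 1))

cl <- makeCluster(2)  # assumption: two workers are enough for a smoke test
res <- tryCatch(
  parApply(cl, m, 1, growth.ls),  # rows are split across the workers
  finally = stopCluster(cl)       # always shut the workers down
)
print(res)
```

When the per-row work is cheap, the cost of shipping rows to the workers can outweigh the gain, so it is probably worth timing both the serial and the parallel variant on a subset before running the full table.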

answered by YOLO

Seriously, I do not know what to say except thank you very much for your help and support. parallel is powerful: it drove all processors up to 99% and finished the whole job in 5 minutes. – Mohamed Feb 23 '18 at 08:54
Glad to know. Accept the answer if it solves your problem. It helps us in keeping track. :) – YOLO Feb 23 '18 at 12:16
