
I have a large data.table `dt` (almost 1.5 million rows). Say I want to apply a user-defined function `growth.ls` to each of its rows, passing the values of the columns in `scols` (some of the columns in `dt`) as the argument:

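# Growth measure from a log-linear least-squares fit; NA if any value is non-finite or <= 0
growth.ls <- function(values) {
  if (any(!is.finite(values)) || any(values <= 0)) return(NA_real_)
  exp(lm(log(values) ~ seq_along(values))$coefficients[[2]] - 1) * 100
}
# One group per row: each .SD is a one-row data.table coerced to a numeric vector
dt[, var := growth.ls(as.numeric(.SD)), .SDcols = scols, by = 1:nrow(dt)]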

This takes a very long time. I do not know whether the problem is `growth.ls` itself or the fact that I am using `by = 1:nrow(dt)`.

Frank
Mohamed
  • Yeah, that is not a good way to use a data.frame or data.table -- you are splitting up into a large number of rows, then coercing each row from a data.table to a numeric vector. Try using `melt` to put your data in long form instead maybe – Frank Feb 19 '18 at 19:39
  • you can probably speed things up a lot by using `lm.fit`; pre-allocating your model matrix (`X <- cbind(1,1:ncol(dt))`); and possibly even computing the regression slope [directly](http://www.statisticshowto.com/how-to-find-a-linear-regression-slope/) - you can precompute everything except sum(y) and sum(x*y) ... – Ben Bolker Feb 19 '18 at 20:05
  • Is it possible to provide an example so I could try it? – Mohamed Feb 19 '18 at 20:24
  • not right now. maybe someone else will come along and provide one. – Ben Bolker Feb 19 '18 at 20:49
  • relevant: https://stackoverflow.com/questions/29803993/fast-linear-regression-by-group/29806540#29806540 and a rolling version: https://codereview.stackexchange.com/questions/125509/rolling-regressions-in-r – chinsoon12 Feb 20 '18 at 01:20
  • Example data: `x <- data.table(y = letters[1:4], x1990 = c(1,1,1,2), x1991 = c(2,1,1,1), x1992 = c(3,3,3,0.5), x1993 = c(5,2,2,4), x1994 = c(7,3, 5, NA_real_), x1995 = c(9, 8, 10,1))`. Applying `growth.ls` to the year columns (`x1990:x1995`) with `apply(x[, paste0("x", 1990:1995), with = FALSE], 1, growth.ls)` gives `[1] 56.88514 53.77536 58.00203 NA`. I want to use this approach to create a column in `x` called "growth" using data.table. – Mohamed Feb 20 '18 at 04:38
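
The suggestion in the comments above to compute the regression slope directly can be sketched roughly as follows. This is only a sketch under some assumptions: the helper name `growth_ls_rows` is hypothetical (it is not from the original post), the selected columns are assumed to be numeric, equally spaced in time, and at least two in number, and the final `exp(slope - 1) * 100` transform is copied unchanged from `growth.ls`. Because the time index is the same for every row, the slope of `log(values)` on `1:n` reduces to one matrix product, so no per-row `lm` call or `by = 1:nrow(dt)` is needed.

```r
library(data.table)

# Example table from the comments above
x <- data.table(y = letters[1:4],
                x1990 = c(1, 1, 1, 2), x1991 = c(2, 1, 1, 1),
                x1992 = c(3, 3, 3, 0.5), x1993 = c(5, 2, 2, 4),
                x1994 = c(7, 3, 5, NA_real_), x1995 = c(9, 8, 10, 1))
scols <- paste0("x", 1990:1995)

# Hypothetical helper: intended to match growth.ls, but for all rows at once
growth_ls_rows <- function(m) {
  bad <- rowSums(!is.finite(m) | m <= 0) > 0   # rows growth.ls would reject
  m[bad, ] <- 1                                # placeholder so log() is safe
  tc <- seq_len(ncol(m)) - (ncol(m) + 1) / 2   # centred time index 1..n
  slope <- as.vector(log(m) %*% tc) / sum(tc^2)  # least-squares slope per row
  out <- exp(slope - 1) * 100                  # same transform as growth.ls
  out[bad] <- NA_real_
  out
}

x[, growth := growth_ls_rows(as.matrix(.SD)), .SDcols = scols]
```

On the small example this should agree with the `apply(..., 1, growth.ls)` result quoted above; it has not been timed on the 1.5-million-row table, so any speed-up there is an expectation, not a measurement.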

1 Answer


What about this (using multiple cores with data.table):

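library(data.table)
library(parallel)

cl = makeCluster(detectCores())  # one worker per detected core
# df is the data.table; pick the columns whose names start with 'x'
choose_cols = startsWith(colnames(df), 'x')

df[, growth := unlist(parApply(cl, .SD, 1, growth.ls)), .SDcols = choose_cols]
stopCluster(cl)  # release the workers when done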
YOLO
  • Seriously, I do not know what to say except thank you very much for your help and support. parallel is powerful: it used all processors at up to 99% and finished the whole job in 5 minutes. – Mohamed Feb 23 '18 at 08:54
  • Glad to know. Accept the answer if it solves your problem. It helps us in keeping track. :) – YOLO Feb 23 '18 at 12:16