Update:
As pointed out by Dirk, my original approach can be greatly improved upon by specifying x
and Y
directly rather than using the formula-based interface of fastLm
, which incurs (a fairly significant) processing overhead. For comparison, using the original full size data set,
R> system.time({
dt[,c("lm_b0", "lm_b1") := as.list(
unname(fastLm(x, Y)$coefficients))
,by = "user_id"]
})
# user system elapsed
#55.364 0.014 55.401
##
R> system.time({
dt[,c("lm_b0","lm_b1") := as.list(
unname(fastLm(Y ~ x, data=.SD)$coefficients))
,by = "user_id"]
})
# user system elapsed
#356.604 0.047 356.820
this simple change yields roughly a 6.5x speedup.
[Original approach]
There is probably some room for improvement, but the following took about 25 minutes on a Linux VM (2.6 GHz processor), running 64-bit R:
library(data.table)
library(RcppArmadillo)
##
dt[
,c("lm_b0","lm_b1") := as.list(
unname(fastLm(Y ~ x, data=.SD)$coefficients)),
by=user_id]
##
R> dt[c(1:2, 31:32, 61:62),]
user_id x Y lm_b0 lm_b1
1: 1 1.0 1674.8316 -202.0066 744.6252
2: 1 1.5 369.8608 -202.0066 744.6252
3: 2 1.0 463.7460 -144.2961 374.1995
4: 2 1.5 412.7422 -144.2961 374.1995
5: 3 1.0 513.0996 217.6442 261.0022
6: 3 1.5 1140.2766 217.6442 261.0022
Data:
dt <- data.table(
user_id = rep(1:500000,each=30))
##
dt[, x := seq(1, by=.5, length.out=30), by = user_id]
dt[, Y := 1000*runif(1)*x, by = user_id]
dt[, Y := Y + rnorm(
30,
mean = sample(c(-.05,0,0.5)*mean(Y),1),
sd = mean(Y)*.25),
by = user_id]