18

I have 500K users and I need to compute a linear regression (with intercept) for each of them.

Each user has around 30 records.

I tried dplyr with lm and it is way too slow: around 2 seconds per user.

  df %>%
    group_by(user_id, add = FALSE) %>%
    do(lm = lm(Y ~ x, data = .)) %>%
    mutate(lm_b0 = summary(lm)$coeff[1],
           lm_b1 = summary(lm)$coeff[2]) %>%
    select(user_id, lm_b0, lm_b1) %>%
    ungroup()

I tried to use lm.fit, which is known to be faster, but it doesn't seem to be compatible with dplyr.

Is there a fast way to do a linear regression by group?

  • 2
    The answer is likely to depend on where the inefficiency comes from. Did you do some profiling? Is splitting up the data the slow process, or is fitting the model the slow part? If the latter is the case, you may want to look at e.g. `fastLm` from the `RcppArmadillo` package. – coffeinjunky Apr 22 '15 at 16:53
  • The slow parts are `do(lm = lm(Y ~ x, data = .))` and `mutate(lm_b0 = summary(lm)$coeff[1], lm_b1 = summary(lm)$coeff[2])`. How could I use fastLm on 500K users efficiently? –  Apr 22 '15 at 16:58
  • 1
    @psql I had to do this exact thing and the most efficient way I could think of was to do it this way, and it ended up taking around 2 hours with `dplyr`. Here is some benchmarking from a related question: http://stackoverflow.com/questions/29641366/data-table-option-for-check-and-batch-of-linear-models – Vedda Apr 22 '15 at 17:00
  • In my case it will take around 10 days, so I really need to speed up this solution! :) PS: it's totally fine if the results are rough. –  Apr 22 '15 at 17:08
  • 4
    Why not just fit a single regression using `user_id` as a factor and interaction term? `Y ~ x * user_id + 0` will get you a different slope and intercept for each user (assuming `user_id` is a factor). This will be *much* faster with `lm` and be even faster with `fastLm`. – Gregor Thomas Apr 22 '15 at 17:29

5 Answers

22

You can just use the basic closed-form formulas for the slope and intercept. lm does a lot of unnecessary work if all you care about are those two numbers. Here I use data.table for the aggregation, but you could do it in base R or dplyr as well (a dplyr sketch of the same formulas follows the timing below):

system.time(
  res <- DT[, 
    {
      ux <- mean(x)
      uy <- mean(y)
      slope <- sum((x - ux) * (y - uy)) / sum((x - ux) ^ 2)
      list(slope=slope, intercept=uy - slope * ux)
    }, by=user.id
  ]
)

For 500K users with ~30 observations each this produces (in seconds):

 user  system elapsed 
 7.35    0.00    7.36 

Or about 15 microseconds per user.
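
Since the question started from dplyr, here is a minimal sketch of the same closed-form computation written with dplyr (an assumption on my part that you want that flavor; it uses the DT object built in the Data section below, and relies on summarise() letting intercept refer to the slope computed just before it):

library(dplyr)

res_dplyr <- DT %>%
  group_by(user.id) %>%
  summarise(
    slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2),  # OLS slope
    intercept = mean(y) - slope * mean(x)                                   # OLS intercept
  )

It computes the same two numbers per user; the timing will of course differ from the data.table run above.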

Update: I ended up writing a bunch of blog posts that touch on this as well.

And to confirm this is working as expected:

> summary(DT[user.id==89663, lm(y ~ x)])$coefficients
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.1965844  0.2927617 0.6714826 0.5065868
x           0.2021210  0.5429594 0.3722580 0.7120808
> res[user.id == 89663]
   user.id    slope intercept
1:   89663 0.202121 0.1965844

Data:

library(data.table)
set.seed(1)
users <- 5e5
records <- 30
x <- runif(users * records)
DT <- data.table(
  x=x, y=x + runif(users * records) * 4 - 2, 
  user.id=sample(users, users * records, replace=T)
)
BrodieG
  • 1
    I started my answer "If all you want is coefficients...", but of course this answer is **the way** to do it if that's really all you want. – Gregor Thomas Apr 22 '15 at 19:11
  • @BrodieG Why do you return a list in your `fast_lm` function? Couldn't a vector be used, e.g. `c(slope, intercept)`? – Ricol Apr 22 '15 at 20:42
  • 1
    @Ricol, so that `data.table` would interpret them as distinct columns. – BrodieG Apr 22 '15 at 21:04
11

If all you want is coefficients, I'd just use user_id as a factor in the regression. Using @miles2know's simulated-data code (though renaming the object, since something other than exp() sharing that name looks odd to me):

dat <- data.frame(id = rep(c("a","b","c"), each = 20),
                  x = rnorm(60,5,1.5),
                  y = rnorm(60,2,.2))

mod = lm(y ~ x:id + id + 0, data = dat)

We fit no global intercept (+ 0) so that the intercept for each id is the id coefficient, and no x by itself, so that the x:id interactions are the slopes for each id:

coef(mod)
#      ida      idb      idc    x:ida    x:idb    x:idc 
# 1.779686 1.893582 1.946069 0.039625 0.033318 0.000353 

So, for the a level of id, the ida coefficient, 1.78, is the intercept and the x:ida coefficient, 0.0396, is the slope.

I'll leave the gathering of these coefficients into appropriate columns of a data frame to you (a rough sketch is included below)...
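
For what it's worth, a minimal sketch of that gathering step (assuming the mod fit above; the name parsing is just one way to line up intercepts and slopes):

cf     <- coef(mod)
ints   <- cf[!grepl("^x:", names(cf))]   # "ida", "idb", ... : per-id intercepts
slopes <- cf[ grepl("^x:", names(cf))]   # "x:ida", ...      : per-id slopes

coef_df <- data.frame(
  id        = sub("^id", "", names(ints)),
  intercept = unname(ints),
  slope     = unname(slopes[paste0("x:", names(ints))]),  # align slopes with the same ids
  row.names = NULL
)
coef_df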

This solution should be very fast because you're not having to deal with subsets of data frames. It could probably be sped up even more with fastLm or such.

Note on scalability:

I did just try this on @nrussell's simulated full-size data and ran into memory allocation issues. Depending on how much memory you have, it may not work in one go, but you could probably do it in batches of user ids (a rough sketch of that follows). Some combination of his answer and mine might be the fastest overall, or nrussell's might simply be faster: expanding the user id factor into thousands of dummy variables is unlikely to be computationally efficient, as I've now been waiting more than a couple of minutes for a run on just 5000 user ids.
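
A rough sketch of that batching idea, assuming @nrussell's full-size dt (columns user_id, x, Y) and an arbitrary chunk size:

ids    <- unique(dt$user_id)
chunks <- split(ids, ceiling(seq_along(ids) / 1000))   # ~1000 users per batch (arbitrary)

coef_list <- lapply(chunks, function(chunk) {
  batch <- dt[dt$user_id %in% chunk, ]
  batch$user_id <- factor(batch$user_id)               # drop unused levels within the batch
  coef(lm(Y ~ x:user_id + user_id + 0, data = batch))
})

Each element of coef_list can then be reshaped with the same kind of name parsing as in the sketch above.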

Gregor Thomas
  • 3
    `sparse.model.matrix()` (from the `Matrix` package) and `lm.fit` might be worth considering (a rough sketch of the sparse idea follows these comments). I'd be curious how `lme4::lmer` would do on this problem ... – Ben Bolker Apr 22 '15 at 19:18
  • 2
    Or maybe chunk the users into a bunch of groups and use `parallel` or other multicore tools? – Carl Witthoft Apr 22 '15 at 19:21
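
Picking up the sparse.model.matrix() suggestion from the comment above: lm.fit() itself expects a dense matrix, so this sketch (assuming the small dat example from the answer and the Matrix package) solves the sparse normal equations directly instead:

library(Matrix)

X <- sparse.model.matrix(~ x:id + id + 0, data = dat)  # sparse dummy/interaction design
# Solve (X'X) beta = X'y; the symmetric sparse crossproduct is handled by a sparse Cholesky.
beta <- solve(crossprod(X), crossprod(X, dat$y))
beta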
8

Update: As pointed out by Dirk, my original approach can be greatly improved upon by specifying x and Y directly rather than using the formula-based interface of fastLm, which incurs (a fairly significant) processing overhead. For comparison, using the original full size data set,

R> system.time({
  dt[, c("lm_b0", "lm_b1") := as.list(
       unname(fastLm(cbind(1, x), Y)$coefficients)),  # explicit intercept column
     by = "user_id"]
})
#  user  system elapsed 
#55.364   0.014  55.401 
##
R> system.time({
  dt[,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients))
    ,by = "user_id"]
})
#   user  system elapsed 
#356.604   0.047 356.820

this simple change yields roughly a 6.5x speedup.


[Original approach]

There is probably some room for improvement, but the following took about 25 minutes on a Linux VM (2.6 GHz processor), running 64-bit R:

library(data.table)
library(RcppArmadillo)
##
dt[
  ,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients)),
  by=user_id]
##
R> dt[c(1:2, 31:32, 61:62),]
   user_id   x         Y     lm_b0    lm_b1
1:       1 1.0 1674.8316 -202.0066 744.6252
2:       1 1.5  369.8608 -202.0066 744.6252
3:       2 1.0  463.7460 -144.2961 374.1995
4:       2 1.5  412.7422 -144.2961 374.1995
5:       3 1.0  513.0996  217.6442 261.0022
6:       3 1.5 1140.2766  217.6442 261.0022

Data:

dt <- data.table(
  user_id = rep(1:500000,each=30))
##
dt[, x := seq(1, by=.5, length.out=30), by = user_id]
dt[, Y := 1000*runif(1)*x, by = user_id]
dt[, Y := Y + rnorm(
  30, 
  mean = sample(c(-.05,0,0.5)*mean(Y),1), 
  sd = mean(Y)*.25), 
  by = user_id]
nrussell
  • 2
    Did it really take 25 minutes, not seconds? – BrodieG Apr 22 '15 at 19:01
  • 1
    @BrodieG Yes, I was also surprised by that. I'm assuming it's because I ran this on a virtual machine with somewhat limited resources rather than a proper (4+ Core, 8 GB RAM, etc...) desktop machine. I like your thinking-outside-of-the-box approach btw, +1 – nrussell Apr 22 '15 at 19:09
  • 1
    this is a neat approach. – miles2know Apr 23 '15 at 00:46
  • 1
    Rookie error :) Deparsing the formula takes *way more* time than `fastLm` saves. Try again with explicit `y` and `X` as vector and matrix. – Dirk Eddelbuettel Dec 02 '15 at 02:01
  • @DirkEddelbuettel Whoops - good catch! I knew there was something fishy about those results... – nrussell Dec 02 '15 at 13:41
  • 1
    Adding the formula interface was a mixed blessing. We want it for consistency of interfaces but not for messing with performance ... Nice update to the answer, by the way. – Dirk Eddelbuettel Dec 02 '15 at 15:18
6

You might give data.table a try, like this. I've just created some toy data, but I'd imagine data.table would give some improvement; it's quite speedy. Since yours is quite a large data set, perhaps benchmark this approach on a smaller sample first to see whether the speed is much better. Good luck.


    library(data.table)

    exp <- data.table(id = rep(c("a","b","c"), each = 20), x = rnorm(60,5,1.5), y = rnorm(60,2,.2))
    # edit: it might also help to set a key on id with such a large data-set
    # with the toy example it would make no diff of course
    exp <- setkey(exp,id)
    # the nuts and bolts of the data.table part of the answer
    result <- exp[, as.list(coef(lm(y ~ x))), by=id]
    result
       id (Intercept)            x
    1:  a    2.013548 -0.008175644
    2:  b    2.084167 -0.010023549
    3:  c    1.907410  0.015823088
miles2know
1

An example using Rfast.

Assuming a single response and 500K predictor variables.

library(Rfast)
y <- rnorm(30)
x <- matrnorm(500 * 1000, 30)   # 500K x 30 matrix of standard normal values
system.time( Rfast::univglms(y, x, "normal") )  ## 0.70 seconds

Assuming 500K response variables and a single predictor variable.

system.time( Rfast::mvbetas(x,y) )  ## 0.60 seconds

Note: the above times should decrease in the near future.

Mike