30

I would like to get the slope of a linear regression fit for 1M separate data sets (1M * 50 rows for a data.frame, or 1M * 50 for an array). I am currently using the lm() function, which takes a very long time (about 10 min).

Is there any faster function for linear regression?

Ben Bolker
Bangyou
  • 4
    You're complaining about ten minutes? Unbelievable. Parallelize the calculations if the 1M data sets are independent. – duffymo Aug 21 '14 at 00:18
  • 2
    Just to clarify, are you referring to a dataset of 1M rows or 1M separate datasets? If it's the latter, maybe you should think about the data fishing implications of what you are doing first. – thelatemail Aug 21 '14 at 00:20
  • @duffymo Sorry for the confusion. My dataset is about 1M * 54. I have already parallelized the computation across 16 cores. I understand that 10 min is not a big problem; I am just looking for a faster way to do linear regression. – Bangyou Aug 21 '14 at 00:22
  • @thelatemail It is 1 M separate datasets. – Bangyou Aug 21 '14 at 00:23
  • 8
    If you are only interested in the slope, it looks like you could calculate it directly using `sd` and `cor`. Check out this [post](http://statistics.about.com/od/Descriptive-Statistics/a/Slope-Of-Regression-Line-And-Correlation-Coefficient.htm). Slope = r*(sdy/sdx) – pbible Aug 21 '14 at 00:30
  • Thanks for your suggestion. Yes, I just need the slope. – Bangyou Aug 21 '14 at 00:38
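
The identity mentioned in the comments, Slope = r*(sdy/sdx), can be checked against lm() on a single simulated data set. This is only a minimal sketch: the data and the variable names (`x`, `y`, `slope_lm`, `slope_cor`) are made up for illustration and are not from the question.

```r
# Simulated data for illustration only (not the asker's data)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Slope from the usual formula interface
slope_lm <- unname(coef(lm(y ~ x))[2])

# Slope from the correlation and the two standard deviations
slope_cor <- cor(x, y) * sd(y) / sd(x)

# The two agree up to floating-point tolerance
print(all.equal(slope_lm, slope_cor))
```

This holds for simple regression with an intercept and a single predictor; it does not carry over to models with several predictors.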

4 Answers

29

Yes, there are:

  • R itself has lm.fit() which is more bare-bones: no formula notation, much simpler result set

  • several of our Rcpp-related packages have fastLm() implementations: RcppArmadillo, RcppEigen, RcppGSL.

We have described fastLm() in a number of blog posts and presentations. If you want the fastest possible fit, do not use the formula interface: parsing the formula and preparing the model matrix takes more time than the actual regression.
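
As a rough sketch of what skipping the formula interface looks like with base R's lm.fit(): the design matrix is built by hand, including the intercept column. The simulated data and the names `X`, `slope_fit`, and `slope_formula` are assumptions made for this example.

```r
# Simulated data for illustration only
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 1 + 3 * x + rnorm(n)

# lm.fit() takes a numeric design matrix, so the intercept
# column has to be added explicitly
X <- cbind(Intercept = 1, x = x)
fit <- lm.fit(X, y)
slope_fit <- unname(fit$coefficients["x"])

# Same slope as the formula interface, without formula parsing
slope_formula <- unname(coef(lm(y ~ x))["x"])
print(all.equal(slope_fit, slope_formula))
```

The result of lm.fit() is a plain list rather than an "lm" object, so methods such as summary() are not available on it.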

That said, if you are regressing a single vector on a single vector, you can simplify this further: the slope has a closed form, so no matrix package is needed.
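
For the single-predictor case, one possible sketch computes every slope at once as the covariance of x and y divided by the variance of x, using only base R. It assumes a layout the question does not fully specify: one data set per row, with `x` and `y` stored as matrices of the same shape. The helper `row_slopes()` is hypothetical, and the example uses 1,000 data sets rather than 1M to keep it small.

```r
# Assumed layout: each row is one data set of 50 observations
set.seed(123)
n_sets <- 1000   # the question has 1M; kept small here
n_obs  <- 50
x <- matrix(rnorm(n_sets * n_obs), nrow = n_sets)
y <- 0.5 * x + matrix(rnorm(n_sets * n_obs), nrow = n_sets)

# Hypothetical helper: per-row slope = sum(xc * yc) / sum(xc^2),
# where xc and yc are the row-centered values
row_slopes <- function(x, y) {
  xc <- x - rowMeans(x)   # recycling subtracts each row's mean
  yc <- y - rowMeans(y)
  rowSums(xc * yc) / rowSums(xc^2)
}

slopes <- row_slopes(x, y)

# Spot check against lm() for the first data set
check <- unname(coef(lm(y[1, ] ~ x[1, ]))[2])
print(all.equal(slopes[1], check))
```

Because everything is done with whole-matrix arithmetic, there is no per-data-set function call at all, which is where most of the time goes when lm() is called in a loop.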

Dirk Eddelbuettel
18

Since R 3.1.0 there is a .lm.fit() function. It should be faster than both lm() and lm.fit().

It is described, and its performance compared with other lm functions, here: https://rpubs.com/maechler/fast_lm.
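
A minimal usage sketch of .lm.fit(), assuming R >= 3.1.0 and simulated data; the variable names are chosen for this example only. Like lm.fit(), it expects a numeric design matrix with the intercept column already included.

```r
# Simulated data for illustration only
set.seed(7)
x <- rnorm(50)
y <- -1 + 0.8 * x + rnorm(50)

# Column 1 is the intercept, column 2 the predictor
X <- cbind(1, x)
fit <- .lm.fit(X, y)

# Coefficients come back as an unnamed numeric vector,
# in the same order as the columns of X
slope_dot <- fit$coefficients[2]

# Agrees with the formula interface
slope_ref <- unname(coef(lm(y ~ x))[2])
print(all.equal(slope_dot, slope_ref))
```

Note that .lm.fit() does very little input checking, so the caller is responsible for passing a well-formed matrix.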

Jot eN
  • What if I want to use a model with random effects? Anything fast and able to deal with large datasets? lme4 is very slow and needs a lot of memory. – skan Nov 16 '18 at 19:17
8

lmfit() in the Rfast package is even faster than .lm.fit(). The only drawback is that it does not work when the design matrix is not of full rank.

6

speedlm() from the speedglm package should also do it, as it is designed to work on large data sets.