
Part of the project I'm working on involves determining residuals, which I'm doing by fitting linear models.
Unfortunately, the packages I have found either do not meet my requirements or are glitchy.


I have tried using the following packages for my project.

  1. lm - standard linear modelling function in R
      + pros -- none
      - cons -- uses the standard statistics library, single core, cannot handle out-of-memory calculations
  2. fastLm - part of the RcppArmadillo package
      + pros -- multicore
      - cons -- cannot handle out-of-memory calculations
  3. biglm - part of the biglm package
      + pros -- specially designed to handle out-of-memory calculations by splitting up the data in chunks
      - cons -- single core
  4. speedlm - part of the speedglm package
      + pros -- multicore, should be able to handle out-of-memory calculations by splitting up the data in chunks

Some problems I personally ran into using speedlm, otherwise this would have been the package of choice:


After googling without success, I used the following code in an attempt to find new packages, trying different keywords, but I simply cannot seem to find any appropriate packages.

    library(sos)  # provides findFn
    find <- findFn("linear model lm", sortby = "function", maxPages = 10)
    format(find)

Are there any linear model packages besides the ones I mentioned above that meet the following requirements:

  • Ability to use multiple CPUs to calculate linear models
  • Ability to split up the dataset and update the linear model with chunks of the dataset (the sketch below illustrates what I mean)
  • Ability to get fitted values
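
To make the second requirement concrete, this is roughly the chunked-update workflow I have in mind, illustrated with `biglm`; the dataset, formula, and chunk count here are arbitrary choices for the example:

    # Sketch of the chunked-update workflow, using biglm.
    # mtcars, the 4 chunks, and the formula are illustrative placeholders.
    library(biglm)

    chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

    # Fit on the first chunk, then fold in the remaining chunks one at a time.
    fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
    for (ch in chunks[-1]) {
      fit <- update(fit, ch)
    }

    # Fitted values and residuals from the final coefficients.
    X    <- model.matrix(mpg ~ wt + hp, data = mtcars)
    yhat <- drop(X %*% coef(fit))
    res  <- mtcars$mpg - yhat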
  • @The person who downvoted, would you mind telling me how I can improve this question? – Bas Oct 21 '15 at 09:06
  • [H2O](http://h2o.ai/product/algorithms/) has a GLM that can handle this. But that's not in R, though you can run everything from R. – phiver Oct 21 '15 at 09:09
  • You can't. It's off-topic. And of course your pros/cons are very subjective. E.g., if you have a way to get fitted values you don't need a `residuals` method. – Roland Oct 21 '15 at 09:11
  • @Roland I have edited the pros/cons. And you are right, if I have a way to get the fitted values I don't need a `residuals` method. – Bas Oct 21 '15 at 09:19
  • Revolution R's edition (even the community one) uses multiple cores, vectorized SIMD (SSE/AVX) CPU operations *and* can process more data than can fit in memory. `lm` may end up being much faster than any other option simply because it uses vectorized operations. I've seen the `svd` command perform 7 times faster. – Panagiotis Kanavos Oct 21 '15 at 09:28
  • @PanagiotisKanavos - the proof is in the pudding, and Revolution's R may prove a bit faster due to the use of another BLAS for the vector operations, but the advantage in lm probably won't be produced by multicore operations; see the comments in http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html and evaluate for yourself. – russellpierce Oct 21 '15 at 09:34
  • I have, 7x faster `svd` than plain R on a 2-year old desktop i7 running Windows 7, and all cores working instead of 1 (which means hyperthreading is also used). The proof is in the pudding indeed. Besides, you link to an *ancient* page, current CPUs have more and wider SIMD commands – Panagiotis Kanavos Oct 21 '15 at 09:36
  • I'll have to give it a try again. I attempted a recompile of R using a parallel BLAS (not Intel's) and didn't see any advantage for the problem I was working on; so I abandoned that approach. As far as I can tell the svd in R uses DGESDD and ZGESDD but DGESVD uses QR and QR is still not getting a multicore advantage (as far as I can tell). Maybe you can provide an answer with some benchmarks just for the sake of the record? – russellpierce Oct 21 '15 at 09:46

1 Answer


Typical estimation procedures for linear models, e.g. what R uses for `lm`, involve a QR decomposition, which appears (in most BLASes; see the discussion below for more details) to be an inherently sequential process and therefore bound to a single core.
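
For concreteness, here is a toy sketch of that estimation route using R's built-in QR routines; the simulated data is purely illustrative:

    # Least-squares coefficients via QR decomposition, which is
    # essentially the route lm() takes internally.
    set.seed(1)
    X <- cbind(1, matrix(rnorm(100 * 2), ncol = 2))  # intercept + 2 predictors
    y <- drop(X %*% c(1, 2, -1)) + rnorm(100)

    qr_X  <- qr(X)             # Householder QR, a sequential algorithm
    beta  <- qr.coef(qr_X, y)  # solve R %*% beta = Q'y
    resid <- qr.resid(qr_X, y) # residuals straight from the decomposition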

Other methods may be multicore, but they may not accomplish your real aim: a faster calculation. I'll note two.

  1. You could explore alternate BLASes for R. However, as noted there, "multi-threaded BLAS libraries make no significant difference to real-world analysis problems using R". Revolution, for example, does provide a modified version of R that uses multiple cores when fitting some linear models... and it may indeed prove a bit faster on the parts of the operation involving vector operations. See the comments on one of their pages talking about the speed advantage of using a multicore BLAS and evaluate for yourself. Ultimately, the proof will be in the pudding: try it with your real-world problem and see if it gives you what you want (although I gather from the existing comments that it does not).
  2. You could look at results using the search term stochastic gradient descent. That method, given enough resources, may be able to give you a multicore solution that yields a speed benefit (see the sketch just below this list).
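
For the record, a minimal single-core sketch of stochastic gradient descent for least squares; the learning rate, epoch count, and toy data are arbitrary choices, and a real multicore variant (e.g. averaging gradients across workers) would be built on top of this loop:

    # Naive stochastic gradient descent for linear regression.
    # Learning rate and epoch count are arbitrary illustrative choices.
    set.seed(1)
    n <- 1000
    X <- cbind(1, rnorm(n))            # intercept + one predictor
    y <- drop(X %*% c(2, 3)) + rnorm(n)

    beta <- rep(0, ncol(X))
    lr   <- 0.01
    for (epoch in 1:5) {
      for (i in sample(n)) {           # visit rows in random order
        err  <- y[i] - sum(X[i, ] * beta)
        beta <- beta + lr * err * X[i, ]
      }
    }
    beta                               # should be close to c(2, 3)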

As an aside, the two methods you endorsed as multicore don't, on quick review, seem to me to be truly multicore. In general, it is easy to split data into chunks, and again I might be wrong, but I don't think you'll be able to process those chunks in parallel and recombine the models... that is, unless you are willing to do something general (in which case the methods you reject will work just as well).

The something general you might do, if you are willing to be a bit imprecise, is (sketched in code after the caveat below):

  1. split your data up into samples
  2. run the samples separately and in parallel
  3. collect your regression coefficients and use the mean coefficients as actual coefficients
  4. calculate your predictions
  5. calculate your residuals

... but that doesn't solve your RAM issue, and again I question whether you'll find enough of a speed benefit to make it worth your while.
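
A rough sketch of that recipe, using base R's parallel package; the dataset, formula, and chunk count are placeholders, and note that `mc.cores > 1` is not supported on Windows:

    # Split / fit in parallel / average coefficients / predict / residuals.
    # Assumes the chunks are comparable random samples of the full data.
    library(parallel)

    chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))  # step 1

    fits  <- mclapply(chunks, function(d) lm(mpg ~ wt + hp, data = d),
                      mc.cores = 4)                               # step 2
    beta  <- rowMeans(sapply(fits, coef))                         # step 3

    X     <- model.matrix(mpg ~ wt + hp, data = mtcars)
    pred  <- drop(X %*% beta)                                     # step 4
    resid <- mtcars$mpg - pred                                    # step 5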


  • A badly formatted answer attracts downvotes because it's *very* difficult to read – Panagiotis Kanavos Oct 21 '15 at 09:26
  • Formatting was a work in progress. Hopefully the above is now improved. – russellpierce Oct 21 '15 at 09:27
  • I'm almost certain this is wrong - Revolution R's distribution uses SIMD operations through the Intel Primitives library which *does* provide QR decomposition. It also provides multicore processing although that's probably on top of the Intel libraries. Just because a calculation is sequential doesn't mean it can't be performed using SIMD – Panagiotis Kanavos Oct 21 '15 at 09:34
  • @Rpierce Thanks for your in-depth answer. I forgot to mention in the question that I'm using `RRO` (Revolution R Open), which has improved BLASes built in. About the multicore part: `fastLm` and `speedlm` use all of my CPUs and result in much faster calculations. So what you are saying is that this is likely to be less accurate than the default `lm` function in R? – Bas Oct 21 '15 at 09:34
  • @Bas RRO already uses multiple cores, did you check your CPU to see whether all cores are working? – Panagiotis Kanavos Oct 21 '15 at 09:35
  • @PanagiotisKanavos, You're right, I'll correct in line with my comment in response to the question. – russellpierce Oct 21 '15 at 09:36
  • @PanagiotisKanavos Yes, it does use multiple cores when performing `fastLm`/`speedlm` – Bas Oct 21 '15 at 09:37