I am using np.linalg.lstsq to compute a multiple linear regression. My data set is huge: it has 20,000 independent variables (X) and 1 dependent variable (Y), and each independent variable has 10,000 data points. Something like this:

                     X1    X2     X3    ...  X20,000    Y
    data1      ->    10    1.8    1     ...  1          3
    data2      ->    20    2.3    200   ...  206        5
    ...              ..    ..     ..    ...  ..         ..
    data10,000 ->    300   2398   878   ...  989        998

Computing the regression coefficients with np.linalg.lstsq takes a huge amount of time (20-30 minutes). Can anybody suggest a solution with better computation time?
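
For reference, this is essentially what I am doing (random placeholder arrays with the shapes above stand in for my real data):

    import numpy as np

    # Placeholder arrays with the shapes described above; the real
    # values come from my dataset. Note: 10,000 x 20,000 float64
    # is roughly 1.6 GB in memory.
    X = np.random.randn(10000, 20000)
    Y = np.random.randn(10000)

    # This single call is what takes 20-30 minutes.
    coefficients = np.linalg.lstsq(X, Y, rcond=None)[0]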

user2567857
  • You have `20,000` independent variables and only `10,000` observation points? That seems like a problem to me: wouldn't you have some sort of multicollinearity issue? Aren't your degrees of freedom negative? Maybe you can do dimensionality reduction on your independent variables, for example PCA. – Akavall Jun 18 '14 at 15:21
  • No, this is not a problem, as I am transposing my X during the calculation. – user2567857 Jun 18 '14 at 15:47

2 Answers


The time spent seems to scale roughly as n**2.8, so you can increase the speed considerably by reducing the number of data points.
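
If you want to check that scaling on your own machine, here is a quick benchmark sketch (the sizes are arbitrary, and the exponent you measure will depend on your BLAS/LAPACK build):

    import time
    import numpy as np

    rng = np.random.RandomState(0)
    for n in (250, 500, 1000, 2000):
        # Keep the question's 1:2 row-to-column ratio at every size.
        X = rng.randn(n, 2 * n)
        Y = rng.randn(n)
        t0 = time.time()
        np.linalg.lstsq(X, Y, rcond=None)
        print(n, round(time.time() - t0, 3))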

If you downsample your data to only a thousand rows, you can do the computations in a couple of seconds. You can then repeat the analysis with a different random sample.

In order to combine the results, you have several options (see the sketch after this list):

  • Do, as is usual in cross-validation in statistics, and weight them by the inverse of the norm of the residuals (fast to compute, as it is already in the output).
  • Measure the real residuals for your full dataset (that takes less than three seconds) and either:
    • keep the best one, or
    • weight them by the inverse of the real residual norm.
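
A minimal sketch of the whole procedure, combining the downsampling above with both combination strategies (the subsample size, the number of repeats, and the reduced placeholder shapes are arbitrary choices for illustration):

    import numpy as np

    # Placeholder data, smaller than the question's 10,000 x 20,000
    # so the sketch runs quickly; substitute your real X and Y here.
    rng = np.random.RandomState(0)
    X = rng.randn(2000, 4000)
    Y = rng.randn(2000)

    n_rows = 500     # rows per random subsample (arbitrary choice)
    n_repeats = 5    # number of subsamples (arbitrary choice)

    # Fit on several random subsamples of the rows.
    solutions = []
    for _ in range(n_repeats):
        idx = rng.choice(X.shape[0], size=n_rows, replace=False)
        solutions.append(np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0])

    # Score each candidate by its residual norm on the FULL dataset;
    # one matrix-vector product per candidate, so this is cheap.
    residual_norms = np.array([np.linalg.norm(X.dot(c) - Y)
                               for c in solutions])

    # Keep the single best candidate...
    best = solutions[int(np.argmin(residual_norms))]

    # ...or weight all candidates by the inverse of their residuals.
    weights = 1.0 / residual_norms
    weights /= weights.sum()
    combined = sum(w * c for w, c in zip(weights, solutions))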

Which option is best depends on how much accuracy you need and on the nature of your data. If you just need a rough estimate under moderate noise, a single downsampling should work. Keep in mind that your problem is seriously underdetermined already (far more variables than observations), so your solutions will be degenerate.

Davidmh

Use PCA to reduce the number of input factors. This is good because you can specify what percentage of the variance you want to keep. I wouldn't be surprised if you could get rid of 90%+ of your dimensions and still keep most of the important features.

The basic idea is to map your data onto a set of axes of much lower dimension than the original 20,000.

For a ready-made implementation, check out http://mdp-toolkit.sourceforge.net/.
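
If you would rather stay with a more common stack than MDP, the same idea looks roughly like this with scikit-learn's PCA (the placeholder shapes are arbitrary; passing a float between 0 and 1 as n_components tells scikit-learn to keep just enough components to explain that fraction of the variance):

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder data standing in for the 10,000 x 20,000 matrix.
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 2000)
    Y = rng.randn(1000)

    # Keep enough principal components to explain 90% of the variance.
    pca = PCA(n_components=0.90)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)  # far fewer columns than the original 2000

    # Ordinary least squares on the reduced design matrix is now cheap.
    coefficients = np.linalg.lstsq(X_reduced, Y, rcond=None)[0]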

user3684792