I am using np.linalg.lstsq to compute a multiple linear regression. My data set is huge: it has 20,000 independent variables (X) and 1 dependent variable (Y), and each independent variable has 10,000 data points. Something like this:

                     X1    X2     X3    ...  X20,000    Y
    data1      ->    10    1.8    1     ...  1          3
    data2      ->    20    2.3    200   ...  206        5
    ...              ..    ..     ..    ...  ..         ..
    data10,000 ->    300   2398   878   ...  989        998

Computing the regression coefficients with np.linalg.lstsq takes a huge amount of time (20-30 minutes). Can anybody suggest a solution with better computation time?
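
For reference, this is essentially what I am doing (random placeholder arrays with the shapes above stand in for my real data):

    import numpy as np

    # Placeholder arrays with the shapes described above; the real
    # values come from my dataset. Note: 10,000 x 20,000 float64
    # is roughly 1.6 GB in memory.
    X = np.random.randn(10000, 20000)
    Y = np.random.randn(10000)

    # This single call is what takes 20-30 minutes.
    coefficients = np.linalg.lstsq(X, Y, rcond=None)[0]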

user2567857
  • You have `20,000` independent variables and only `10,000` observation points? That seems like a problem to me: wouldn't you have some sort of multicollinearity issue? Aren't your degrees of freedom negative? Maybe you can do dimensionality reduction on your independent variables, for example PCA. – Akavall Jun 18 '14 at 15:21
  • No, this is not a problem, as I am transposing my X during the calculation. – user2567857 Jun 18 '14 at 15:47

2 Answers


The time spent seems to scale roughly as n**2.8, so you can increase the speed considerably by reducing the number of data points.
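
If you want to check that scaling on your own machine, here is a quick benchmark sketch (the sizes are arbitrary, and the exponent you measure will depend on your BLAS/LAPACK build):

    import time
    import numpy as np

    rng = np.random.RandomState(0)
    for n in (250, 500, 1000, 2000):
        # Keep the question's 1:2 row-to-column ratio at every size.
        X = rng.randn(n, 2 * n)
        Y = rng.randn(n)
        t0 = time.time()
        np.linalg.lstsq(X, Y, rcond=None)
        print(n, round(time.time() - t0, 3))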

If you downsample your data to only a thousand rows, you can do the computations in a couple of seconds. You can then repeat the analysis with a different random sample.

In order to combine the results, you have several options (see the sketch after this list):

  • Do, as is usual in cross-validation in statistics, and weight them by the inverse of the norm of the residuals (fast to compute, as it is already in the output).
  • Measure the real residuals for your full dataset (that takes less than three seconds) and either:
    • keep the best one, or
    • weight them by the inverse of the real residual norm.
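
A minimal sketch of the whole procedure, combining the downsampling above with both combination strategies (the subsample size, the number of repeats, and the reduced placeholder shapes are arbitrary choices for illustration):

    import numpy as np

    # Placeholder data, smaller than the question's 10,000 x 20,000
    # so the sketch runs quickly; substitute your real X and Y here.
    rng = np.random.RandomState(0)
    X = rng.randn(2000, 4000)
    Y = rng.randn(2000)

    n_rows = 500     # rows per random subsample (arbitrary choice)
    n_repeats = 5    # number of subsamples (arbitrary choice)

    # Fit on several random subsamples of the rows.
    solutions = []
    for _ in range(n_repeats):
        idx = rng.choice(X.shape[0], size=n_rows, replace=False)
        solutions.append(np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0])

    # Score each candidate by its residual norm on the FULL dataset;
    # one matrix-vector product per candidate, so this is cheap.
    residual_norms = np.array([np.linalg.norm(X.dot(c) - Y)
                               for c in solutions])

    # Keep the single best candidate...
    best = solutions[int(np.argmin(residual_norms))]

    # ...or weight all candidates by the inverse of their residuals.
    weights = 1.0 / residual_norms
    weights /= weights.sum()
    combined = sum(w * c for w, c in zip(weights, solutions))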

Which option is best depends on how much accuracy you need and on the nature of your data. If you just need a rough estimate under moderate noise, a single downsampling should work. Keep in mind that your problem is seriously underdetermined already (far more variables than observations), so your solutions will be degenerate.

Davidmh

Use PCA to reduce the number of input factors. This is good because you can specify what percentage of the variance you want to keep. I wouldn't be surprised if you could get rid of 90%+ of your dimensions and still keep most of the important features.

The basic idea is to map your data onto a set of axes of much lower dimension than the original 20,000.

For a ready-made implementation, check out http://mdp-toolkit.sourceforge.net/.
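
If you would rather stay with a more common stack than MDP, the same idea looks roughly like this with scikit-learn's PCA (the placeholder shapes are arbitrary; passing a float between 0 and 1 as n_components tells scikit-learn to keep just enough components to explain that fraction of the variance):

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder data standing in for the 10,000 x 20,000 matrix.
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 2000)
    Y = rng.randn(1000)

    # Keep enough principal components to explain 90% of the variance.
    pca = PCA(n_components=0.90)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)  # far fewer columns than the original 2000

    # Ordinary least squares on the reduced design matrix is now cheap.
    coefficients = np.linalg.lstsq(X_reduced, Y, rcond=None)[0]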

user3684792