
I have a 50,000 x 15 numpy matrix of continuous data. I want to use MDS (Multi-Dimensional Scaling) to reduce it to 2 components in order to visualise the data in a 2-D vector space. For some reason, whenever I run MDS on my data, memory and CPU usage climb sharply and my kernel crashes, telling me I need to restart. Has anyone run into similar issues, or does anyone know what may be causing this?
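
For reference, this is roughly what I'm running (a minimal sketch with random stand-in data; I'm using sklearn.manifold.MDS):

    import numpy as np
    from sklearn.manifold import MDS

    X = np.random.rand(50000, 15)  # stand-in for my actual continuous data

    mds = MDS(n_components=2)
    X_2d = mds.fit_transform(X)  # memory and CPU spike here, then the kernel dies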

I'm using a MacBook Air, 125GB SSD, 4GB RAM and my development environment is the Spyder IDE.

Thanks

daniel3412

2 Answers


I recommend running MDS with a 5% random sample (see the sketch below). Looking through the scikit-learn documentation, it seems most of the algorithms in the manifold learning module have complexity of O(n^2). There is no specific complexity documented for MDS, but comparing run times I can only assume it is O(n^2) or worse. With 50,000 samples, the pairwise dissimilarity matrix alone holds 50,000 x 50,000 float64 values, roughly 20 GB, far more than 4 GB of RAM. Too much data, an inefficient algorithm, and a small amount of RAM add up to a kernel crash.

http://scikit-learn.org/stable/modules/manifold.html#manifold
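
A minimal sketch of the subsampling approach (X here is a random placeholder for your 50,000 x 15 matrix):

    import numpy as np
    from sklearn.manifold import MDS

    X = np.random.rand(50000, 15)  # placeholder for your data matrix

    # Draw a 5% random sample of the rows without replacement.
    rng = np.random.RandomState(0)
    idx = rng.choice(X.shape[0], size=int(0.05 * X.shape[0]), replace=False)
    X_sample = X[idx]

    # MDS materialises an n x n dissimilarity matrix, so 2,500 rows cost
    # about 50 MB for that matrix, versus ~20 GB for all 50,000 rows.
    mds = MDS(n_components=2, random_state=0)
    X_2d = mds.fit_transform(X_sample)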

misspelled
  • +1 I have a similar issue. My kernel does not crash, but calculations do not finish even after a couple of hours. The only solution I found is to use a small sample, as recommended above. – lanenok Apr 24 '15 at 15:56

Our current implementation of MDS is based on the SMACOF method, which is too generic. A PCA / SVD might be much faster in many cases. That is planned as a pull request.

In the meantime, you can directly use sklearn.decomposition.RandomizedPCA instead of the MDS class.
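
For example (a minimal sketch; note that in newer scikit-learn releases RandomizedPCA has been folded into PCA(svd_solver='randomized')):

    import numpy as np
    from sklearn.decomposition import RandomizedPCA

    X = np.random.rand(50000, 15)  # placeholder for your data matrix

    # Randomized PCA never builds an n x n matrix, so memory stays
    # proportional to the input itself and runtime is roughly linear in n.
    pca = RandomizedPCA(n_components=2, random_state=0)
    X_2d = pca.fit_transform(X)  # shape (50000, 2), ready to scatter-plot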

ogrisel