The scikit-learn documentation explains that fit_transform can only be used with dense matrices, but I have a sparse matrix in CSR format that I want to run t-SNE on. The documentation says to use the fit method for sparse matrices, but this doesn't return the low-dimensional embedding.

I appreciate I could use the .todense() method as in this question, but my data set is very large (0.4*10^6 rows and 0.5*10^4 columns), so it won't fit in memory. Really, it would be nice to do this using sparse matrices. Is there a way to use scikit-learn's TSNE (or any other Python implementation of t-SNE) to reduce the dimensionality of a large sparse matrix and return the low-dimensional embedding so I can visualize it?

PyRsquared

1 Answer

From that same documentation:

It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.

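Use sklearn.decomposition.TruncatedSVD first: it accepts sparse input directly and returns a dense array with far fewer columns (e.g. 50), which you can then pass to TSNE.fit_transform.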
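
As a rough illustration of that two-step approach, here is a minimal sketch. The helper name embed_sparse, the parameter values, and the small random CSR matrix are all made up for the example; they are not taken from the question's data, and the right number of SVD components and perplexity will depend on your data set.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE


def embed_sparse(X, n_svd_components=50, perplexity=30.0, random_state=0):
    """Reduce a sparse matrix with TruncatedSVD, then embed it in 2-D with t-SNE.

    Hypothetical helper for illustration only.
    """
    # TruncatedSVD works on scipy sparse matrices without densifying them
    # and returns a dense array of shape (n_samples, n_svd_components).
    svd = TruncatedSVD(n_components=n_svd_components, random_state=random_state)
    X_reduced = svd.fit_transform(X)

    # t-SNE now only sees the small dense array.
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=random_state)
    return tsne.fit_transform(X_reduced)


# Toy stand-in for the real data: 200 rows, 500 columns, 5% non-zero entries.
X = sp.random(200, 500, density=0.05, format="csr",
              dtype=np.float32, random_state=0)
embedding = embed_sparse(X)
print(embedding.shape)
```

With the dimensions from the question, the intermediate dense array would be about 0.4*10^6 rows by 50 columns, which is much smaller than the densified original, though the t-SNE step itself may still take a long time on that many rows.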

blacksite
  • From my understanding of TSNE, it can be used on data of any dimensionality, and given the structure of my data, I feel TSNE would work better than most other dimensionality reduction algorithms. I suppose this is a limitation of scikit-learn though. Thanks for the link! – PyRsquared Sep 26 '17 at 13:18
  • @killerT2333 Have you seen UMAP (https://github.com/lmcinnes/umap)? Here, https://github.com/mmortazavi/UMAP_Nonlinear-Dimensionality-Reduction_Benchmark/blob/master/UMAP_Benchmark.ipynb, I compared UMAP with other methods including tSNE. – TwinPenguins Aug 13 '18 at 05:52