I have a relatively large NumPy array (nearly 300k rows and 20+ columns, though most values are 0) for which I need to compute a distance matrix using scikit-learn's pairwise_distances function.

Unfortunately, this process runs into a memory error unless I convert the input array to a sparse matrix. SciPy offers many sparse matrix classes and I do not know which one is best for this particular situation.

I found an SO answer that favors CSR or CSC, but I am unsure which of the two is better suited to computing a distance matrix. Any suggestions are welcome!

Gyan Veda
  • A distance matrix isn't sparse. Well, I suppose it could be sparse if you have a lot of duplicate points, but that is very rarely ever the case. – jme Dec 01 '14 at 19:38
  • The input array, not the distance matrix, is the one that I want to transform to a sparse matrix. – Gyan Veda Dec 01 '14 at 19:41
  • Ah, I see. But even then, the resulting distance matrix will have `n` choose 2 entries, which (for `n`=300,000) most certainly won't fit into memory. So converting the input array to a sparse array won't help much, I think. – jme Dec 01 '14 at 19:44
  • If you want to compute statistics over the pairwise distances, it might not make sense to keep the entire array in memory anyway. What are you doing with this matrix? – Hooked Dec 01 '14 at 19:51
  • The distance matrix will be an input to scikit-learn's `silhouette_score` function, which evaluates a clustering solution. I precompute the distance matrix because `pairwise_distances` can be parallelized, whereas `silhouette_score`, which computes a distance matrix in the background, cannot. – Gyan Veda Dec 01 '14 at 19:54
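
The size concern raised in the comments above can be checked with a quick calculation. This is only a rough sketch: it assumes 8-byte `float64` entries and ignores any overhead, and the variable names are illustrative.

```python
from math import comb

n = 300_000                       # approximate number of rows from the question

n_pairs = comb(n, 2)              # unique unordered pairs ("n choose 2")
bytes_condensed = n_pairs * 8     # assumption: one float64 (8 bytes) per pair
bytes_full = n * n * 8            # full square n x n float64 matrix

print(n_pairs)                    # 44999850000
print(bytes_condensed / 1e9)      # size in GB if only unique pairs are stored
print(bytes_full / 1e9)           # size in GB for the full square matrix
```

Under these assumptions, storing only the unique pairs takes about 360 GB and the full square matrix about 720 GB, regardless of how the input array is stored.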

1 Answer

CSR stores its data row by row and CSC column by column, so accessing rows is faster with CSR and accessing columns is faster with CSC. `sklearn.metrics.pairwise.pairwise_distances` takes an input `X` whose rows are instances and whose columns are attributes, so it will be accessing rows of the sparse matrix. CSR is therefore likely to be the more efficient choice.
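
As a minimal sketch of what this could look like, the snippet below converts a mostly-zero array to CSR and passes it to `pairwise_distances`. The data is a small random stand-in (1,000 rows, not the 300k from the question), and the names `X_dense`, `X_csr`, and `D` are placeholders chosen for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: a small, mostly-zero array.
X_dense = rng.random((1000, 20))
X_dense[X_dense < 0.9] = 0.0      # zero out roughly 90% of the entries

# Convert to CSR (row-oriented) storage.
X_csr = sparse.csr_matrix(X_dense)

# pairwise_distances accepts sparse input for metrics such as "euclidean";
# the result is still a dense (n_rows x n_rows) array.
D = pairwise_distances(X_csr, metric="euclidean")

print(X_csr.format, D.shape)
```

Note that the sparse format only reduces the memory used by the input. The returned distance matrix is dense, so its size still grows with the square of the number of rows.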

Sid