How to efficiently shuffle a scipy sparse matrix, whatever its format?

Question

How can I shuffle the rows of a scipy sparse matrix?

There is sklearn.utils.shuffle, but it returns a new matrix, so for a very large sparse matrix the shuffle is not done in place; the matrix is duplicated instead.

There is numpy.random.Generator.shuffle, but it seems to work only for CSR matrices.

How to efficiently shuffle the rows of a scipy sparse matrix, whatever the format used to store it in memory?

hpaulj · Accepted Answer · 2020-12-01T20:51:27.153

I'm consolidating my comments into one answer. It's not a solution, but editing is easier.

If you hope to find an efficient row shuffle regardless of sparse format, you have not studied the sparse matrix documentation enough. Only csr and lil store their data in row-oriented fashion.

I can imagine doing an in-place row shuffle with the lil format. While csr stores data in a row-oriented manner, its rows of differing lengths are packed end to end in shared data and indices arrays, so a row shuffle will be more complicated, and difficult to do in-place.
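To make the lil idea concrete, here is a minimal sketch, not a tested recipe for production use. It assumes the matrix is already in lil format and relies on its `rows` and `data` attributes, which are object arrays holding one Python list per row. The helper name `shuffle_lil_rows_inplace` is made up for illustration.

```python
import numpy as np
from scipy import sparse


def shuffle_lil_rows_inplace(mat, rng):
    """Permute the rows of a lil matrix by reordering its per-row lists.

    Hypothetical helper: only references to the row lists are moved,
    the stored values themselves are not copied.
    """
    perm = rng.permutation(mat.shape[0])
    # rows[i] holds the column indices of row i, data[i] the values.
    # Both must be reordered with the same permutation.
    mat.rows[:] = mat.rows[perm]
    mat.data[:] = mat.data[perm]
    return perm


dense = np.array([[1, 0, 0],
                  [0, 2, 0],
                  [0, 0, 3],
                  [4, 0, 5]])
mat = sparse.lil_matrix(dense)
rng = np.random.default_rng(0)
perm = shuffle_lil_rows_inplace(mat, rng)
```

After the call, row i of `mat` is what used to be row `perm[i]`. The fancy-indexing step still builds a temporary array of references (one pointer per row), but that is small compared with the matrix contents.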

Tracing through the scikit-learn shuffle, I see it just comes down to matrix[index,:] (where index is a sampling without replacement). That's the same as in the CSR link. For what it's worth, CSR indexing actually uses matrix multiplication, using a specially constructed 'extractor' matrix.
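As a rough illustration of both points, the sketch below shuffles a small CSR matrix with `matrix[index, :]` and then reproduces the same result with an explicit permutation matrix. The `extractor` built here is my own reconstruction of the idea, not scipy's internal code, and the variable names are arbitrary.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = np.arange(12).reshape(4, 3)
csr = sparse.csr_matrix(dense)

# Sampling without replacement: a random permutation of the row numbers.
index = rng.permutation(csr.shape[0])

# Row shuffle by fancy indexing; this returns a new matrix.
shuffled = csr[index, :]

# The same shuffle as a matrix product: row i of the extractor has a
# single 1 in column index[i], so (extractor @ csr)[i] == csr[index[i]].
n = csr.shape[0]
extractor = sparse.csr_matrix(
    (np.ones(n, dtype=csr.dtype), (np.arange(n), index)), shape=(n, n)
)
via_matmul = extractor @ csr
```

Either way the original `csr` is left untouched and a second matrix is allocated, which is the copy the question is trying to avoid.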

Shuffling lists is relatively efficient, in-place or not, since it just involves creating a new list of pointers/references to the row lists. Row shuffle of a dense numpy array requires copying all the data. It can be done in compiled code, but it still requires enough buffer space for a whole copy.
