
I'm using sparse to construct, store, and read a large sparse matrix. I'd like to use Dask arrays to use its blocked algorithms features.

Here's a simplified version of what I'm trying to do:

import os

import dask.array as da
import sparse

file_path = './myfile.npz'
if os.path.isfile(file_path):
    # Load file with sparse matrix
    X_sparse = sparse.load_npz(file_path)
else:
    # All matrix elements are initially equal to 0
    coords, data = [], []
    X_sparse = sparse.COO(coords, data, shape=(88506, 1440000))
    # Create file for later retrieval
    sparse.save_npz(file_path, X_sparse)

# Create Dask array from matrix to allow usage of blocked algorithms
X = da.from_array(X_sparse, chunks='auto').map_blocks(sparse.COO)

Unfortunately, the code above throws the following error when compute() is called on X: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method. However, I cannot densify the matrix in memory: the dense array would be roughly 88506 × 1440000 × 8 bytes ≈ 1 TB.

Any ideas on how to accomplish this?

Diego Castillo
  • You can't. `Dask` works with numpy arrays. A scipy sparse matrix is not a numpy array. Its attributes may be arrays; for example, the `coo` format uses three arrays, storing the nonzero element data and indices. But `Dask` knows nothing about those. – hpaulj Feb 20 '19 at 17:43
  • Actually, you can: https://docs.dask.org/en/latest/array-sparse.html – MRocklin Feb 21 '19 at 02:07
  • I recommend providing an [MCVE](https://stackoverflow.com/help/mcve) – MRocklin Feb 21 '19 at 02:08

2 Answers


You can have a look at the following issue: https://github.com/dask/dask/issues/4523

Basically, sparse intentionally prevents automatic conversion into a dense matrix. However, you can override this behavior by setting the environment variable SPARSE_AUTO_DENSIFY=1. Nevertheless, this only suppresses the error; it does not accomplish your main goal.
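For example (a minimal sketch; as far as I can tell, sparse reads this variable once at import time, so it must be set before sparse is imported):

import os
os.environ["SPARSE_AUTO_DENSIFY"] = "1"  # must be set before `import sparse`

import sparse  # sparse now allows automatic densification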

What you would need to do is split your file into multiple *.npz sparse matrices, load these with sparse in a delayed manner (see dask.delayed), and concatenate them into one large sparse Dask array, as sketched below.
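A minimal sketch of that approach, assuming the matrix has been saved as a series of row blocks chunk_0.npz, chunk_1.npz, ... (the file names, block shape, and dtype here are hypothetical):

import dask
import dask.array as da
import sparse

# Hypothetical row blocks of the full (88506, 1440000) matrix
chunk_files = ['chunk_0.npz', 'chunk_1.npz', 'chunk_2.npz']
chunk_shape = (29502, 1440000)

# Defer each load; no file is read until compute() is called
lazy_chunks = [dask.delayed(sparse.load_npz)(f) for f in chunk_files]

# Wrap each delayed load in a Dask array; meta tells Dask the chunks are COO
blocks = [
    da.from_delayed(c, shape=chunk_shape, dtype='float64',
                    meta=sparse.COO([], [], shape=chunk_shape))
    for c in lazy_chunks
]

# One large sparse Dask array, stacked along the first axis
X = da.concatenate(blocks, axis=0)

The meta argument keeps Dask from assuming NumPy chunks, so downstream operations stay sparse.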

I will have to implement something like this in the near future. IMHO this should be supported by Dask more natively...

Hoeze
  • Did you ever get around to implementing that feature for loading sparse matrices? I'm trying to save a Dask array with sparse.COO chunk types and this still seems like the only option. – Joe Mar 17 '21 at 21:17
  • No, unfortunately I did not. I ended up using dataframes for this use case. However, TileDB may be worth a look for you; it simplifies storing sparse data a lot. – Hoeze Mar 18 '21 at 11:28
  • Thanks, I'll have a look! – Joe Mar 18 '21 at 15:30

dask.array.from_array now supports COO and GCXS sparse arrays natively.

Using dask version '2022.01.0':

In [18]: # All matrix elements are initially equal to 0
    ...: coords, data = [], []
    ...: X_sparse = sparse.COO(coords, data, shape=(88506, 1440000))
    ...:
    ...: # Create Dask array from matrix to allow usage of blocked algorithms
    ...: X = dask.array.from_array(X_sparse, chunks="auto").map_blocks(sparse.COO)

In [19]: X
Out[19]: dask.array<COO, shape=(88506, 1440000), dtype=float64, chunksize=(4023, 4000), chunktype=sparse.COO>
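As a quick check (a hypothetical continuation of the session above, mirroring the example in the dask docs), blocked operations run on the sparse chunks without densifying:

In [20]: X.sum(axis=0)[:100].compute()
Out[20]: <COO: shape=(100,), dtype=float64, nnz=0, fill_value=0.0>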

See the dask docs on Sparse Arrays for more information.

Support for sparse arrays was first added back in 2017; stability and API support have been steadily improving ever since.

Michael Delgado