
I am trying to write a large dask array (46 GB with 124–370 MB chunks) to a zarr file using dask. If my dask array were named dask_data, then a simple dask_data.to_zarr("my_zarr.zarr") would work. But from what I understand, this is a synchronous, CPU-bound process.

What I would like to do is to use parallelism with much of the work allocated to a Quadro GV100 GPU. I tried to convert the numpy.ndarray to a cupy.ndarray via dask_data_cupy = dask_data.map_blocks(cupy.asarray) and write this out to a zarr file, but I receive:

ValueError: object __array__ method not producing an array (and frankly, I do not see a performance boost either).
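Roughly, this is what I am attempting (a minimal sketch; the random array below is just a stand-in for my real data):

```python
import cupy
import dask.array as da

# Stand-in for my real ~46 GB array with ~128 MB chunks.
dask_data = da.random.random((80_000, 80_000), chunks=(4_000, 4_000))

# Move each chunk to the GPU, then try to write to zarr.
dask_data_cupy = dask_data.map_blocks(cupy.asarray)
dask_data_cupy.to_zarr("my_zarr.zarr")  # raises the ValueError above
```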

How could I go about using a GPU to parallelize writing a dask array to a zarr file?

Thanks!

irahorecka

2 Answers


But from what I understand, this is a synchronous, CPU-bound process.

This is probably not true; your bottleneck is likely the storage device. In any case, each chunk is written to a separate file, in parallel across threads and/or processes (depending on your setup). That is the whole point of zarr's design: an application can interact with each chunk independently.
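For example, a plain CPU-side write is already parallel. A minimal sketch (the array and store names below are placeholders, and the explicit scheduler choice is optional):

```python
import dask
import dask.array as da

# Placeholder array; each ~128 MB chunk is written to the store independently.
dask_data = da.random.random((80_000, 80_000), chunks=(4_000, 4_000))

# The default scheduler already writes chunks in parallel across threads.
dask_data.to_zarr("my_zarr.zarr")

# Or pick the scheduler explicitly, e.g. processes instead of threads:
with dask.config.set(scheduler="processes"):
    dask_data.to_zarr("my_zarr_processes.zarr")
```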

You may be CPU-bound if you choose to use various encodings or compression; however, these do not necessarily lend themselves to GPU operation.
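As a sketch of where that CPU cost comes from, assuming your dask/zarr versions forward creation keywords to zarr's array creation (the compressor settings here are illustrative, not a recommendation):

```python
from numcodecs import Blosc

# A heavier compressor means more CPU work per chunk at write time.
# The compressor kwarg is passed through to zarr when the array is created.
dask_data.to_zarr(
    "my_zarr_zstd.zarr",
    compressor=Blosc(cname="zstd", clevel=9, shuffle=Blosc.BITSHUFFLE),
)
```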

In short, unless your data is already generated on the GPU, I would be surprised if transferring it to the GPU for processing before writing it to files is worthwhile. If there were a function to directly read/write cupy arrays to zarr, and you were also processing on the GPU, it would be different - but I don't believe there is.

mdurant

I think you would need to add a .map_blocks(cupy.asnumpy) step (moving the chunks back to host memory) before calling to_zarr.

CuPy tries to make sure that the user intended to do a device-to-host transfer (as these can be expensive), so it intentionally raises an error when numpy.asarray is called on a CuPy array (as would happen during this write).
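A minimal sketch of that, assuming dask_data is the NumPy-backed dask array from the question (the random array here is a placeholder):

```python
import cupy
import dask.array as da

# Placeholder for the question's array.
dask_data = da.random.random((80_000, 80_000), chunks=(4_000, 4_000))

# Host -> device: back each chunk with a CuPy array.
dask_data_cupy = dask_data.map_blocks(cupy.asarray)

# ... any GPU-side processing would go here ...

# Device -> host: convert chunks back to NumPy before zarr sees them.
dask_data_host = dask_data_cupy.map_blocks(cupy.asnumpy)
dask_data_host.to_zarr("my_zarr.zarr")
```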

jakirkham