
I am trying to store a dask array in a zarr file.

I have managed to do it when the dask array has a defined shape.


import dask
import dask.array as da
import numpy as np
from tempfile import TemporaryDirectory
import zarr


np_array = np.random.randint(1, 10, size=1000)
array = da.from_array(np_array)

with TemporaryDirectory() as tmpdir:
    delayed = da.to_zarr(array, url=tmpdir,
                         compute=False, component='/data')
    dask.compute(delayed)

    z_object = zarr.open_group(tmpdir, mode='r')

    assert np.all(np_array == z_object.data[:])

However, if I first perform an operation on the dask array that changes its length (such as the boolean-mask indexing below), the shape is lost and zarr complains about the NaNs in the shape.

# this will fail

np_array = np.random.randint(1, 10, size=1000)
array = da.from_array(np_array)

array = array[array > 5]

with TemporaryDirectory() as tmpdir:
    delayed = da.to_zarr(array, url=tmpdir,
                         compute=False, component='/data')
    dask.compute(delayed)

    z_object = zarr.open_group(tmpdir, mode='r')

    assert np.all(np_array[np_array > 5] == z_object.data[:])

This is the raised error:

Traceback (most recent call last):
  File "/home/peio/devel/variation/variation6/variation6/tests/test_zarr.py", line 38, in <module>
    without_shape()
  File "/home/peio/devel/variation/variation6/variation6/tests/test_zarr.py", line 29, in without_shape
    compute=False, component='/data')
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/dask/array/core.py", line 2808, in to_zarr
    **kwargs
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/creation.py", line 120, in create
    chunk_store=chunk_store, filters=filters, object_codec=object_codec)
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/storage.py", line 323, in init_array
    object_codec=object_codec)
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/storage.py", line 343, in _init_array_metadata
    shape = normalize_shape(shape) + dtype.shape
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/util.py", line 58, in normalize_shape
    shape = tuple(int(s) for s in shape)
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/util.py", line 58, in <genexpr>
    shape = tuple(int(s) for s in shape)
ValueError: cannot convert float NaN to integer

Is there a way to store a dask array with an unknown shape in a zarr file?

Thanks in advance!

  • I believe the usual workflow is to replace values you don't want with NaN instead of removing them from your working array via masking (see `da.where` instead). This way chunk sizes and shape are preserved. – djhoese Jul 23 '19 at 14:01
  • Or you could fully compute your dask array to a numpy array and then save it. – djhoese Jul 23 '19 at 14:02
  • Our arrays are like 40M rows per 1K, so we can not compute before saving to zarr. – Peio Ziarsolo Jul 24 '19 at 06:18

1 Answer


Zarr expects chunk shapes to be uniform and known beforehand. Dask currently facilitates this by rechunking the array to be uniform. However, array[array > 5] creates a Dask Array with unknown chunk shapes, because the number of elements that pass the mask is not known until the data is computed. So there is no way to rechunk it to be uniform beforehand, as the needed information is not present. That said, the error message could explain this better.

One could work around this by using Dask operations that return known chunk shapes (as David suggests). Alternatively, one could determine the chunk shapes before storing (at the cost of computing). We could also discuss extending Zarr to handle this case, but that is a longer-term solution.

jakirkham
  • There are two points that we do not completely understand. 1- Do all the chunks written to disk have to have the same size? 2- Do we have to know beforehand the final total size of the array that we are going to store? – Peio Ziarsolo Jul 24 '19 at 06:38
  • All chunks should have a known size before writing to disk. They should also be uniform, but Dask handles that for you. – jakirkham Jul 24 '19 at 16:00
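
To see what "known size" means in practice, the following minimal sketch compares the chunk metadata of a boolean-masked array with that of an array built with `da.where`, as suggested in the first comment on the question. The sentinel value `-1` is a hypothetical placeholder chosen here because the data are integers (NaN would need a float dtype), and the chunk size 250 is arbitrary.

```python
import dask.array as da
import numpy as np

np_array = np.random.randint(1, 10, size=1000)
array = da.from_array(np_array, chunks=250)   # 250 is an arbitrary chunk size

# Boolean-mask indexing: the length of every chunk depends on the data,
# so Dask records it as NaN until something is computed.
masked = array[array > 5]
print(masked.chunks)
print(masked.shape)

# da.where keeps every element and replaces the unwanted ones with a
# sentinel (-1 is a hypothetical choice), so shape and chunks stay known.
kept = da.where(array > 5, array, -1)
print(kept.chunks)
print(kept.shape)
```

An array like `kept` can be passed to `da.to_zarr` directly, at the price of storing the sentinel values and filtering them out when reading.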