Suppose I generate an array with a shape that depends on some computation, such as:

>>> import dask.array as da
>>> a = da.random.normal(size=(int(1e6), 10))
>>> a = a[a.mean(axis=1) > 0]
>>> a.shape
(nan, 10)
>>> a.chunks
((nan, nan, nan, nan, nan), (10,))
>>> a.chunksize
(nan, 10)

The nan values are expected. When I persist the result of the computation on the dask workers, I would expect this missing metadata to be retrievable, but apparently this is not the case:

>>> a_persisted = a.persist()
>>> a_persisted.chunks
((nan, nan, nan, nan, nan), (10,))
>>> a_persisted.chunksize
(nan, 10)
>>> a_persisted.shape
(nan, 10)

If I try to force a rechunk I get:

>>> a_persisted.rechunk("auto")
Traceback (most recent call last):
  File "<ipython-input-26-31162de022a0>", line 1, in <module>
    a_persisted.rechunk("auto")
  File "/home/ogrisel/code/dask/dask/array/core.py", line 1647, in rechunk
    return rechunk(self, chunks, threshold, block_size_limit)
  File "/home/ogrisel/code/dask/dask/array/rechunk.py", line 226, in rechunk
    dtype=x.dtype, previous_chunks=x.chunks)
  File "/home/ogrisel/code/dask/dask/array/core.py", line 1872, in normalize_chunks
    chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
  File "/home/ogrisel/code/dask/dask/array/core.py", line 1949, in auto_chunks
    raise ValueError("Can not perform automatic rechunking with unknown "
ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes

What is the idiomatic way to update the metadata of my array with the actual size of the chunks that have already been computed on the worker?

I can compute them very cheaply with:

>>> dask.compute([chunk.shape for chunk in a_persisted.to_delayed().ravel()])
([(100108, 10), (99944, 10), (99545, 10), (99826, 10), (100099, 10)],)

My question is: how do I get a new dask array backed by the same chunks, but with informative .shape, .chunks and .chunksize attributes (no nans)?
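For context, here is a sketch of the kind of thing I have in mind: measure the chunk shapes as above, then wrap the same persisted graph in a new Array with those known chunks. This assumes the `da.Array(dask, name, chunks, dtype)` constructor is fair game for user code; I haven't verified that it is supported public API.

```python
import dask
import dask.array as da
import numpy as np

a = da.random.normal(size=(int(1e5), 10))
a = a[a.mean(axis=1) > 0]
a_persisted = a.persist()

# Cheap: only the shape of each already-computed chunk is transferred.
(chunk_shapes,) = dask.compute(
    [chunk.shape for chunk in a_persisted.to_delayed().ravel()]
)

# Rebuild the chunks tuple from the measured shapes and wrap the same
# graph and name in a new Array object with known chunk sizes.
known_chunks = (tuple(s[0] for s in chunk_shapes), a_persisted.chunks[1])
fixed = da.Array(
    dask=a_persisted.dask,
    name=a_persisted.name,
    chunks=known_chunks,
    dtype=a_persisted.dtype,
)
print(fixed.shape)   # no nans
print(fixed.chunks)  # no nans
```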

>>> dask.__version__
'1.1.0+9.gb1fef05'
ogrisel

2 Answers


Looks like this will soon be solved internally in dask array (https://github.com/dask/dask/issues/3293). Until then, here is the workaround I use:

import numpy as np
import dask.array as da
import dask.dataframe as dd

a = da.random.normal(size=(int(1e6), 10))
a = a[a.mean(axis=1) > 0]
# Round-tripping through a dask dataframe with lengths=True computes the
# partition lengths, so the array comes back with known chunk sizes.
a = dd.from_dask_array(a, columns=np.arange(a.shape[1]))
a = a.to_dask_array(lengths=True).persist()
print(a.chunks)
print(a.shape)

((100068, 100157, 100279, 100446, 99706), (10,))
(500656, 10)
Rowan_Gaffney

There isn't a good solution to this today, but there could be. I recommend raising an issue if one doesn't exist already. This is a commonly requested feature.

Edit: this is tracked here: https://github.com/dask/dask/issues/3293

MRocklin