Suppose I generate an array with a shape that depends on some computation, such as:
>>> import dask.array as da
>>> a = da.random.normal(size=(int(1e6), 10))
>>> a = a[a.mean(axis=1) > 0]
>>> a.shape
(nan, 10)
>>> a.chunks
((nan, nan, nan, nan, nan), (10,))
>>> a.chunksize
(nan, 10)
The nan values are expected. However, when I persist the result of the computation on the dask workers, I would assume that this missing metadata could be retrieved, but apparently this is not the case:
>>> a_persisted = a.persist()
>>> a_persisted.chunks
((nan, nan, nan, nan, nan), (10,))
>>> a_persisted.chunksize
(nan, 10)
>>> a_persisted.shape
(nan, 10)
If I try to force a rechunk, I get the following error:
>>> a_persisted.rechunk("auto")
Traceback (most recent call last):
File "<ipython-input-26-31162de022a0>", line 1, in <module>
a_persisted.rechunk("auto")
File "/home/ogrisel/code/dask/dask/array/core.py", line 1647, in rechunk
return rechunk(self, chunks, threshold, block_size_limit)
File "/home/ogrisel/code/dask/dask/array/rechunk.py", line 226, in rechunk
dtype=x.dtype, previous_chunks=x.chunks)
File "/home/ogrisel/code/dask/dask/array/core.py", line 1872, in normalize_chunks
chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
File "/home/ogrisel/code/dask/dask/array/core.py", line 1949, in auto_chunks
raise ValueError("Can not perform automatic rechunking with unknown "
ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes
What is the idiomatic way to update the metadata of my array with the actual size of the chunks that have already been computed on the worker?
I can compute them very cheaply with:
>>> dask.compute([chunk.shape for chunk in a_persisted.to_delayed().ravel()])
([(100108, 10), (99944, 10), (99545, 10), (99826, 10), (100099, 10)],)
My question is: how do I get a new dask array backed by the same chunks, but with informative .shape, .chunks and .chunksize attributes (with no nans)?
>>> dask.__version__
'1.1.0+9.gb1fef05'