Is it possible to create a dask array from a delayed value by specifying its shape with another delayed value?

My algorithm won't give me the shape of the array until pretty late in the computation.

Eventually, I will create some blocks whose shapes are given by intermediate results of my computation, and then call da.concatenate on all of them (or da.block, if it were more flexible).

I don't think it is too detrimental if I can't, but it would be cool if I could.

Sample code

from dask import delayed
from dask import array as da
import numpy as np

n_shape = (3, 3)
shape = delayed(n_shape, nout=2)
d_shape = (delayed(n_shape[0]), delayed(n_shape[1]))
n = delayed(np.zeros)(n_shape, dtype=float)  # np.float was removed in NumPy 1.24


# this doesn't work
# da.from_delayed(n, shape=shape, dtype=float)
# this doesn't work either, but I think it goes a little deeper
# into the function call
da.from_delayed(n, shape=d_shape, dtype=float)
hmaarrfk
  • I would agree that this would be a cool feature but I don't think it's possible. I *think* what you're trying to do is make the second array an observer of the first and this is not something I have seen within dask. It would be better to wrap the first array in an observable extension, then call the second array once the first has been populated. – mproffitt Jul 02 '18 at 22:37
  • You are correct. I think maybe what I'll do is wrap the concatenation of the result in another arbitrary delayed object. – hmaarrfk Jul 02 '18 at 22:43
  • (pressed enter too soon) It doesn't make sense to access the resulting array as I don't really know its bounds before the final computation. – hmaarrfk Jul 02 '18 at 22:43
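
The workaround mentioned in the comments above, wrapping the concatenation in another delayed object, could look roughly like the sketch below. The names make_block and concat_blocks and the block widths are hypothetical, chosen only for illustration; the point is that the result stays a plain Delayed object, so no shape has to be declared up front.

```python
import numpy as np
from dask import delayed


@delayed
def make_block(n_cols):
    # Hypothetical stand-in for a computation whose output width
    # is only known once it has run.
    return np.zeros((3, n_cols))


@delayed
def concat_blocks(blocks):
    # By the time this runs, `blocks` is a list of concrete NumPy arrays,
    # so their shapes are known.
    return np.concatenate(blocks, axis=1)


blocks = [make_block(n) for n in (2, 3, 4)]
result = concat_blocks(blocks)  # still lazy, no shape required
out = result.compute()
print(out.shape)
```

The trade-off is that the final array is a single NumPy array produced by one task, not a chunked dask array.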

1 Answer

You cannot provide a delayed shape, but you can state that the shape is unknown by using np.nan as the value for any dimension you don't know.

Example

import random
import numpy as np
import dask
import dask.array as da

@dask.delayed
def f():
    return np.ones((5, random.randint(10, 20)))  # a 5 x ? array

values = [f() for _ in range(5)]
arrays = [da.from_delayed(v, shape=(5, np.nan), dtype=float) for v in values]
x = da.concatenate(arrays, axis=1)

>>> x
dask.array<concatenate, shape=(5, nan), dtype=float64, chunksize=(5, nan)>

>>> x.shape
(5, nan)

>>> x.compute().shape
(5, 88)
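
If concrete sizes are needed later (for example, to rechunk), one option is to resolve the unknown chunks once the blocks can be computed. The following is a minimal sketch, assuming a dask version that provides Array.compute_chunk_sizes(); the helper make_block and the widths are made up for illustration. Note that this call triggers computation of the blocks in order to measure them.

```python
import numpy as np
import dask
import dask.array as da


@dask.delayed
def make_block(n_cols):
    # Hypothetical block whose width is only known at compute time.
    return np.ones((5, n_cols))


widths = [10, 12, 15]
arrays = [
    da.from_delayed(make_block(w), shape=(5, np.nan), dtype=float)
    for w in widths
]
x = da.concatenate(arrays, axis=1)  # shape is (5, nan) at this point

# Computes the blocks to discover their sizes and updates x in place.
x.compute_chunk_sizes()
print(x.shape)

# With known chunk sizes, rechunking is possible again.
y = x.rechunk((5, 10))
```

Whether the extra pass over the data is acceptable depends on how expensive the blocks are to produce.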

Docs

See http://dask.pydata.org/en/latest/array-chunks.html#unknown-chunks

MRocklin
  • This gives me the right computation graph, but then gets stuck because my arrays don't share the same chunk size along 2 dimensions. I think this is an issue, and I'll post it on GitHub with an example. Many claps @MRocklin – hmaarrfk Jul 02 '18 at 23:22
  • The nan solution worked for me. I'm distributed-processing large arrays with slightly different shapes, but rechunking of the combined array in a user-defined way cannot work anymore because of the shape uncertainty. – QGent Feb 17 '19 at 08:47