
I would like to assign multiple small Dask arrays into parts of one large Dask array. My problem is similar to the one addressed in this post, except my small arrays have variable shapes. It is also similar to the one addressed in another post, except I would like to assign each small array to a 2D location that isn't sequential the way it would be in a for loop, which means operations like `stack` and `concatenate` don't play nicely either.

import dask
import dask.array as da

# Initialize large array
big_array = da.zeros([5, 6])  # I know this shape ahead of time

# Mock little arrays
aa_shape = dask.delayed((2,3))  # I don't know this shape ahead of time
aa = dask.delayed(1 * da.ones(aa_shape))
aa_loc = dask.delayed((slice(0,2), slice(0,3)))  # I don't know this location ahead of time
          
bb_shape = dask.delayed((3,3))
bb = dask.delayed(2 * da.ones(bb_shape))
bb_loc = dask.delayed((slice(0,3), slice(3,6)))

cc_shape = dask.delayed((3,3))
cc = dask.delayed(3 * da.ones(cc_shape))
cc_loc = dask.delayed((slice(2,5), slice(0,3)))

dd_shape = dask.delayed((2,3))
dd = dask.delayed(4 * da.ones(dd_shape))
dd_loc = dask.delayed((slice(3,5), slice(3,6)))

# Manually populate big array
big_array[aa_loc] = aa
big_array[bb_loc] = bb
big_array[cc_loc] = cc
big_array[dd_loc] = dd

big_array.compute()

Ideally, the above code would output a `big_array` that looks like:

array([[1., 1., 1., 2., 2., 2.],
       [1., 1., 1., 2., 2., 2.],
       [3., 3., 3., 2., 2., 2.],
       [3., 3., 3., 4., 4., 4.],
       [3., 3., 3., 4., 4., 4.]])

However, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [7], in <cell line: 21>()
     18 locs = [aa_loc, bb_loc, cc_loc, dd_loc]
     20 # Manually populate big array
---> 21 big_array[aa_loc] = aa
     22 big_array[bb_loc] = bb
     23 big_array[cc_loc] = cc

File ~/.conda-envs/daskenv202301/lib/python3.9/site-packages/dask/array/core.py:1893, in Array.__setitem__(self, key, value)
   1890 value = asanyarray(value)
   1892 out = "setitem-" + tokenize(self, key, value)
-> 1893 dsk = setitem_array(out, self, key, value)
   1895 meta = meta_from_array(self._meta)
   1896 if np.isscalar(meta):

File ~/.conda-envs/daskenv202301/lib/python3.9/site-packages/dask/array/slicing.py:1754, in setitem_array(out_name, array, indices, value)
   1752 array_shape = array.shape
   1753 value_shape = value.shape
-> 1754 value_ndim = len(value_shape)
   1756 # Reformat input indices
   1757 indices, implied_shape, reverse, implied_shape_positions = parse_assignment_indices(
   1758     indices, array_shape
   1759 )

File ~/.conda-envs/daskenv202301/lib/python3.9/site-packages/dask/delayed.py:591, in Delayed.__len__(self)
    589 def __len__(self):
    590     if self._length is None:
--> 591         raise TypeError("Delayed objects of unspecified length have no len()")
    592     return self._length

TypeError: Delayed objects of unspecified length have no len()

If I modify the code so that the little arrays are plain Dask arrays instead of `delayed` objects, the code runs successfully. Does anyone have suggestions on how to approach this? Thanks for the help!
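
For reference, here is a minimal sketch of that working variant, with the shapes and slice locations hard-coded to match the mock arrays above:

import dask.array as da

big_array = da.zeros([5, 6])

# Plain Dask arrays with concrete shapes, assigned at concrete slices
big_array[0:2, 0:3] = 1 * da.ones((2, 3))
big_array[0:3, 3:6] = 2 * da.ones((3, 3))
big_array[2:5, 0:3] = 3 * da.ones((3, 3))
big_array[3:5, 3:6] = 4 * da.ones((2, 3))

big_array.compute()  # produces the 5x6 array shown above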

rybchuk
  • Dask.array and dask.delayed are not interchangeable like this. You do need to use dask.array consistently throughout. Additionally, the dask scheduler does need to know where in the array you want to insert the data so it can assign the job to the correct worker. So having the indexes be the result of a dask operation like this won’t work. – Michael Delgado Feb 26 '23 at 19:01
  • I think your best bet would be to mask the whole array using a variable mask for the assignment, e.g. `big_array = da.where(aa_mask, aa, big_array)` (a sketch of this idea appears after these comments) – Michael Delgado Feb 26 '23 at 19:03
  • Thanks for the feedback @MichaelDelgado! For more context, I came across this problem while trying to develop a way to read one binary dataset using a Dask distributed scheduler. The [Dask docs](http://127.0.0.1:8889/lab?token=bb901ae38d0a4d4ad39b2b1c90a8760a5eb5b9df3efdfee6) have some suggestions using a combination of memory mapping and `dask.delayed`, which is why I'm using small `delayed` arrays in my toy problem above. I like your suggestion of using `.where`, but I think it would break my code when deployed for my end problem, because it would result in thousands of TB-sized arrays. – rybchuk Feb 27 '23 at 15:35
  • Hmm. You could try using a [sparse backend](https://docs.dask.org/en/stable/array-sparse.html#sparse-arrays) to collapse the insertions first and then do a single insertion into the dense array. Depends on how dense the final insertion would be I suppose? – Michael Delgado Feb 27 '23 at 16:33
  • The docs link you posted doesn’t work for me - looks like it’s going to localhost? But yeah, you can use them together, though you have to be careful to make sure the delayed objects get expanded to actual dask arrays, e.g. by submitting them as dask function args or converting them using `dask.array.from_delayed` (also sketched after these comments). See the [best practices guide](https://docs.dask.org/en/stable/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections) - you shouldn’t mix and match; the suggestion to use delayed is to convert the array into only delayed objects, not a mix. – Michael Delgado Feb 27 '23 at 16:57
  • Oh whoops, sorry about that, [this](https://docs.dask.org/en/stable/array-creation.html#memory-mapping) is that docs link. I didn't realize there was a "best practices" on this, thanks for pointing that my way! Also, I now think I'm asking Dask to handle a case that is overly general. If I strip out the `delayed` calls and force some simplifying assumptions (e.g., `aa.shape == bb.shape` and pre-sorting arrays), I can get `big_array` populated through [`dask.array.block`](https://docs.dask.org/en/stable/generated/dask.array.block.html) (see the sketch after these comments). Thanks for the help @MichaelDelgado! – rybchuk Feb 27 '23 at 17:37
  • Nice! Yeah that seems like a great approach. Feel free to post your solution - is probably something that would be of general interest. Cheers! – Michael Delgado Feb 27 '23 at 17:54
  • Absolutely! I'm going to post an update to the Pangeo forums with my code, and I'll link that post once it goes up. – rybchuk Feb 27 '23 at 22:46
  • Here's that update I mentioned earlier: https://discourse.pangeo.io/t/experience-with-yt-project-interface-or-amrex-formatted-data/3142/2?u=rybchuk – rybchuk Mar 08 '23 at 18:23
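
As a rough sketch of the `da.where` masking idea from the comments (the names here are illustrative; the point is that the mask can itself be a lazy Dask expression rather than a concrete index):

import dask.array as da

big_array = da.zeros([5, 6])

# Build the mask lazily from coordinate arrays instead of concrete slices;
# any Dask computation that yields a boolean array of the full shape works
rows = da.arange(5)[:, None]
cols = da.arange(6)[None, :]
aa_mask = (rows < 2) & (cols < 3)  # True over aa's target region
aa_vals = 1 * da.ones((5, 6))      # aa's value, broadcast to the full shape

big_array = da.where(aa_mask, aa_vals, big_array)
big_array.compute()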
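
Similarly, a sketch of the `dask.array.from_delayed` conversion mentioned above. Note that it requires the shape and dtype up front, so it only helps when the tile shape is known before compute time (`load_tile` is a hypothetical loader):

import dask
import dask.array as da
import numpy as np

@dask.delayed
def load_tile():
    # hypothetical stand-in for reading one tile from disk
    return np.ones((2, 3))

# Convert the delayed object into a real Dask array before assignment;
# shape and dtype must be declared here
aa = da.from_delayed(load_tile(), shape=(2, 3), dtype=float)

big_array = da.zeros([5, 6])
big_array[0:2, 0:3] = aa
big_array.compute()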
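
Finally, a sketch of the `dask.array.block` route described in the last comments, under the simplifying assumption of equally shaped, pre-sorted tiles (so the result here is 4x6 rather than the original 5x6):

import dask.array as da

# Four equally shaped tiles, already sorted into their grid positions
aa = 1 * da.ones((2, 3))
bb = 2 * da.ones((2, 3))
cc = 3 * da.ones((2, 3))
dd = 4 * da.ones((2, 3))

# Assemble the big array in one call, with no item assignment at all
big_array = da.block([[aa, bb],
                      [cc, dd]])
big_array.compute()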

0 Answers