I'm trying to accelerate my numpy code using dask. The following is part of my numpy code:

arr_1 = np.load('<arr1_path>.npy')
arr_2 = np.load('<arr2_path>.npy')
arr_3 = np.load('<arr3_path>.npy')

arr_1 = np.concatenate((arr_1, arr_2[:,:,np.newaxis]),axis = 2)
half = arr_1.shape[0]//2
arr_4 = arr_3[:half]
[r,c] = np.where(arr_4 == True)
[rn,cn] = np.where(arr_4 == False)

print(len(r))

This prints valid results and works fine. However, the following dask equivalent

arr_1 = da.from_zarr('<arr1_path>.zarr')
arr_2 = da.from_zarr('<arr2_path>.zarr')
arr_3 = da.from_zarr('<arr3_path>.zarr')

arr_1 = da.concatenate((arr_1, arr_2[:,:,np.newaxis]),axis = 2)
half = arr_1.shape[0]//2
arr_4 = arr_3[:half]
[r,c] = da.where(arr_4 == True)
[rn,cn] = da.where(arr_4 == False)

print(len(r)) # <----- Error: 'float' object cannot be interpreted as an integer

results in r as

dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>

and thus the above mentioned error. Since dask arrays are lazily evaluated, do I have to explicitly call compute() or similar somewhere? Or am I missing something basic? Any help will be appreciated.

F Baig
  • What version of zarr? See https://zarr.readthedocs.io/en/stable/release.html#release-2-11-1 – Josh Mar 17 '22 at 00:37
  • Installed it yesterday using `conda`, just checked it's the latest version 2.11.1. Also, I ran the same `numpy` code using `zarr.load()` instead of `np.load()` and it seems to be working fine. The issue I think is with `dask` instead of `zarr` – F Baig Mar 17 '22 at 01:06

1 Answer


The array you've constructed with da.where has unknown chunk sizes, which can happen whenever the size of an array depends on lazy computations that haven't yet been performed. Unknown values within shape or chunks are designated using np.nan rather than an integer, so len() receives a float where it expects an integer, which is why you see the error (the message was improved in the last few months and is now a ValueError). The solution is to use compute_chunk_sizes:

import numpy as np
import dask.array as da
x = da.from_array(np.random.randn(100), chunks=20)
y = x[x > 0]
# len(y) # ValueError: Cannot call len() on object with unknown chunk size.
y.compute_chunk_sizes() # modifies y in-place
len(y)
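
Applied to the da.where case from the question, the same idea might look like the sketch below. The small boolean mask is hypothetical stand-in data for arr_4, not the asker's actual arrays. It also shows two alternatives that may fit depending on what is needed: calling da.compute to get the index arrays as NumPy arrays, or summing the mask when only the count of True entries matters.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for arr_4: a small boolean mask split into chunks
mask_np = np.array([[True, False], [False, True], [True, True]])
mask = da.from_array(mask_np, chunks=(2, 2))

# Lazy: r and c have shape (nan,) until their chunk sizes are computed
r, c = da.where(mask)

# Option 1: compute the chunk sizes (runs the graph), then len() works
r.compute_chunk_sizes()
n_lazy = len(r)

# Option 2: materialize both index arrays as NumPy arrays in one pass
r_np, c_np = da.compute(r, c)

# Option 3: if only the count is needed, sum the mask instead
n_true = int(mask.sum().compute())

print(n_lazy, len(r_np), n_true)
```

Note that compute_chunk_sizes has to execute the computation to learn the sizes, so if the indices are needed in memory anyway, computing them directly avoids doing the work twice.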
scj13