I am trying to use dask for an embarrassingly parallel workload.
An example would be summing two large arrays element by element:
from time import sleep

x1 = ...
x2 = ...

def func(a, b):
    # simulate some per-call overhead plus work proportional to the block size
    sleep(0.001 + len(a) * 1e-5)
    return a + b + 1
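Called directly, a single invocation handles a whole block of data (a minimal illustration with NumPy arrays; the sizes are just placeholders):

import numpy as np

# one call to func handles an entire block
a = np.ones(10_000)
b = np.ones(10_000)
c = func(a, b)  # sleeps roughly 0.1 s and returns a + b + 1 for all 10,000 elements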
I found [1], which seemingly answers this question; however, my function isn't that expensive, and it can take an arbitrary amount of data to compute at once. I would really like to chunk the data to avoid OOM, and dask.array seems to do that. I managed to get it "working" by wrapping func with da.as_gufunc(signature="(),()->()", output_dtypes=float, vectorize=True) - however, that applies the function element-wise, and I need it applied per chunk.
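For reference, the gufunc wrapping looked roughly like this (a sketch; the array shapes and chunk sizes here are placeholders):

import dask.array as da

# wrapping func as a generalized ufunc; with signature "(),()->()" and
# vectorize=True the function ends up being invoked per element, not per chunk
func_gu = da.as_gufunc(signature="(),()->()", output_dtypes=float, vectorize=True)(func)

x1 = da.ones((4096, 4096), chunks=1024)
x2 = da.ones((4096, 4096), chunks=1024)
out = func_gu(x1, x2)  # lazy; each element triggers its own call to func when computed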
I also tried dask.bag.from_sequence, but that converts the arrays x1 and x2 into Python lists, which causes OOM (or horrible performance).
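That attempt was roughly along these lines (a sketch; exactly how I fed and partitioned the data is a guess at what I did):

import dask.bag as db

# from_sequence iterates the arrays into ordinary Python objects,
# which is what causes the OOM / terrible performance
bag = db.from_sequence(zip(x1, x2), npartitions=128)
result = bag.map(lambda pair: func(pair[0], pair[1]))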
The closest I got to what I want is this:
from dask.diagnostics import ProgressBar
import zarr
import dask.array as da

d1 = da.ones((1024, 1024, 1024, 128), chunks=64)
d2 = da.ones((1024, 1024, 1024, 128), chunks=64)
d3 = d1 + d2 + 1

with ProgressBar():
    d3.to_zarr("three.zarr")
However, I need to replace d1 + d2 + 1 with my function func.
[1] https://examples.dask.org/applications/embarrassingly-parallel.html