
I want to use dask.delayed in my work. Some of the functions I want to apply it to work with xarray-based data. I know xarray itself can use dask under the hood, and for dask itself it is recommended that you don't pass dask arrays to delayed functions. I want to make sure I don't hit a similar pitfall by passing xarray objects to delayed functions.

My current assumption is that I should avoid creating dask-backed xarray datasets, which I think is achieved by avoiding the chunks keyword and not using open_mfdataset. My question is whether this restriction is necessary and sufficient to avoid problems.

To give a concrete (but heavily simplified) example of the kind of code I have in mind: the idea is that dask.delayed handles the parallelism, not xarray.

import dask
import xarray as xr

@dask.delayed
def pow2(ds):
    return ds**2

delayed_ds = dask.delayed(xr.open_dataset)('myfile.nc')
delayed_ds_squared = pow2(delayed_ds)
(ds_squared,) = dask.compute(delayed_ds_squared)  # compute returns a tuple
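
In practice I would fan this pattern out over many files. A sketch of what I mean, with placeholder file names and the same pow2 as above:

files = ['file1.nc', 'file2.nc', 'file3.nc']  # placeholder names
delayed_squares = [pow2(dask.delayed(xr.open_dataset)(f)) for f in files]
squares = dask.compute(*delayed_squares)  # tuple of in-memory Datasets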
Caleb
1 Answer


As long as your xarray dataset is not itself backed by dask, it can safely be passed as an argument to a dask.delayed function.
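
If you want to double-check that a dataset is not dask-backed before handing it off, Dataset.chunks is an empty mapping for purely numpy-backed data. A quick sketch ('myfile.nc' is a placeholder path):

import xarray as xr

ds = xr.open_dataset('myfile.nc')          # no chunks= kwarg: numpy-backed
assert not ds.chunks                       # empty mapping: fine to pass to delayed

ds_dask = xr.open_dataset('myfile.nc', chunks={})  # dask-backed
assert ds_dask.chunks                      # non-empty: don't pass this to delayed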

The only pitfall is scheduler choice. compute() on a dask.delayed object uses the threaded scheduler by default (unless a distributed.Client has been instantiated, in which case that client takes over). Xarray data, being numpy-based, largely releases the GIL, so it benefits heavily from multithreading; avoid switching to the multiprocessing scheduler, which would pickle each dataset between processes. Stick with the default, pass compute(scheduler="threads") explicitly, or use dask.distributed.
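For example, a sketch reusing the variables from the question (the local Client is a stand-in for whatever cluster you actually deploy):

import dask
from dask.distributed import Client

# Pin the threaded scheduler explicitly for a single compute call:
(ds_squared,) = dask.compute(delayed_ds_squared, scheduler="threads")

# Or instantiate a distributed client, which becomes the default scheduler:
client = Client()  # local cluster here; a k8s-backed cluster works the same way
(ds_squared,) = dask.compute(delayed_ds_squared)
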

crusaderky
  • I actually might be using dask with a k8s backend. I don't think the scheduler will be a concern; each worker would be pretty independent, I'd think. – Caleb Jul 24 '23 at 22:08