Parallel computation with Dask and Xarray

Question

I have the following function

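import dask

@dask.delayed
def load_ds(p):
    import xarray as xr
    # Lazily open every file matching p as a single dataset
    multi_file_dataset = xr.open_mfdataset(p, combine='by_coords', concat_dim="time", parallel=True)
    # Time mean; with dask-backed data this is still a lazy result
    mean = multi_file_dataset['tas'].mean(dim='time')
    return mean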

which opens a set of NetCDF files (identified by path p) and calculates the mean value over time.

I'm trying to run the function in parallel over two different paths (i.e. two datasets):

results = []
result1 = dask.delayed(load_ds)(path1)
results.append(result1)
result2 = dask.delayed(load_ds)(path2)
results.append(result2)
   
results = dask.compute(*results)

I've also tried

results = []
result1 = dask.delayed(load_ds)(path1)
results.append(result1)
result2 = dask.delayed(load_ds)(path2)
results.append(result2)
  
futures = dask.persist(*results)
results = dask.compute(*futures)

But I noticed that execution only starts when I try to retrieve the results:

print(results[0].values)

And again when I retrieve the second one:

print(results[1].values)

What's wrong? Is there a way to retrieve the results object just once?

Running [lazily](https://en.wikipedia.org/wiki/Lazy_evaluation) is the whole point of [`delayed`](https://docs.dask.org/en/latest/delayed.html), so nothing is wrong. What you need is to pass `results` itself as the argument to a `delayed`-wrapped function. – keepAlive, Feb 09 '21 at 13:35
Is there a way to run the function in parallel over the two datasets? – Fab, Feb 09 '21 at 13:39

keepAlive · Accepted Answer · 2021-02-09T14:23:28.400

Given what you have done so far, what about:

delayed_task = dask.delayed(
    lambda L: (L[0].values, L[1].values)
)(results)

And "later",

tup = delayed_task.compute()
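
One likely reason the work only starts at `.values`: `load_ds` returns an object that is itself backed by lazy dask arrays, so computing the delayed call runs the function but hands back a result that is still lazy. The sketch below illustrates that effect with a plain `dask.array` standing in for the xarray data. The helper `lazy_time_mean` and the random data are hypothetical and not part of the question.

```python
import dask
import dask.array as da
import numpy as np

def lazy_time_mean(seed):
    # Hypothetical stand-in for open_mfdataset(p)['tas'].mean(dim='time'):
    # builds a chunked (time, lat, lon) array and returns its lazy time mean.
    rng = np.random.default_rng(seed)
    data = da.from_array(rng.random((10, 4, 5)), chunks=(5, 4, 5))
    return data.mean(axis=0)

# Wrapped in delayed: compute() runs the function, but what comes back
# is the dask array the function returned, which is still lazy.
(still_lazy,) = dask.compute(dask.delayed(lazy_time_mean)(0))

# Without delayed: a single compute() call evaluates both means together.
mean1, mean2 = dask.compute(lazy_time_mean(0), lazy_time_mean(1))
```

Dask-backed xarray objects can be passed to `dask.compute` in the same way, so in the question's setting it may be enough to drop `@dask.delayed` and call `dask.compute` once on the two returned means.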

edited Feb 09 '21 at 14:23

answered Feb 09 '21 at 13:47


@Fab see edit. What does `print(tup)` return? – keepAlive Feb 09 '21 at 14:11
Computation does not start... No tasks in the dask dashboard – Fab Feb 09 '21 at 14:13
I was able to test it. It seems OK! Is this approach better than, or equivalent to, running `client.submit(load_ds, path2)` and then `results = client.gather(futures)`? – Fab Feb 09 '21 at 14:17
@Fab Actually, `~.compute` is *synchronous*, [meaning that it blocks the interpreter until it completes](https://distributed.dask.org/en/latest/client.html#dask). So it depends on whether you want to block until the result is returned. Put differently, I would go for asynchronous techniques such as [`~.gather`](https://distributed.dask.org/en/latest/client.html#async-await-operation). – keepAlive Feb 09 '21 at 14:22
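
For comparison, here is a minimal sketch of the `client.submit` / `client.gather` pattern mentioned in the comments. It assumes `dask.distributed` is installed and uses an in-process cluster so that it runs without extra setup. The function `square` is a hypothetical stand-in for `load_ds`. With a regular (non-async) client, `submit` returns a future right away, while `gather` waits for the results.

```python
from dask.distributed import Client

def square(x):
    # Hypothetical stand-in for load_ds(path)
    return x * x

# In-process cluster for illustration only (an assumption); a real
# deployment would typically connect to an existing scheduler.
client = Client(processes=False, n_workers=1, threads_per_worker=2,
                dashboard_address=None)

# submit() schedules the calls and returns futures immediately.
futures = [client.submit(square, n) for n in (3, 4)]

# gather() collects the results, waiting until both futures are done.
results = client.gather(futures)
client.close()
```

Whether this is preferable to a single `dask.compute` call depends mainly on whether other work should proceed while the tasks run.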
