I have the following function:

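import dask

@dask.delayed
def load_ds(p):
    import xarray as xr
    # Open all files matching p as one lazy, dask-backed dataset
    multi_file_dataset = xr.open_mfdataset(p, combine='by_coords', concat_dim="time", parallel=True)
    # Time mean of 'tas'; this is still a lazy object at this point
    mean = multi_file_dataset['tas'].mean(dim='time')
    return mean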

which opens a set of NetCDF files (identified by path p) and calculates the mean value over time.

I'm trying to run the function in parallel over two different paths (= datasets):

results = []
result1 = dask.delayed(load_ds)(path1)
results.append(result1)
result2 = dask.delayed(load_ds)(path2)
results.append(result2)
   
results = dask.compute(*results)

I've also tried

results = []
result1 = dask.delayed(load_ds)(path1)
results.append(result1)
result2 = dask.delayed(load_ds)(path2)
results.append(result2)
  
futures = dask.persist(*results)
results = dask.compute(*futures)

But I noticed that the execution actually starts only when I try to retrieve the results:

 print(results[0].values)

And again when I retrieve the second one:

 print(results[1].values)

What's wrong? Is there a way to retrieve the results object just once?

Fab
  • Running [lazily](https://en.wikipedia.org/wiki/Lazy_evaluation) is the whole principle of [`delayed`](https://docs.dask.org/en/latest/delayed.html), so nothing is wrong. It is `results` itself that must be passed as the argument of a `delayed`-decorated function. – keepAlive Feb 09 '21 at 13:35
  • Is there a way to run the function in parallel over the two datasets? – Fab Feb 09 '21 at 13:39

1 Answer

Given what you have done so far, what about:

delayed_task = dask.delayed(
    lambda L: (L[0].values, L[1].values)
)(results)

And "later",

tup = delayed_task.compute()
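
As for why the work only seems to start at `.values`: the delayed function returns an object that is itself lazy, so the first `dask.compute` hands back lazy results. The sketch below illustrates this with a hypothetical `lazy_mean` helper built on `dask.array` (an assumption standing in for `load_ds`, which returns a dask-backed xarray object). It also shows one possible alternative: dropping the `delayed` wrapper so that a single `dask.compute` call evaluates both graphs in one pass.

```python
import dask
import dask.array as da
import numpy as np


def lazy_mean(fill):
    # Hypothetical stand-in for load_ds: builds a chunked array and
    # returns its mean along axis 0 as a *lazy* dask array.
    arr = da.full((4, 3), fill, chunks=(2, 3))
    return arr.mean(axis=0)


# Case 1: wrapping in delayed, as in the question.
wrapped = [dask.delayed(lazy_mean)(1.0), dask.delayed(lazy_mean)(2.0)]
first_pass = dask.compute(*wrapped)
# compute() only ran the function bodies; the returned arrays are still lazy.
still_lazy = [isinstance(r, da.Array) for r in first_pass]

# Case 2: no delayed wrapper. The lazy arrays are dask collections already,
# so one compute() call evaluates both of them together.
means = dask.compute(lazy_mean(1.0), lazy_mean(2.0))
all_numpy = all(isinstance(m, np.ndarray) for m in means)
```

With xarray the same pattern should apply, since dask-backed `DataArray` objects are dask collections too, but that is an assumption that was not checked against the files in the question.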

keepAlive
  • @Fab See the edit. What does `print(tup)` return? – keepAlive Feb 09 '21 at 14:11
  • Computation does not start... No tasks in the dask dashboard – Fab Feb 09 '21 at 14:13
  • I was able to test it. It seems OK! Is this approach better than, or equivalent to, running `client.submit(load_ds, path2)` and then `results = client.gather(futures)`? – Fab Feb 09 '21 at 14:17
  • @Fab Actually, `~.compute` is *synchronous*, [meaning that it blocks the interpreter until it completes](https://distributed.dask.org/en/latest/client.html#dask). So it depends on whether you want to block things until the result is returned. Put differently, I would go for asynchronous techniques, such as [`~.gather`](https://distributed.dask.org/en/latest/client.html#async-await-operation), indeed. – keepAlive Feb 09 '21 at 14:22
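
The `client.submit` / `client.gather` pattern mentioned in the comments can be sketched roughly as follows. This is a minimal illustration, assuming `dask.distributed` is installed and using a hypothetical `square` function in place of `load_ds`; the in-process cluster (`processes=False`) is chosen only to keep the example small.

```python
from dask.distributed import Client


def square(x):
    # Hypothetical stand-in for load_ds
    return x * x


# Start a local, thread-based cluster inside this process.
with Client(processes=False) as client:
    # submit() returns a Future immediately; the work runs in the background.
    futures = [client.submit(square, n) for n in (3, 4)]

    # With this synchronous client, gather() waits until every future
    # has finished and returns the results in the same order.
    results = client.gather(futures)
```

Note that in this form only `submit` is non-blocking: the script can do other work between `submit` and `gather`, but `gather` itself waits for the results.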