
I have a dask.delayed function that takes an xarray.DataArray as an argument and returns one as well.

I'm creating a few of these delayed tasks and passing them to client.compute using dask.distributed. Each call to compute returns a distributed.client.Future representing the data array that will be returned.

My question is:

Is there a way to build a "lazy" data array again from the future without loading the actual data from the worker? My intention is to build a second task graph based on the output of the first computation.

I've seen client.gather but this seems to pull all the data back to the client, which is not what I want.

Here's a small example:

import dask
from distributed import Client
import xarray as xr

# load example data
x = xr.tutorial.open_dataset("air_temperature")

# use first timestep
x_t0 = x.isel(time=0)

# delayed 'processing' function
@dask.delayed
def fun(x):
    return x*2

# init client
client = Client()

# compute on worker
future = client.compute(fun(x_t0))

# when done
print(future)
# <Future: finished, type: xarray.Dataset, key: fun-96cd56f4-4ed3-4eac-ade9-fe3f17e4b8c6>

## now how to get back to lazy xarray from future?
Val

1 Answer


I don't know exactly what you are trying to achieve in the end, and there might be better ways to do it than creating a new array from the future. That said, this will create a new data array from your data. To keep it lazy, do not call compute.

(If you want a dask array instead of an xarray object, drop the xr.DataArray wrapper.)

import dask
from distributed import Client
import xarray as xr

# load example data
x = xr.tutorial.open_dataset("air_temperature")

# use first timestep
x_t0 = x.isel(time=0)

# delayed 'processing' function
@dask.delayed
def fun(x):
    return x*2

# init client
client = Client()

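# Create a lazy xarray object from the persisted result:
import dask.array as da

# persist starts the computation on the workers and returns a Delayed
# that refers to the result there, without copying it to the client
result = client.persist(fun(x_t0))

# fun returns a Dataset here, so pick the variable to wrap in a dask array
lazy_air = da.from_delayed(result["air"].data, shape=x_t0.air.shape, dtype=x_t0.air.dtype)

new_ds = xr.DataArray(lazy_air, coords=x_t0.coords, dims=x_t0.air.dims)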

EDIT: added client.persist so the data stays on the workers instead of being pulled back to the client.
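To show how a second task graph can sit on top of the persisted result, here is a minimal sketch. It makes a few assumptions that are not in the original post: a small synthetic dataset replaces the tutorial data to avoid a download, the client runs in-process (`processes=False`), the `air` variable is selected from the returned Dataset, and `dtype` is passed to `from_delayed`. The follow-up computation (`anomaly`) is purely illustrative.

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr
from distributed import Client

# Synthetic stand-in for the tutorial dataset (assumption, avoids a download).
ds = xr.Dataset(
    {"air": (("lat", "lon"), np.arange(12.0).reshape(3, 4))},
    coords={"lat": [10.0, 20.0, 30.0], "lon": [0.0, 1.0, 2.0, 3.0]},
)


@dask.delayed
def fun(x):
    return x * 2


# In-process cluster keeps the sketch lightweight; any Client should do.
client = Client(processes=False)

# persist starts the work on the cluster and returns a Delayed, not the data.
persisted = client.persist(fun(ds))

# Wrap the "air" variable of the remote result in a lazy dask array.
lazy = da.from_delayed(
    persisted["air"].data, shape=ds.air.shape, dtype=ds.air.dtype
)
lazy_air = xr.DataArray(lazy, coords=ds.coords, dims=ds.air.dims, name="air")

# Second task graph, built on the first result without gathering it.
anomaly = lazy_air - lazy_air.mean("lon")

# The data only reaches the client at this point.
result = anomaly.compute()
client.close()
```

Because `from_delayed` cannot inspect the remote object, the shape and dtype have to be known up front; if `fun` changes either of them, the values passed here need to change accordingly.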


n4321d
  • Thanks, but I guess I should have been more explicit: I need the metadata (coordinates, etc.) from the data array, so just plainly wrapping a dask array in an `xarray.DataArray` is not a solution for me. +1 for a workable solution – Val Jun 21 '21 at 22:02
  • Thanks, I hope this works: x['new'] = (tuple(x_t0.coords)[:2], da.from_delayed(client.persist(fun(x_t0)), shape=x_t0.air.shape, meta='f8')) ? This will create a new variable in your dataset, over the existing dimensions. – n4321d Jun 21 '21 at 22:13
  • Or you can use: new_ds = xr.DataArray(da.from_delayed(client.persist(fun(x_t0)), shape=x_t0.air.shape, meta='f8')); new_ds.assign_coords(x.coords) to create a new ds with the array from the delayed in it? – n4321d Jun 21 '21 at 22:16
  • The example in your second comment does pretty much what I want (and what I hoped there could be a native way to do). One note: you could put everything in one call if you call `xr.DataArray` with `coords=x.coords`. If you'd like to add this to your solution, I'd accept it as an answer! Thanks again! – Val Jun 22 '21 at 19:20
  • Ok great, I added it. Thanks for pointing that out! – n4321d Jun 22 '21 at 19:24
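
The two variants discussed in the comments can be sketched roughly as follows. This is an illustration under assumptions, not the commenters' exact code: it uses a small synthetic dataset and an in-process client, passes `dims` explicitly so the coordinates line up, and selects the `air` variable from the Dataset that `fun` returns. Note that `assign_coords` returns a new object rather than modifying the array in place.

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr
from distributed import Client

# Synthetic stand-in for the tutorial dataset (assumption, avoids a download).
ds = xr.Dataset(
    {"air": (("lat", "lon"), np.ones((2, 3)))},
    coords={"lat": [0.0, 1.0], "lon": [10.0, 20.0, 30.0]},
)


@dask.delayed
def fun(x):
    return x * 2


client = Client(processes=False)
persisted = client.persist(fun(ds))
lazy = da.from_delayed(
    persisted["air"].data, shape=ds.air.shape, dtype=ds.air.dtype
)

# Variant 1: build the DataArray first, then attach the coordinates.
# assign_coords returns a new object, so the result has to be kept.
bare = xr.DataArray(lazy, dims=ds.air.dims)
with_coords = bare.assign_coords(ds.coords)

# Variant 2: add the lazy array to the existing Dataset as a new variable,
# which then shares the Dataset's coordinates.
ds["new"] = (ds.air.dims, lazy)

is_lazy = isinstance(ds["new"].data, da.Array)
total = float(ds["new"].sum().compute())
client.close()
```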