I have a function that works on a one dimensional array (timeseries) and returns a modified version of the array, plus a coefficient.
I'm using xarray.apply_ufunc
to apply this function in parallel to my 3D dask
array.
The desired output is a xarray
dataset that has the modified 3D array, plus a new variable that contains the coefficient for each timeseries (so essentially a 2D array).
In the example below, the function fun
simulates the behavior of the real function by returning the input data multiplied by two plus a random integer.
import xarray as xr
import numpy as np
# function
def fun(x):
return x*2, np.random.randint(1000, size=(1,))
# test dataset
ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": -1}).isel(time=slice(100,200))
print(ds)
# <xarray.Dataset>
# Dimensions: (lat: 25, lon: 53, time: 100)
# Coordinates:
# * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
# * lon (lon) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
# * time (time) datetime64[ns] 1800-01-01 1800-01-01 ... 1800-01-01
# Data variables:
# air (time, lat, lon) float32 dask.array<chunksize=(100, 25, 53), meta=np.ndarray>
# Attributes:
# Conventions: COARDS
# title: 4x daily NMC reanalysis (1948)
# description: Data is from NMC initialized reanalysis\n(4x/day). These a...
# platform: Model
# references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
Now to apply it to the dataset:
np.random.seed(42)
ds2 = xr.apply_ufunc(
fun,
ds,
input_core_dims=[['time']],
output_core_dims=[["time"],[]],
vectorize=True,
dask="parallelized"
)
This produces almost what I need. I get a tuple with two datasets, one the 3D array and the other the 2D array with the coefficients.
(<xarray.Dataset>
Dimensions: (lat: 25, lon: 53, time: 100)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
* time (time) datetime64[ns] 1800-01-01 1800-01-01 ... 1800-01-01
Data variables:
air (lat, lon, time) float32 496.78 496.78 493.59998 ... 0.0 0.0 0.0,
<xarray.Dataset>
Dimensions: (lat: 25, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
Data variables:
air (lat, lon) int64 120 721 93 288 135 478 ... 380 497 881 102 485 814)
But what I really want is a combined dataset, that can be computed as a whole afterwards with dask
. I've tried modifying input_core_dims
, output_core_dims
, output_sizes
, ... etc but I couldn't get the result that I want.