
I have a function that works on a one-dimensional array (a timeseries) and returns a modified version of the array plus a coefficient.

I'm using xarray.apply_ufunc to apply this function in parallel to my 3D dask array.

The desired output is an xarray Dataset containing the modified 3D array plus a new variable that holds the coefficient for each timeseries (so essentially a 2D array).

In the example below, the function fun simulates the behavior of the real function by returning the input data multiplied by two, together with a random integer as the coefficient.

import xarray as xr
import numpy as np

# function: returns the modified timeseries plus one random coefficient
def fun(x):
    return x * 2, np.random.randint(1000, size=(1,))

# test dataset
ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": -1}).isel(time=slice(100,200))

print(ds)

# <xarray.Dataset>
# Dimensions:  (lat: 25, lon: 53, time: 100)
# Coordinates:
#   * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
#   * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
#   * time     (time) datetime64[ns] 2013-01-26 ... 2013-02-19T18:00:00
# Data variables:
#     air      (time, lat, lon) float32 dask.array<chunksize=(100, 25, 53), meta=np.ndarray>
# Attributes:
#     Conventions:  COARDS
#     title:        4x daily NMC reanalysis (1948)
#     description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
#     platform:     Model
#     references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
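As a quick sanity check of fun's two-part return value (nothing here beyond the definition above, applied to a tiny array):

```python
import numpy as np

def fun(x):
    return x * 2, np.random.randint(1000, size=(1,))

doubled, coeff = fun(np.arange(3))
print(doubled)      # [0 2 4]
print(coeff.shape)  # (1,)
```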

Now to apply it to the dataset:


np.random.seed(42)

ds2 = xr.apply_ufunc(
    fun,
    ds,
    input_core_dims=[['time']],
    output_core_dims=[["time"],[]],
    vectorize=True,
    dask="parallelized"
)
    

This produces almost what I need: a tuple of two datasets, one containing the modified 3D array and the other the 2D array of coefficients.

(<xarray.Dataset>
 Dimensions:  (lat: 25, lon: 53, time: 100)
 Coordinates:
   * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
   * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
   * time     (time) datetime64[ns] 2013-01-26 ... 2013-02-19T18:00:00
 Data variables:
     air      (lat, lon, time) float32 496.78 496.78 493.59998 ... 0.0 0.0 0.0,
 <xarray.Dataset>
 Dimensions:  (lat: 25, lon: 53)
 Coordinates:
   * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
   * lon      (lon) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
 Data variables:
     air      (lat, lon) int64 120 721 93 288 135 478 ... 380 497 881 102 485 814)
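One way to combine the two returned datasets after the fact is a rename plus merge, since the variable in the second dataset is still named air. A sketch on a tiny synthetic dataset (no dask, so it runs without the tutorial download):

```python
import numpy as np
import xarray as xr

def fun(x):
    return x * 2, np.random.randint(1000, size=(1,))

# tiny in-memory stand-in for the tutorial dataset
ds = xr.Dataset({"air": (("time", "lat", "lon"), np.ones((4, 2, 3)))})

modified, coeff = xr.apply_ufunc(
    fun,
    ds,
    input_core_dims=[["time"]],
    output_core_dims=[["time"], []],
    vectorize=True,
)

# the coefficient dataset inherits the name "air"; rename, then merge
combined = xr.merge([modified, coeff.rename({"air": "coeff"})])
print(combined.data_vars)  # air (lat, lon, time) and coeff (lat, lon)
```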

But what I really want is a single combined dataset that can then be computed as a whole with dask. I've tried modifying input_core_dims, output_core_dims, output_sizes, etc., but I couldn't get the result I want.

  • Thanks for the clear example! Just in terms of the intended output — how would the datasets be combined? They currently have an overlapping `air` variable – Maximilian Nov 11 '20 at 21:34
  • @Maximilian Thanks for your comment! The `air` variable in the second dataset (2d) can be renamed, that's just how it comes out from `apply_ufunc`. It's in reality a coefficient derived from `air`. Ideally the combination happens in `apply_ufunc`, the return value being a single dataset with 2 variables. – Val Nov 11 '20 at 22:31
  • I see! I think the best approach will be a wrapper function that does this merging — I don't think there's a way of doing that _within_ the `apply_ufunc`, and wrappers are fairly common: http://xarray.pydata.org/en/stable/generated/xarray.apply_ufunc.html – Maximilian Nov 12 '20 at 03:09

1 Answer


Thanks to the helpful comments of @Maximilian, I have a working solution using a wrapper.

Since getting a combined dataset straight out of apply_ufunc seems unlikely, I'll post this as an answer (and keep the possibility open for someone to post an improvement).

Using

def wrap(ds):
    ds2, coeff = xr.apply_ufunc(
        fun,
        ds.air,
        input_core_dims=[['time']],
        output_core_dims=[["time"], []],
        vectorize=True,
        dask="parallelized"
    )

    # ds2 keeps the name "air", so to_dataset() works without arguments;
    # the coefficients go in as a second variable
    ds_out = ds2.to_dataset()
    ds_out["coeff"] = coeff

    return ds_out

Gives me

wrap(ds)

#<xarray.Dataset>
#Dimensions:  (lat: 25, lon: 53, time: 100)
#Coordinates:
#  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
#  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
#  * time     (time) datetime64[ns] 2013-01-26 ... 2013-02-19T18:00:00
#Data variables:
#    air      (lat, lon, time) float32 dask.array<chunksize=(25, 53, 100), meta=np.ndarray>
#    coeff    (lat, lon) int64 dask.array<chunksize=(25, 53), meta=np.ndarray>
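For completeness, the same wrapper pattern also runs end to end on a small in-memory dataset (synthetic data and no dask="parallelized" here, so the tutorial download isn't needed):

```python
import numpy as np
import xarray as xr

def fun(x):
    return x * 2, np.random.randint(1000, size=(1,))

def wrap(ds):
    # apply fun along "time", getting the modified array and the coefficients
    ds2, coeff = xr.apply_ufunc(
        fun,
        ds.air,
        input_core_dims=[["time"]],
        output_core_dims=[["time"], []],
        vectorize=True,
    )
    ds_out = ds2.to_dataset()
    ds_out["coeff"] = coeff
    return ds_out

# tiny stand-in for the tutorial dataset
ds = xr.Dataset({"air": (("time", "lat", "lon"), np.ones((4, 2, 3)))})
out = wrap(ds)
print(out["coeff"].dims)  # ('lat', 'lon')
```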