1

My goal is to compute a derivative of a moving window of a multidimensional dataset along a given dimension, where the dataset is stored as Xarray DataArray or DataSet.

In the simplest case, given a 2D array I would like to compute a moving difference across multiple entries in one dimension, e.g.:

data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6) ).reshape(10,6)
T=3
reducedArray = np.zeros_like(data)
for i in range(data.shape[1]):
    if i < T:
        reducedArray[:,i] = data[:,i] - data[:,0]
    else:
        reducedArray[:,i] = data[:,i] - data[:,i-T]

where the if i <T condition ensures that input and output contain proper values (i.e., no nans) and are of identical shape.

Xarray's diff aims to perform a finite-difference approximation of a given derivative order using nearest-neighbours, so it is not suitable here, hence the question:

Is it possible to perform this operation using Xarray functions only?

The rolling weighted average example appears to be something similar, but still too distinct due to the usage of NumPy routines. I've been thinking that something along the lines of the following should work:

xr2DDataArray = xr.DataArray(
    data,
    dims=('x','y'),
    coords={'x':np.linspace(0,1,10), 'y':np.linspace(1,4,6)}
)
r = xr2DDataArray.rolling(x=T,min_periods=2)
r.reduce( redFn )

I am struggling with the definition of redFn here ,though.

Caveat The actual dataset to which the operation is to be applied will have a size of ~10GiB, so a solution that does not blow up the memory requirements will be highly appreciated!


Update/Solution

Using Xarray rolling

After sleeping on it and a bit more fiddling the post linked above actually contains a solution. To obtain a finite difference we just have to define the weights to be $\pm 1$ at the ends and $0$ else:

def fdMovingWindow(data, **kwargs):
    T = kwargs['T'];
    del kwargs['T'];
    weights = np.zeros(T)
    weights[0] = -1
    weights[-1] = 1
    axis = kwargs['axis']
    if data.shape[axis] == T:
        return np.sum(data * weights, **kwargs)
    else:
        return 0

r.reduce(fdMovingWindow, T=4)

alternatively, using construct and a dot product:

weights = np.zeros(T)
weights[0] = -1
weights[-1] = 1
xrWeights = xr.DataArray(weights, dims=['window'])
xr2DDataArray.rolling(y=T,min_periods=1).construct('window').dot(xrWeights)

This carries a massive caveat: The procedure essentially creates a list arrays representing the moving window. This is fine for a modest 2D / 3D array, but for a 4D array that takes up ~10 GiB in memory this will lead to an OOM death!

Simplicistic - memory efficient

A less memory-intensive way is to copy the array and work in a way similar to NumPy's arrays:

xrDiffArray = xr2DDataArray.copy()
dy = xr2DDataArray.y.values[1]  - xr2DDataArray.y.values[0] #equidistant sampling
for src in xr2DDataArray:
    if src.y.values < xr2DDataArray.y.values[0] + T*dy:
        xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.values[0]
    else:
        xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.sel(y = src.y.values - dy*T).values

This will produce the intended result without dimensional errors, but it requires a copy of the dataset.

I was hoping to utilise Xarray to prevent a copy and instead just chain operations that are then evaluated if and when values are actually requested.

A suggestion as to how to accomplish this will still be welcomed!

Nox
  • 324
  • 3
  • 18
  • I have never used xarray. Isn't it built on top of numpy? Don't numpy functions work on xarrays? – azelcer Feb 04 '22 at 18:04
  • To an extent NumPy's functions do work, but NumPy/SciPy functions will have to be wrapped in Xarray's caller routines (e.g. `reduce`), unless we want to ignore the additional properties and go straight for the raw data manipulation using `.data` attribute. – Nox Feb 07 '22 at 12:25
  • How will you assign the additional properties. For example, what are the coords of the difference? Midopoints? Start point (as it is now) – azelcer Feb 07 '22 at 14:15

1 Answers1

0

I have never used xarray, so maybe I am mistaken, but I think you can get the result you want avoiding using loops and conditionals. This is at least twice faster than your example for numpy arrays:

data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6)).reshape(10,6)
reducedArray = np.empty_like(data)
reducedArray[:, T:] = data[:, T:] - data[:, :-T]
reducedArray[:, :T] = data[:, :T] - data[:, 0, np.newaxis]

I imagine the improvement will be higher when using DataArrays. It does not use xarray functions but neither depends on numpy functions. I am confident that translating this to xarray will be straightforward, I know that it works if there are no coords, but once you include them, you get an error because of the coords mismatch (coords of data[:, T:] and of data[:, :-T] are different). Sadly, I can't do better now.

azelcer
  • 1,383
  • 1
  • 3
  • 7
  • It's possible to do this with a `DataArray` - I have edited my post with an appropriate solution. But this requires a copy of the data. I was hoping to only keep the original (raw) data and chain evaluations of the arrays to then use them in interactive data exploration with HoloViews. – Nox Feb 06 '22 at 23:53
  • Well, in your first example you *did* create a new array (`reducedArray`), so I assumed you needed to keep both arrays. In that case, there was no memory allocated that was not needed. Now the question has changed completely. What is the higher amount of extra memory that you are willing to use for this? – azelcer Feb 07 '22 at 13:12
  • How does HoloViews chain with xarray? – azelcer Feb 07 '22 at 13:41
  • As I understand it HoloViews wraps an Xarray object without effecting a copy/evaluation of it (i.e., it's a wrapper with a reference to the `DataArray` or `DataSet`). The evaluation takes place only at the moment of plotting, which is great for me since I aim to plot only certain views of the very large dataset. – Nox Feb 09 '22 at 07:33
  • Well, that really does not explain how it chains with `xarray`. There is no mention of holoviews in the question. I'd suggest to open a new question, where you state precisely what you want to ask. – azelcer Feb 11 '22 at 00:01