
I'm new to xarray and machine learning.

I have an xarray Dataset as follows:

<xarray.Dataset>
Dimensions:    (latitude: 721, longitude: 1440, time: 72)
Coordinates:
  * time       (time) datetime64[ns] 1950-01-01 1951-01-01 ... 2021-01-01
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.25 359.5 359.75
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
Data variables:
    z          (time, latitude, longitude) float32 49654.793 49654.793 ... 49654.793

Now I want to apply linear regression to each grid cell along the time dimension, and then subtract the fitted values from the original values to remove the trend. Below is an example for one sample grid cell.

import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array(jan.z[:, 700, 700])  # single grid cell across all time steps
x = np.arange(1950, 1950 + len(y)).reshape(-1, 1)  # 72 years; same x for every grid cell

reg = LinearRegression().fit(x, y)

pred = reg.predict(x)
y = y - pred  # subtract the fitted trend

This is for just one grid cell, but I have 721*1440 of them, so a for loop won't be the most optimized approach. Is there a more efficient way, or a direct function in xarray, to do this? I looked for something similar but couldn't find anything that solves my problem.

Chris_007
    Does this answer your question? [dask performance apply along axis](https://stackoverflow.com/questions/47314800/dask-performance-apply-along-axis) – Gabe Nov 15 '21 at 18:13

1 Answer


xarray now has a polyfit method (see xr.DataArray.polyfit and xr.Dataset.polyfit). It integrates well with dask and performs reasonably well even for large arrays.

Example

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import xarray as xr
   ...: import dask.array as dda

In [2]: # generate a large random dask array
   ...: arr = dda.random.random((100, 1000, 1000), chunks=(-1, 100, 100)).astype('float32')

In [3]: da = xr.DataArray(
   ...:     arr,
   ...:     dims=['time', 'y', 'x'],
   ...:     coords=[pd.date_range('1950-01-01', freq='Y', periods=100),
   ...:             np.arange(1000), np.arange(1000)])

In [4]: da
Out[4]:
<xarray.DataArray 'astype-5e6d16cb0909caa0098a5c92f0770801' (time: 100,
                                                             y: 1000, x: 1000)>
dask.array<astype, shape=(100, 1000, 1000), dtype=float32, chunksize=(100, 100, 100), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 1950-12-31 1951-12-31 ... 2049-12-31
  * y        (y) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
  * x        (x) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999

Call polyfit along the time dimension with deg=1 for a linear least-squares regression:

In [5]: da.polyfit('time', deg=1)
Out[5]:
<xarray.Dataset>
Dimensions:               (degree: 2, y: 1000, x: 1000)
Coordinates:
  * degree                (degree) int64 1 0
  * y                     (y) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
  * x                     (x) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
Data variables:
    polyfit_coefficients  (degree, y, x) float64 dask.array<chunksize=(2, 10, 1000), meta=np.ndarray>

Full workflow

First, assign your year values as a coordinate on the array, and use swap_dims to replace 'time' with the annual values, so the coefficients are scaled properly (in units per year rather than per nanosecond, which is what you'd get fitting against a datetime64 coordinate).

In [13]: year = np.arange(1950, 2050)

In [14]: da.coords['year'] = ('time', year)

In [15]: da = da.swap_dims({'time': 'year'})

Now you can call polyfit to get the trend in [units/year]:

In [16]: coeffs = da.polyfit('year', deg=1)
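
The slope for each grid cell lives at degree=1 of the resulting polyfit_coefficients variable. A minimal in-memory sketch (no dask; the dimension names and the exact slope of 3.0 are made up for illustration):

```python
import numpy as np
import xarray as xr

# Tiny array where every column has an exact linear trend of 3.0 units/year
year = np.arange(2000, 2010)
vals = 3.0 * year[:, None] + np.zeros((10, 4))
da = xr.DataArray(vals, dims=["year", "x"], coords={"year": year})

coeffs = da.polyfit("year", deg=1)
slope = coeffs.polyfit_coefficients.sel(degree=1)  # trend in units/year, per cell
```

Selecting degree=1 drops the degree dimension, leaving one slope per grid cell.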

xr.polyval can be used to predict the values from the resulting coefficients:

In [17]: pred = xr.polyval(da.year, coeffs.chunk({'x': 100, 'y': 100}))

In [18]: pred
Out[18]:
<xarray.Dataset>
Dimensions:               (y: 1000, x: 1000, year: 100)
Coordinates:
  * y                     (y) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
  * x                     (x) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
  * year                  (year) int64 1950 1951 1952 1953 ... 2047 2048 2049
Data variables:
    polyfit_coefficients  (year, y, x) float64 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>

Subtracting the prediction from the original array gives the detrended values (the variation not explained by the linear trend):

In [19]: unexplained = da - pred

In [20]: unexplained
Out[20]:
<xarray.Dataset>
Dimensions:               (y: 1000, x: 1000, year: 100)
Coordinates:
  * y                     (y) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
  * x                     (x) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
  * year                  (year) int64 1950 1951 1952 1953 ... 2047 2048 2049
    time                  (year) datetime64[ns] 1950-12-31 ... 2049-12-31
Data variables:
    polyfit_coefficients  (year, y, x) float64 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
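
Putting it all together for a case shaped like the question, here is a self-contained sketch on a small in-memory array (the coordinate names and the synthetic 2.0 units/year trend are illustrative, not from the original data):

```python
import numpy as np
import xarray as xr

# Synthetic (time, latitude, longitude) data with a known linear trend
rng = np.random.default_rng(0)
years = np.arange(1950, 2022)                      # 72 years, as in the question
trend = 2.0 * (years - years[0])                   # slope of 2.0 units/year
noise = rng.normal(0, 0.1, size=(72, 4, 5))
data = trend[:, None, None] + noise

da = xr.DataArray(
    data,
    dims=["year", "latitude", "longitude"],
    coords={"year": years,
            "latitude": np.linspace(90, -90, 4),
            "longitude": np.linspace(0, 359, 5)},
    name="z",
)

coeffs = da.polyfit("year", deg=1)                       # per-cell linear fit
pred = xr.polyval(da.year, coeffs.polyfit_coefficients)  # fitted values
detrended = da - pred                                    # remove the trend
```

Re-fitting the detrended array should give a slope of essentially zero at every grid cell, which is a handy sanity check.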
Michael Delgado