
Does Xarray support numpy computation functions such as polyfit? Or is there an efficient way to apply such functions to datasets?

Example: I want to calculate the slope of a line fitted to two variables (Temperature and Height), to calculate a lapse rate. I have a dataset (below) with these two variables with dimensions of (vertical, time, xgrid_0, ygrid_0).

<xarray.Dataset>
Dimensions:    (PressLev: 7, time: 48, xgrid_0: 685, ygrid_0: 485)
Coordinates:
    gridlat_0  (ygrid_0, xgrid_0) float32 44.6896 44.6956 44.7015 44.7075 ...
    gridlon_0  (ygrid_0, xgrid_0) float32 -129.906 -129.879 -129.851 ...
  * ygrid_0    (ygrid_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
  * xgrid_0    (xgrid_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
  * time       (time) datetime64[ns] 2016-08-15T01:00:00 2016-08-15T02:00:00 ...
  * PressLev   (PressLev) int64 0 1 2 3 4 5 6
Data variables:
    Temperature  (PressLev, time, ygrid_0, xgrid_0) float64 289.4 289.4 289.4 ...
    Height       (PressLev, time, ygrid_0, xgrid_0) float64 85.23 85.13 84.98 ...

If I extract the Temperature and Height for a given time, xgrid_0 and ygrid_0, I can use the numpy.polyfit function.

import numpy as np

# Quick way to make a DataArray of -9999s with the correct (time, ygrid_0, xgrid_0) dims/coords
ds_LR = ds.Temperature[0] * 0 - 9999
for cts in np.arange(0, len(ds.time)):
    for cx in ds.xgrid_0.values:
        for cy in ds.ygrid_0.values:
            x_temp = ds.Temperature[:, cts, cy, cx]  # grab the vertical profile of air temperature
            y_hgt = ds.Height[:, cts, cy, cx]        # grab the vertical heights of those temperature values
            s = np.polyfit(y_hgt, x_temp, 1)         # fit a line to the profile
            ds_LR[cts, cy, cx] = s[0]                # keep the slope (first element)

But this is slow and inefficient. Any suggestions for a better approach?

nicway
  • numpy.polyfit expects a one-dimensional x coordinate and can take the y values as a 2-D array, one column per fit. So you can certainly reshape your many-dimensional array to match this requirement, fit everything in one call, select the element you want, and reshape back to the original (a short sketch follows after these comments). – mdurant Aug 22 '16 at 21:07
  • I think the answer is in .reduce(), but you have to play with the kwargs. – claude Apr 13 '18 at 18:46
  • I wanted to come back to this and direct you to xr.apply_ufunc(), which I tried to use yesterday: http://xarray.pydata.org/en/stable/generated/xarray.apply_ufunc.html. It is not necessarily faster than doing a loop (it isn't in my case), especially if you use vectorize=True (which I have to). I settled on using chain.from_iterable(zip(*dataset)) and looping, but you have to have xgrid and ygrid as the first two dimensions (I used the transpose method). I will try to write an answer as soon as I find the time, but you can give it a try. – claude Apr 19 '18 at 14:58
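
A minimal sketch of the reshape idea from the first comment above (the array sizes and the shared vertical coordinate z are made up for illustration; they are not the question's data). The caveat is that np.polyfit only accepts a 2-D y when every column shares the same 1-D x, so this works for a common vertical coordinate but not for the per-gridpoint heights in the question:

import numpy as np

# Illustrative sizes, smaller than the question's grid
nlev, ntime, ny, nx = 7, 48, 50, 60
temp = np.random.randn(nlev, ntime, ny, nx)  # stand-in for Temperature
z = np.linspace(0.0, 6000.0, nlev)           # one vertical coordinate shared by every column

temp2d = temp.reshape(nlev, -1)              # (nlev, ntime*ny*nx): one column per vertical profile
slopes = np.polyfit(z, temp2d, 1)[0]         # fit every column in one call; row 0 holds the slopes
slopes = slopes.reshape(ntime, ny, nx)       # back to the original (time, y, x) layout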

2 Answers


This is becoming a pretty common question among xarray users (myself included) as far as I can tell, and is closely related to this GitHub issue. Generally, a NumPy implementation of some function exists (in your case, np.polyfit()), but it's not clear how best to apply this calculation to every grid cell, possibly over multiple dimensions.

In a geoscience context, there are two main use cases: one with a simple solution, and another that is more complicated:

(1) Easy case:

You have an xr.DataArray of temp, which is a function of (time, lat, lon), and you want to find the trend in time in each gridbox. The easiest way to do this is to stack the (lat, lon) coords into one new coordinate, group by that coordinate, and then use the .apply() method.

Inspired by this Gist from Ryan Abernathey: <3

import numpy as np
import xarray as xr

# Example data
da = xr.DataArray(np.random.randn(20, 180, 360),
                  dims=('time', 'lat', 'lon'),
                  coords={'time': np.linspace(0, 19, 20),
                          'lat': np.linspace(-90, 90, 180),
                          'lon': np.linspace(0, 359, 360)})

# define a function to compute a linear trend of a timeseries
def linear_trend(x):
    pf = np.polyfit(x.time, x, 1)
    # need to return an xr.DataArray for groupby
    return xr.DataArray(pf[0])

# stack lat and lon into a single dimension called allpoints
stacked = da.stack(allpoints=['lat','lon'])
# apply the function over allpoints to calculate the trend at each point
trend = stacked.groupby('allpoints').apply(linear_trend)
# unstack back to lat lon coordinates
trend_unstacked = trend.unstack('allpoints')

Downsides: This method becomes very slow for larger arrays and doesn't easily carry over to other problems that feel quite similar in essence. This leads us to...

(2) Harder case (and the OP's question):

You have an xr.Dataset with variables temp and height, each a function of (plev, time, lat, lon), and you want to find the regression of temp against height (the lapse rate) at each (time, lat, lon) point.

The easiest way around this is to use xr.apply_ufunc(), which gives you some degree of vectorization and dask-compatibility. (Speed!)

# Example DataArray (the same random array will stand in for both variables)
da1 = xr.DataArray(np.random.randn(20, 20, 180, 360),
                   dims=('plev', 'time', 'lat', 'lon'),
                   coords={'plev': np.linspace(0, 19, 20),
                           'time': np.linspace(0, 19, 20),
                           'lat': np.linspace(-90, 90, 180),
                           'lon': np.linspace(0, 359, 360)})

# Create the dataset
ds = xr.Dataset({'Temp': da1, 'Height': da1})

As before, we create a function to compute the linear trend we need:

def linear_trend(x, y):
    pf = np.polyfit(x, y, 1)
    return xr.DataArray(pf[0])

Now, we can use xr.apply_ufunc() to regress the two DataArrays of temp and height against each other, along the plev dimension!

%%time
slopes = xr.apply_ufunc(linear_trend,
                        ds.Height, ds.Temp,
                        vectorize=True,
                        input_core_dims=[['plev'], ['plev']],  # reduce along 'plev'
                        )

However, this approach is also quite slow and, as before, won't scale up well for larger arrays.

CPU times: user 2min 44s, sys: 2.1 s, total: 2min 46s
Wall time: 2min 48s

Speed it up:

To speed up this calculation, we can convert our height and temp data to dask arrays using xr.DataArray.chunk(). This splits our data into small, manageable chunks, which lets us parallelize the calculation with dask='parallelized' in apply_ufunc().

N.B. You must be careful not to chunk along the dimension that you're applying the regression to!

dask_height = ds.Height.chunk({'lat':10, 'lon':10, 'time': 10})
dask_temp   = ds.Temp.chunk({'lat':10, 'lon':10, 'time': 10})
dask_height

<xarray.DataArray 'Height' (plev: 20, time: 20, lat: 180, lon: 360)>
dask.array<xarray-<this-array>, shape=(20, 20, 180, 360), dtype=float64, chunksize=(20, 10, 10, 10), chunktype=numpy.ndarray>
Coordinates:
  * plev     (plev) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * time     (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * lat      (lat) float64 -90.0 -88.99 -87.99 -86.98 ... 86.98 87.99 88.99 90.0
  * lon      (lon) int64 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359

Now, do the calculation again!

%%time
slopes_dask = xr.apply_ufunc(linear_trend,
                             dask_height, dask_temp,
                             vectorize=True,
                             dask='parallelized',
                             input_core_dims=[['plev'], ['plev']], # reduce along 'plev'
                             output_dtypes=['d'],
                             )
CPU times: user 6.55 ms, sys: 2.39 ms, total: 8.94 ms
Wall time: 9.24 ms

SIGNIFICANT SPEEDUP!

Hope this helps! I learnt a lot trying to answer it :)

Best

EDIT: As pointed out in the comments, to really compare the processing time between the dask and non-dask methods, you should use:

%%time
slopes_dask.compute()

which gives you a comparable wall time to the non-dask method.

However, it's important to point out that operating lazily on the data (i.e. not loading it until you absolutely need it) is much preferred when working with the kind of large datasets you encounter in climate analysis. So I'd still suggest the dask method: you can chain many different operations on the input array, each taking only a few ms, and only at the end do you wait a few minutes to get your finished product out. :)
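
For example (a hypothetical continuation of the example above, not part of the original workflow, with made-up variable names), you can keep chaining cheap lazy operations on the dask-backed result and pay the compute cost only once at the end:

lapse_mean = slopes_dask.mean('time')   # still lazy, returns immediately
lapse_per_km = lapse_mean * 1000.0      # still lazy (pretending the heights were in metres)
final = lapse_per_km.compute()          # the whole pipeline runs here, once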

Andrew Williams
  • To be able to compare the processing time between both cases, for the second one you must use `slopes_dask.compute()`, no? – CamiloEr Mar 06 '20 at 19:44
  • Yes, you're correct! And the `.compute()` part does take about a minute or so, however for most workflows you want to try and operate lazily on the data till the latest possible point, and so in my mind this was still an okay comparison. I'll add an edit though, it's a good point. :) – Andrew Williams Mar 06 '20 at 20:56
  • @CamiloEr see my edit above, the `.compute()` method takes roughly the same wall time as before, but if you operate *lazily* then you are able to do multiple operations on a dataset **before** loading the data into memory. And it's the loading into memory which is what takes a bit of time, but this time won't be (to first-order) massively increased by whether you do one or two or three lazy operations on the dataset first. – Andrew Williams Mar 06 '20 at 21:05