6

I have a 4D xarray Dataset. I want to carry out a linear regression between two variables along a specific dimension (here, time), and keep the regression parameters in a 3D array (over the remaining dimensions). I get the results I want with this serial code, but it is rather slow:
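For context, the snippets below assume a dataset shaped roughly like this (all names here — `temperature`, `precipitation`, `slope`, `intercept` — are hypothetical stand-ins for my real ones):

```python
import numpy as np
import xarray as xr

# Hypothetical 4D dataset: variable and coordinate names are placeholders
# standing in for the real ones used in the snippets that follow.
rng = np.random.default_rng(0)
dims = ("year", "latitude", "longitude", "duration")
shape = (10, 3, 4, 2)
ds = xr.Dataset(
    {"temperature": (dims, rng.random(shape)),
     "precipitation": (dims, rng.random(shape))},
    coords={"year": np.arange(2000, 2010),
            "latitude": [0.0, 1.0, 2.0],
            "longitude": [0.0, 1.0, 2.0, 3.0],
            "duration": [1, 2]},
)
x, y = "temperature", "precipitation"   # variables to regress
sl, inter = "slope", "intercept"        # names for the result arrays
```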

import numpy as np
import scipy.stats

# add empty arrays to store the results of the regression
res_shape = tuple(v for k, v in ds[x].sizes.items() if k != 'year')
res_dims = tuple(k for k in ds[x].sizes if k != 'year')
ds[sl] = (res_dims, np.empty(res_shape, dtype='float32'))
ds[inter] = (res_dims, np.empty(res_shape, dtype='float32'))
# iterate over the kept dimensions
for lat in ds.coords['latitude']:
    for lon in ds.coords['longitude']:
        for duration in ds.coords['duration']:
            locator = {'longitude': lon, 'latitude': lat, 'duration': duration}
            sel = ds.loc[locator]
            res = scipy.stats.linregress(sel[x], sel[y])
            ds[sl].loc[locator] = res.slope
            ds[inter].loc[locator] = res.intercept

How could I speed-up and parallelize this operation?

I understand that apply_ufunc might be an option (and could be parallelized with dask), but I did not manage to get the parameters right.

The following questions are related but have no answer:

Edit 2: move previous edit to an answer

LCT
  • Aren't latitude and longitude meant to be considered together? Given two points (x1,y1) and (x2,y2), why do you iterate over (x1,y1), (x1,y2), (x2,y1), (x2,y2)? – John Zwinck Aug 30 '18 at 12:32
  • @JohnZwinck Two points could be at the same latitude, with different longitudes. – LCT Aug 30 '18 at 13:23
  • Yes. Shouldn't you iterate over (x,y) pairs, rather than the cartesian product of all x's with all y's? – John Zwinck Aug 30 '18 at 13:41

2 Answers

6

The previous answer by LCT covers most of what should be said here; however, I think it is possible to use dask='parallelized' with multiple outputs like the ones you get from scipy.stats.linregress.

The trick here is to stack the multiple outputs into one array and return that. You'll also have to use the output_core_dims kwarg to specify that the DataArray returned by the apply_ufunc() call now has an extra dimension:

import numpy as np
import xarray as xr
from scipy import stats

def new_linregress(x, y):
    # Wrapper around scipy.stats.linregress to use in apply_ufunc
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    return np.array([slope, intercept, r_value, p_value, std_err])

# return a new DataArray with an extra "parameter" dimension
# (named `results` rather than `stats`, to avoid shadowing the scipy import)
results = xr.apply_ufunc(new_linregress, ds[x], ds[y],
                         input_core_dims=[['year'], ['year']],
                         output_core_dims=[["parameter"]],
                         vectorize=True,
                         dask="parallelized",
                         output_dtypes=['float64'],
                         output_sizes={"parameter": 5},
                         )

N.B. This approach currently only works with dask='parallelized' if you have dask<2.0, but multiple outputs seem to work fine with something else like dask='allowed'. Have a look at this GitHub issue for more detail.

Hope it helps!

Edit: I've been informed that the dask<2.0 issue has been fixed as long as you have xarray>=0.15.0, so you can now use dask='parallelized' to speed things up. :)
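As a self-contained sketch of the whole workflow (toy 2D data and hypothetical variable names, just to keep it short), the stacked "parameter" dimension can be split back into named variables afterwards:

```python
import numpy as np
import xarray as xr
from scipy import stats

def new_linregress(x, y):
    # wrapper around scipy.stats.linregress for apply_ufunc
    res = stats.linregress(x, y)
    return np.array([res.slope, res.intercept, res.rvalue, res.pvalue, res.stderr])

# toy dataset with hypothetical names: regress along "year", keep "latitude"
rng = np.random.default_rng(0)
ds = xr.Dataset({"x": (("year", "latitude"), rng.random((20, 3))),
                 "y": (("year", "latitude"), rng.random((20, 3)))})

stacked = xr.apply_ufunc(new_linregress, ds["x"], ds["y"],
                         input_core_dims=[["year"], ["year"]],
                         output_core_dims=[["parameter"]],
                         vectorize=True)

# split the stacked "parameter" dimension back into separate variables
for i, name in enumerate(["slope", "intercept", "r_value", "p_value", "std_err"]):
    ds[name] = stacked.isel(parameter=i)
```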

Andrew Williams
    This answer is awesome! Was trying to do a similar linear regression problem along a time dimension and racking my head looking at http://xarray.pydata.org/en/v0.15.1/examples/apply_ufunc_vectorize_1d.html. The trick with putting everything into a numpy array to avoid the "NotImplementedError" is excellent, just need to split them out after :D – weiji14 May 27 '20 at 07:44
5

It is possible to apply scipy.stats.linregress (and other non-ufuncs) to an xarray Dataset using apply_ufunc() by passing vectorize=True, like so:

import scipy.stats
import xarray as xr

# return a tuple of DataArrays (slope, intercept, r_value, p_value, std_err)
res = xr.apply_ufunc(scipy.stats.linregress, ds[x], ds[y],
                     input_core_dims=[['year'], ['year']],
                     output_core_dims=[[], [], [], [], []],
                     vectorize=True)
# add the data to the existing dataset
# (array_names lists the variable names to store each parameter under)
for arr_name, arr in zip(array_names, res):
    ds[arr_name] = arr

Although still serial, apply_ufunc is around 36x faster than the loop implementation in this specific case.

However, parallelization with dask is not yet implemented for multiple outputs like those from scipy.stats.linregress:

NotImplementedError: multiple outputs from apply_ufunc not yet supported with dask='parallelized'

LCT
  • I am trying this with sp.stats.ranksums but with no luck. I am getting a ValueError about the dimension that I fix (your 'year', which is 'year' for me too) because the two datasets have different year values. – claude Sep 19 '18 at 19:32