I'm struggling to find a solution to the following problem. I have 3 different xarray dataarrays (ds_pr_cru; ds_tas_cru; ds_nep), each one containing a variable. I want to perform a liear ridge regression between the three variables on a specific dimension (time), having the first two xarrays as independent variables (X1 and X2) and the third one as dependent variable (Y). Now, I've seen from here that it is possible to use xr.apply_ufunc: indeed this works if I perform both ols and ridge regression, BUT having only X1 and Y, as below.
However, I cannot find the way to implement a second predictor in the regression (but of course I need more than two, eventually). Probably I overlooked or I'm structuring the analysis in the wrong way, but it feels strange that nobody have asked how to perform a multilinear regression like this. I can only find question involving multidimensional datasets, rather than how to use multiple datasets for regressing xarrays.
I tried the following, that works:
from sklearn import linear_model
def new_regress(x, y):
# Wrapper around scipy Ridge to use in apply_ufunc
ridge = linear_model.Ridge()
model = ridge.fit(x.reshape(-1, 1),y.reshape(-1, 1))
slope = model.coef_
return slope
# return a new DataArray
test = xr.apply_ufunc(new_regress,
ds_pr_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)), # spatial subset to speed-up example
ds_nep.sel(lon = slice(-60,-50), lat = slice(-20,-10)), # spatial subset to speed-up example
input_core_dims=[['time'], ['time']],
output_core_dims=[["parameter"]],
vectorize=True,
dask="parallelized",
output_dtypes=['float64'],
output_sizes={"parameter": 5},
)
print(test)
<xarray.DataArray (lat: 20, lon: 20, parameter: 1)>
array([[[24063.30273438],
[22831.70703125],
[20362.62695312],
[12655.3125 ],
[ 7683.64453125],
[ 9202.81933594],
[ 8463.17675781],
[ 9290.00878906],
[ 8692.11425781],
[10670.50390625],
[11069.19238281],
[10303.24902344],
[ 8438.82324219],
[ 8534.91503906],
[ 7817.84326172],
[ 7811.36621094],
[ 8132.60595703],
[ 7732.609375 ],
[ 8964.26367188],
[ 8884.00488281]],
...
[[ 3666.24145508],
[ 4158.28515625],
[ 4083.93017578],
...
Coordinates:
* lon (lon) float64 -59.75 -59.25 -58.75 -58.25 ... -51.25 -50.75 -50.25
* lat (lat) float64 -19.75 -19.25 -18.75 -18.25 ... -11.25 -10.75 -10.25
Dimensions without coordinates: parameter
If I try to similarly implement a multilinear regression:
from sklearn.linear_model import LinearRegression
def new_ridgereg(x1, x2, y):
# Wrapper around scipy linregress to use in apply_ufunc
ridge = linear_model.Ridge()
model = ridge.fit(x1.reshape(-1, 1),x2.reshape(-1, 1), y.reshape(-1, 1))
r2 = ridge.score(x1,x2,y)
ypred = ridge.predict(x1,x2)
coef = model.coef_
pval = stats.coef_pval(model,x1,x2,y)
return np.array([coef, pval, r2, ypred])
# return a new DataArray
test = xr.apply_ufunc(new_ridgereg,
ds_pr_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
ds_tas_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
ds_nep.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
input_core_dims=[['time'], ['time'],['time']],
output_core_dims=[["parameter"]],
vectorize=True,
dask="parallelized",
output_dtypes=['float64'],
output_sizes={"parameter": 4},
)
test
but I get the following error:
ValueError Traceback (most recent call last)
g:\Shared drives\FLUXNET - FLUXCOM Unive\Code\TWS_semiarid.ipynb Cell 34 in 1
11 return np.array([coef, pval, r2, ypred])
13 # return a new DataArray
---> 14 test = xr.apply_ufunc(new_ridgereg,
15 ds_pr_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
16 ds_tas_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
17 ds_nep.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
18 input_core_dims=[['time'], ['time'],['time']],
19 output_core_dims=[["parameter"]],
20 vectorize=True,
21 dask="parallelized",
22 output_dtypes=['float64'],
23 output_sizes={"parameter": 4},
24 )
File c:\Users\mastr\miniconda3\envs\geo_clim\lib\site-packages\xarray\core\computation.py:1204, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, *args)
1202 # feed DataArray apply_variable_ufunc through apply_dataarray_vfunc
1203 elif any(isinstance(a, DataArray) for a in args):
-> 1204 return apply_dataarray_vfunc(
1205 variables_vfunc,
1206 *args,
1207 signature=signature,
1208 join=join,
...
1735 sample_weight.shape, (n_samples,)
1736 )
1737 )
ValueError: Sample weights must be 1D array or scalar
Thanks