0

I'm struggling to find a solution to the following problem. I have 3 different xarray dataarrays (ds_pr_cru; ds_tas_cru; ds_nep), each one containing a variable. I want to perform a liear ridge regression between the three variables on a specific dimension (time), having the first two xarrays as independent variables (X1 and X2) and the third one as dependent variable (Y). Now, I've seen from here that it is possible to use xr.apply_ufunc: indeed this works if I perform both ols and ridge regression, BUT having only X1 and Y, as below.

However, I cannot find the way to implement a second predictor in the regression (but of course I need more than two, eventually). Probably I overlooked or I'm structuring the analysis in the wrong way, but it feels strange that nobody have asked how to perform a multilinear regression like this. I can only find question involving multidimensional datasets, rather than how to use multiple datasets for regressing xarrays.

I tried the following, that works:

from sklearn import linear_model

def new_regress(x, y):
    # Wrapper around scipy Ridge to use in apply_ufunc
    ridge = linear_model.Ridge()
    model = ridge.fit(x.reshape(-1, 1),y.reshape(-1, 1))
    slope = model.coef_
    return slope

# return a new DataArray
test = xr.apply_ufunc(new_regress,
                       ds_pr_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)),   # spatial subset to speed-up example
                       ds_nep.sel(lon = slice(-60,-50), lat = slice(-20,-10)),      # spatial subset to speed-up example
                       input_core_dims=[['time'], ['time']],
                       output_core_dims=[["parameter"]],
                       vectorize=True,
                       dask="parallelized",
                       output_dtypes=['float64'],
                       output_sizes={"parameter": 5},
                      )
print(test)

<xarray.DataArray (lat: 20, lon: 20, parameter: 1)>
array([[[24063.30273438],
        [22831.70703125],
        [20362.62695312],
        [12655.3125    ],
        [ 7683.64453125],
        [ 9202.81933594],
        [ 8463.17675781],
        [ 9290.00878906],
        [ 8692.11425781],
        [10670.50390625],
        [11069.19238281],
        [10303.24902344],
        [ 8438.82324219],
        [ 8534.91503906],
        [ 7817.84326172],
        [ 7811.36621094],
        [ 8132.60595703],
        [ 7732.609375  ],
        [ 8964.26367188],
        [ 8884.00488281]],
...
       [[ 3666.24145508],
        [ 4158.28515625],
        [ 4083.93017578],
...
Coordinates:
  * lon      (lon) float64 -59.75 -59.25 -58.75 -58.25 ... -51.25 -50.75 -50.25
  * lat      (lat) float64 -19.75 -19.25 -18.75 -18.25 ... -11.25 -10.75 -10.25
Dimensions without coordinates: parameter

If I try to similarly implement a multilinear regression:

from sklearn.linear_model import LinearRegression

def new_ridgereg(x1, x2, y):
    # Wrapper around scipy linregress to use in apply_ufunc
    ridge = linear_model.Ridge()
    model = ridge.fit(x1.reshape(-1, 1),x2.reshape(-1, 1), y.reshape(-1, 1))
    r2 = ridge.score(x1,x2,y)
    ypred = ridge.predict(x1,x2)
    coef = model.coef_
    pval = stats.coef_pval(model,x1,x2,y)
    return np.array([coef, pval, r2, ypred])

# return a new DataArray
test = xr.apply_ufunc(new_ridgereg, 
                       ds_pr_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)), 
                       ds_tas_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
                       ds_nep.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
                       input_core_dims=[['time'], ['time'],['time']],
                       output_core_dims=[["parameter"]],
                       vectorize=True,
                       dask="parallelized",
                       output_dtypes=['float64'],
                       output_sizes={"parameter": 4},
                      )

test

but I get the following error:

ValueError                                Traceback (most recent call last)
g:\Shared drives\FLUXNET - FLUXCOM Unive\Code\TWS_semiarid.ipynb Cell 34 in 1
     11     return np.array([coef, pval, r2, ypred])
     13 # return a new DataArray
---> 14 test = xr.apply_ufunc(new_ridgereg, 
     15                        ds_pr_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)), 
     16                        ds_tas_cru.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
     17                        ds_nep.sel(lon = slice(-60,-50), lat = slice(-20,-10)),
     18                        input_core_dims=[['time'], ['time'],['time']],
     19                        output_core_dims=[["parameter"]],
     20                        vectorize=True,
     21                        dask="parallelized",
     22                        output_dtypes=['float64'],
     23                        output_sizes={"parameter": 4},
     24                       )

File c:\Users\mastr\miniconda3\envs\geo_clim\lib\site-packages\xarray\core\computation.py:1204, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, *args)
   1202 # feed DataArray apply_variable_ufunc through apply_dataarray_vfunc
   1203 elif any(isinstance(a, DataArray) for a in args):
-> 1204     return apply_dataarray_vfunc(
   1205         variables_vfunc,
   1206         *args,
   1207         signature=signature,
   1208         join=join,
...
   1735             sample_weight.shape, (n_samples,)
   1736         )
   1737     )

ValueError: Sample weights must be 1D array or scalar

Thanks

0 Answers0