0

I was wondering whether there is a way to access and work with dimension labels when using the apply_ufunc function of xarray and @guvectorize of numba. From my understanding, having labeled dimensions is one of the advantages of xarray but I wasn't really able to use them within the functions.

In my case, I want to store valid values to fill NaN values along the dimension time for a certain max. amount of days, so the most convenient way would be using the xarray function ffill for propagating the values forward. But I also want to use the timestamp labels of the dimension time to keep track of the original date of the value that is being propagated. I was able doing that by iterating over each cell of my 3-d array (x,y,time), but performance-wise this approach is not working for my large dataset which is why I turned to apply_ufunc.

Some dummy data:

values = np.random.rand(2,2,10)*10
values.ravel()[np.random.choice(values.size, 30, replace=False)] = np.nan
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
ds = xr.Dataset({"value_day": (["x", "y", "time"], values),},
    coords={"lon": (["x", "y"], lon),"lat": (["x", "y"], lat), "time": pd.date_range("2018-01-01", periods=10),},)

<xarray.Dataset>
Dimensions:    (x: 2, y: 2, time: 10)
Coordinates:
    lon        (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat        (x, y) float64 42.25 42.21 42.63 42.59
  * time       (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-10
Dimensions without coordinates: x, y
Data variables:
    value_day  (x, y, time) float64 nan nan 7.261 nan ... nan nan 7.411 nan

I have a working function for propagating the values, but I don't know, if possible at all, how to access the time label and return it. I commented it where I would use the time label. It would also be fine having two functions (value propagating + time propagating) and use a wrapper afterwards.

@guvectorize(
    "(float64[:], float64[:])",
    "(n) -> (n)",
    nopython=True,
)
def bookkeeping(ds_xy, out): #(ds_xy, ds_xy_time, out)
    limit=5
    #variables for first iteration
    stored_value = ds_xy[0]  
    #stored_value_time = ds_xy_time[0]
    
    ds_days_collect = []
    #ds_days_time_collect = []
    
    counter_stored_days = 0
    #iterate over each date of timeseries
    for value_current in ds_xy:    
    #for value_current,time_current in zip(ds_xy,ds_xy_time):                            
        if np.isnan(value_current) == False:
            #overwrite stored value and corresponding date if new value for pixel
            stored_value = value_current
            #stored_value_time = time_current
        else:
            #keep stored value for bookkeeping up to 3 days
            if counter_stored_days > limit:
                #use nan value of day
                stored_value = value_current
                #stored_value_time = time_current
                counter_stored_days = 0
            else:              
                counter_stored_days += 1

        ds_days_collect.append(stored_value) 
        #ds_days_time_collect.append(stored_value_time)
        
    out[:] = np.asarray(ds_days_collect) #,ds_days_time_collect
    
def bookkeeping_ufunc(ds):
    return xr.apply_ufunc(
    bookkeeping,
    ds,
    input_core_dims= [["time"]],
    output_core_dims=[["time"]]       
) 

bookkeeping_ufunc(ds)
jstew
  • 53
  • 4
  • Numba doesn’t work with pandas or xarray indices - see the numba docs: https://numba.readthedocs.io/en/stable/user/5minguide.html. So unfortunately no :/ – Michael Delgado Apr 22 '22 at 06:35
  • Alright, thank you for that clarification Michael! Would there be a way only using `apply_ufunc` with the parameter `vectorize=True`? – jstew Apr 22 '22 at 09:29
  • [`xr.map_blocks`](https://xarray.pydata.org/en/stable/generated/xarray.map_blocks.html#xarray.map_blocks) maps a function which accets a DataArray across dask chunks. So this is probably what you're looking for. You could then tune the function which is called on each block as much as you can. But this will almost certainly be slower than anything you could write with guvectorize. The convenience of labeled coordinates unfortunately still comes at a cost! – Michael Delgado Apr 22 '22 at 16:42

0 Answers0