How to collocate large datasets most efficiently, comparing time, latitude (x), and longitude (y)

Question

I would like help collocating two datasets efficiently. One contains observations (say, rainfall) indexed by datetime, latitude and longitude. The other is gridded meteorological data (e.g. reanalysis), also indexed by datetime, latitude and longitude. Below I build an example random DataFrame and xarray Dataset and then collocate them.

from numpy.random import rand
from random import randint
from datetime import datetime, timedelta
import xarray as xr
import numpy as np
import pandas as pd

#create example data of the dataframe we want to collocate with the meteorological data

datetimes = pd.date_range(start='2002-01-01 10:00:00', end='2002-01-05 10:00:00', freq='H')
rainfall = rand(len(datetimes))
latitudes = [randint(0, 90) for p in range(0, len(datetimes))]
longitudes = [randint(0, 180) for p in range(0, len(datetimes))]
df_obs = pd.DataFrame({'datetime':datetimes, 'rainfall':rainfall, 'latitude':latitudes,
                       'longitude':longitudes})

#create an xarray which is the example met data

met_type = np.ones((720, 1440))
rainfall = rand(len(datetimes))
met_list = [x*met_type for x in rainfall]

def produce_xarray(met_list, datetimes, met_type='rain', datetime_var="datetime"):
    # parse string dates (format YYYYMM); pass datetime objects through unchanged
    if isinstance(datetimes[0], datetime):
        dates = datetimes
    else:
        dates = [datetime.strptime(x, '%Y%m') for x in datetimes]
    met_list_dstack = np.dstack(met_list)
    lats = np.arange(90, -90, -0.25)
    lons = np.arange(-180,180, 0.25)
    ds = xr.Dataset(data_vars={met_type:(["latitude","longitude",datetime_var], met_list_dstack),}, 
                    coords={"latitude": lats, "longitude": lons, datetime_var: dates})
    ds[met_type].attrs["units"] = "g "+str(met_type)+"m$^{-2}$"
    return ds

xr_met = produce_xarray(met_list, datetimes, datetime_var="datetime")

#now I wish to collocate the data as quickly as possible, as my datasets are huge - 
#here I have a function which finds the closest value using the datetime, latitude and longitude 
#then I apply this function to the df of my random observations

var ='rain'

def find_value_lat_lon(lat, lon, traj_datetime):
    array = xr_met[var].sel(latitude=lat, longitude=lon, datetime=traj_datetime, method='nearest').squeeze()
    value = array.values
    return value

def append_var_columnwise(df, var_name):
    df = df.copy()
    df.loc[:, var_name] = df[['latitude', 'longitude', 'datetime']].apply(lambda x: find_value_lat_lon(*x), 
                                                                                      axis=1)
    return df

print(df_obs)

print(xr_met)

df_obs = append_var_columnwise(df_obs, var_name='rain_met')

print(df_obs)

The final output (shown in the picture) is the df with an additional column, 'rain_met'. For 97 data points this takes 212 ms.

score 1 · Answer 1 · answered Jan 04 '23 at 17:58


I don't know whether it is any faster, but .sel supports vectorized indexing (see https://docs.xarray.dev/en/stable/user-guide/indexing.html#vectorized-indexing ; the last example in that section is a 2D version of your code):

df.loc[:, var_name] = xr_met[var].sel(
    latitude=xr.DataArray(df['latitude']),
    longitude=xr.DataArray(df['longitude']),
    datetime=xr.DataArray(df['datetime']),
    method='nearest')
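
For reference, here is a minimal self-contained sketch of the same idea on a small synthetic grid. The grid sizes, the random values and the dimension name "obs" are assumptions made for illustration, not part of the question's data. The point it tries to show is that when all three indexers share one dimension name, .sel matches them pointwise (one value per observation) instead of building the full outer product.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)

# Small synthetic grid (hypothetical sizes, chosen only to keep the example quick).
times = pd.date_range("2002-01-01 10:00", periods=6, freq="h")
lats = np.arange(90, -90, -10.0)
lons = np.arange(-180, 180, 10.0)
met = xr.DataArray(
    rng.random((lats.size, lons.size, times.size)),
    coords={"latitude": lats, "longitude": lons, "datetime": times},
    dims=["latitude", "longitude", "datetime"],
    name="rain",
)

# Observations: one (datetime, latitude, longitude) triple per row.
df = pd.DataFrame({
    "datetime": times,
    "latitude": rng.uniform(0, 89, times.size),
    "longitude": rng.uniform(0, 169, times.size),
})

# Indexers sharing the (assumed) dimension name "obs" are matched pointwise,
# so the result has one value per observation row.
pts = met.sel(
    latitude=xr.DataArray(df["latitude"].to_numpy(), dims="obs"),
    longitude=xr.DataArray(df["longitude"].to_numpy(), dims="obs"),
    datetime=xr.DataArray(df["datetime"].to_numpy(), dims="obs"),
    method="nearest",
)
df["rain_met"] = pts.values

# Row-by-row lookup, as in the question, kept only to check the result.
rowwise = np.array([
    met.sel(latitude=la, longitude=lo, datetime=t, method="nearest").item()
    for la, lo, t in zip(df["latitude"], df["longitude"], df["datetime"])
])
```

Passing plain NumPy arrays with an explicit dims argument avoids relying on the default dimension name and on the DataFrame's index, and taking .values before assigning keeps the new column a plain float column.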

answered Jan 04 '23 at 17:58

Peter


Thanks @peter, I'll try that for a slice of the data and see which one performs quickest – Dominic Jan 04 '23 at 21:10
For 1000 data points it was 4.01s for yours and 6.64s for mine. For 10000, 4.21s for yours and 31.8s for mine! Crazy improvement. Thank you – Dominic Jan 04 '23 at 21:16
