0

I'm trying to fetch time series from geographical coordinates (single points) from Google ERA5 Reanalysis data. The dataset is following:

import xarray
data = xarray.open_zarr(
    'gs://gcp-public-data-arco-era5/co/single-level-reanalysis.zarr/',
    chunks={'time': 48},
    consolidated=True,
)
print("Model wind dataset size {:.1f} TiB".format(data.nbytes/(1024**4)))
print(data)

Model wind dataset size 28.0 TiB
<xarray.Dataset>
Dimensions:              (time: 374016, values: 542080)
Coordinates:
    depthBelowLandLayer  float64 ...
    entireAtmosphere     float64 ...
    latitude             (values) float64 dask.array<chunksize=(542080,), meta=np.ndarray>
    longitude            (values) float64 dask.array<chunksize=(542080,), meta=np.ndarray>
    number               int64 ...
    step                 timedelta64[ns] ...
    surface              float64 ...
  * time                 (time) datetime64[ns] 1979-01-01 ... 2021-08-31T23:0...
    valid_time           (time) datetime64[ns] dask.array<chunksize=(48,), meta=np.ndarray>
Dimensions without coordinates: values
Data variables: (12/38)
    cape                 (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    d2m                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    hcc                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    istl1                (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    istl2                (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    istl3                (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    ...                   ...
    tsn                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    u10                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    u100                 (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    v10                  (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    v100                 (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
    z                    (time, values) float32 dask.array<chunksize=(48, 542080), meta=np.ndarray>
Attributes:
    Conventions:               CF-1.7
    GRIB_centre:               ecmf
    GRIB_centreDescription:    European Centre for Medium-Range Weather Forec...
    GRIB_edition:              1
    GRIB_subCentre:            0
    history:                   2022-09-23T18:56 GRIB to CDM+CF via cfgrib-0.9...
    institution:               European Centre for Medium-Range Weather Forec...
    pangeo-forge:inputs_hash:  5f4378143e9f42402424280b63472752da3aa79179b53b...
    pangeo-forge:recipe_hash:  0c3415923e347ce9dac9dc5c6d209525f4d45d799bd25b...
    pangeo-forge:version:      0.9.1

What is the best way to interpolate a time series from single geographical point?

The methods like .sel and interp don't work:

data['cape'].interp(dict(latitude=60, longitude=20))

ValueError: Dimensions {'longitude', 'latitude'} do not exist. Expected one or more of Frozen({'values': 542080, 'time': 374016})
Pörripeikko
  • 839
  • 7
  • 6

1 Answers1

1

sel and interp won't work because the data are not located on a structured, regular grid. If you scatter plot the lon/lat coordinates, you will get :enter image description here

You have to deal with these unstructured coordinates. One way is to take the nearest neighbor value. You can do it with this kind of code for instance :

import numpy as np
import xarray as xr
from sklearn.neighbors import NearestNeighbors


class NearestInterpolator:
    def __init__(self, ds, x='longitude', y='latitude'):
        coords = np.c_[ds[x].values, ds[y].values]
        self.nn = NearestNeighbors().fit(coords)

    def interpolate(self, ds, coords, values='values'):
        index = self.nn.kneighbors(X=np.atleast_2d(coords), n_neighbors=1, return_distance=False).ravel()
        return ds.isel({values: index})


ds = xr.open_zarr("gs://gcp-public-data-arco-era5/co/single-level-reanalysis.zarr",
                      chunks={'time': 48},
                      consolidated=True)

ni = NearestInterpolator(ds=ds)
ds_interpolated = ni.interpolate(ds, [[2, 47], [5, 50]])
ds_interpolated['d2m']

>>> <xarray.DataArray 'd2m' (time: 374016, values: 2)>
>>> dask.array<getitem, shape=(374016, 2), dtype=float32, chunksize=(48, 2), chunktype=numpy.ndarray>
>>> Coordinates:
>>>     depthBelowLandLayer  float64 ...
>>>     entireAtmosphere     float64 ...
>>>     latitude             (values) float64 dask.array<chunksize=(2,), meta=np.ndarray>
>>>     longitude            (values) float64 dask.array<chunksize=(2,), meta=np.ndarray>
>>>     number               int64 ...
>>>     step                 timedelta64[ns] ...
>>>     surface              float64 ...
>>>   * time                 (time) datetime64[ns] 1979-01-01 ... 2021-08-31T23:0...
>>>     valid_time           (time) datetime64[ns] dask.array<chunksize=(48,), meta=np.ndarray>
>>> Dimensions without coordinates: values

This is not a perfect code, since it doesn't understand that near 0 longitude data and near 360 data are spatially close, but it works. If you want to go further with linear interpolation, you would have to compute a Delaunay triangulation, which can be expensive for these 542 080 coordinates.

cyril
  • 524
  • 2
  • 8