
I have large NetCDF files (~23GB) from which I extract point data and load it in as numpy arrays via xarray. In a loop, this process works perfectly - however, when I try to do it with a smaller single file (3.8GB), the code never finishes running.

I hesitate to ask this question on Stack Overflow because it's data-specific, so example code doesn't really help. For context, the file opens in xarray with dimensions (TIME, stid) - stid means station ID, a dimension that just names the lat/lon pairs (see the coordinates below) which are extracted from the global dataset.

<xarray.Dataset>
Dimensions:       (TIME: 12745, stid: 22)
Coordinates:
  * TIME          (TIME) object 1982-01-01 00:00:00 ... 2016-12-01 00:00:00
    LON           (stid) float64 5.75 5.75 -89.25 -90.25 ... 9.75 -106.2 129.2
    LAT           (stid) float64 52.25 50.25 46.25 45.75 ... 46.75 53.75 62.25
  * stid          (stid) object 'NL-Loo' 'BE-Vie' 'US-Syv' ... 'CA-Oas' 'RU-Skp'
Data variables:
    FAPAR_FILLED  (TIME, stid) float64 ...
Attributes:
    CDI:          Climate Data Interface version 1.9.10 (https://mpimet.mpg.d...
    Conventions:  CF-1.0

Code I've tried:

I've tried a loop which extracts coordinates one station (stid) at a time, and I've also tried the code below, which works perfectly fine for 1 or 2 stations but doesn't finish running with 22. Why would this run for much larger datasets and not this one? How could this be a memory problem when it's only (a single-gridcell extraction × 22) from a 0.5° resolution global dataset?

import os
import xarray as xr

data_load = os.path.join(address, 'fAPAR3g_v2_1982_2016_DAILY_del29feb_TEST.nc')
fapar_day = xr.open_dataset(data_load)
# nearest-neighbour extraction of the station lat/lon pairs
fapar_day = fapar_day.sel(LAT=crd_ix.lat, LON=crd_ix.lon, method="nearest")
fapar_day = xr.DataArray.to_masked_array(fapar_day.FAPAR_FILLED)

crd_ix is a dictionary of the lat/lon pairs you can see in the LAT, LON & stid coordinates above. The only difference between the working (not shown) and non-working (above) data is that the working data seems to load into xarray as float32.
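To put a number on "how could this be a memory problem": the extracted result itself is tiny even in float64, which is a rough sketch of the arithmetic using the dimension sizes from the dataset repr above.

```python
import numpy as np

# Sizes taken from the dataset repr in the question: (TIME: 12745, stid: 22)
n_time, n_stid = 12745, 22

# The extracted (TIME, stid) array in float64
bytes_f64 = n_time * n_stid * np.dtype("float64").itemsize
print(bytes_f64 / 1e6)  # ~2.24 MB - far smaller than the 3.8GB source file
```

So the final array is only a couple of megabytes; if memory blows up, it is more likely an intermediate step (e.g. the selection forcing a large lazy read) than the result itself.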

Ndharwood

1 Answer


I don't know if this will answer your question without a full example, but I did something similar recently and I formatted the sel line a little differently:

subset = xrdat.sel({'lat': xr.DataArray(ds['lat'], dims='z'),
                    'lon': xr.DataArray(ds['lon'], dims='z')},
                   method='nearest')

It's based on this answer (not the top one): https://stackoverflow.com/a/62784295/12133280
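To make the idea concrete, here is a self-contained sketch of that vectorised nearest-neighbour selection on a small synthetic grid (all names and sizes here are illustrative, not taken from the question's files). Passing DataArray indexers that share a dimension makes xarray do pointwise selection - one grid cell per station - instead of an outer product over all lat/lon combinations:

```python
import numpy as np
import xarray as xr

# Synthetic 0.5-degree-style global grid with dims (TIME, LAT, LON)
times = np.arange(10)
lats = np.arange(-89.75, 90, 0.5)
lons = np.arange(-179.75, 180, 0.5)
data = xr.Dataset(
    {"FAPAR_FILLED": (("TIME", "LAT", "LON"),
                      np.random.rand(len(times), len(lats), len(lons)))},
    coords={"TIME": times, "LAT": lats, "LON": lons},
)

# Station coordinates, analogous to crd_ix in the question
station_lats = [52.25, 50.25, 46.25]
station_lons = [5.75, 5.75, -89.25]

# DataArray indexers sharing the 'stid' dim -> pointwise, not outer, selection
subset = data.sel(
    LAT=xr.DataArray(station_lats, dims="stid"),
    LON=xr.DataArray(station_lons, dims="stid"),
    method="nearest",
)
print(subset.FAPAR_FILLED.shape)  # (10, 3): one column per station
```

With plain scalar or list indexers an outer selection would return a (TIME, LAT, LON) cube of 22 × 22 cells; the shared-dimension form returns exactly the (TIME, stid) shape the question expects.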

pasnik
  • Thanks - these alternate methods didn't work, unfortunately, but good idea (I've had success tinkering like this before with different versions of xarray on an HPC). I suspect it's a chunking/storage issue where the dataset is optimised for spatial access, not time series. Our HPC run finished the above in ~100 min; not great, but at least it finishes running – Ndharwood Mar 15 '22 at 18:47