I have large NetCDF files (~23 GB) from which I extract point data and load it as NumPy arrays via xarray. In a loop, this process works perfectly; however, when I try to do the same with a smaller single file (3.8 GB), the code never finishes running.
I hesitate to ask this question on Stack Overflow because it is data-specific, so example code doesn't really help. However, the file opens in xarray with a (TIME, stid) structure; stid is the station ID, a dimension that names the lat/lon pairs (see the coordinates below) extracted from the global dataset.
<xarray.Dataset>
Dimensions: (TIME: 12745, stid: 22)
Coordinates:
* TIME (TIME) object 1982-01-01 00:00:00 ... 2016-12-01 00:00:00
LON (stid) float64 5.75 5.75 -89.25 -90.25 ... 9.75 -106.2 129.2
LAT (stid) float64 52.25 50.25 46.25 45.75 ... 46.75 53.75 62.25
* stid (stid) object 'NL-Loo' 'BE-Vie' 'US-Syv' ... 'CA-Oas' 'RU-Skp'
Data variables:
FAPAR_FILLED (TIME, stid) float64 ...
Attributes:
CDI: Climate Data Interface version 1.9.10 (https://mpimet.mpg.d...
Conventions: CF-1.0
Code I've tried:
I've tried a loop that extracts coordinates one station (stid) at a time, and I've also tried the code below, which works perfectly fine for 1 or 2 stations but never finishes with all 22. Why would this run for much larger datasets and not this one? How could it be a memory problem when it amounts to 22 single-gridcell extractions from a 0.5° resolution global dataset?
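To make the memory question concrete, here is some back-of-the-envelope arithmetic. The 720 × 360 grid size is an assumption based on the stated 0.5° resolution (the global file's dimensions aren't shown above):

```python
# Rough memory arithmetic for the shapes involved.
# Grid size (720 x 360) is assumed from the 0.5-degree resolution.
ntime = 12745            # TIME length from the dataset printout
nstid = 22               # number of stations
nlat, nlon = 360, 720    # assumed 0.5-degree global grid

full_gb_f64 = ntime * nlat * nlon * 8 / 1e9   # whole grid as float64
full_gb_f32 = ntime * nlat * nlon * 4 / 1e9   # whole grid as float32
points_mb = ntime * nstid * 8 / 1e6           # just the 22 extracted points

print(f"full grid, float64: ~{full_gb_f64:.1f} GB")          # ~26 GB
print(f"full grid, float32: ~{full_gb_f32:.1f} GB")          # ~13 GB
print(f"22 extracted points, float64: ~{points_mb:.1f} MB")  # ~2.2 MB
```

So the extracted points themselves are tiny, but if any step forces the full float64 grid into memory at once, that is roughly double the float32 case, which could be consistent with the float32 file working and this one not.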
import os
import xarray as xr

data_load = os.path.join(address, 'fAPAR3g_v2_1982_2016_DAILY_del29feb_TEST.nc')
fapar_day = xr.open_dataset(data_load)
# nearest-neighbour selection of the 22 station gridcells
fapar_day = fapar_day.sel(LAT=crd_ix.lat, LON=crd_ix.lon, method="nearest")
# pull the selected variable out as a masked numpy array
fapar_day = fapar_day.FAPAR_FILLED.to_masked_array()
crd_ix is a dictionary of the lat/lon pairs you can see in the LAT, LON and stid coordinates above. The only difference between the working (not shown) and non-working (above) data is that the working file seems to load into xarray as float32.
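For reference, the per-station loop I mentioned looks roughly like the following. This is a minimal sketch with a synthetic stand-in dataset and hypothetical station coordinates, since the real file and crd_ix are specific to my setup:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the real NetCDF: a tiny "global" 0.5-degree grid
# with LAT/LON/TIME dims (TIME shortened to 10 steps for the sketch).
lats = np.arange(-89.75, 90, 0.5)
lons = np.arange(-179.75, 180, 0.5)
times = np.arange(10)
data = np.random.rand(len(times), len(lats), len(lons))
ds = xr.Dataset(
    {"FAPAR_FILLED": (("TIME", "LAT", "LON"), data)},
    coords={"TIME": times, "LAT": lats, "LON": lons},
)

# Per-station extraction loop (hypothetical station coordinates):
crd_pairs = {"NL-Loo": (52.25, 5.75), "BE-Vie": (50.25, 5.75)}
series = {}
for stid, (lat, lon) in crd_pairs.items():
    # Nearest-neighbour selection of one gridcell's full time series.
    pt = ds.FAPAR_FILLED.sel(LAT=lat, LON=lon, method="nearest")
    series[stid] = pt.values  # 1-D array of length len(times)
```

xarray also supports doing this in a single call via pointwise (vectorized) indexing, by passing DataArray indexers that share a "stid" dimension to .sel, which is what produces the (TIME, stid) structure shown above.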