
I have a very big NetCDF file.

I tried to use the dask.array feature of the Python xarray module and specified the chunk size when I opened the data. That worked fine; however, loading the variables into memory with .load() was very slow.

I wonder whether there is any option (in xarray or another Python module) to read in a subset of a NetCDF file by providing indices along the dimensions (lat, lon)? That way I could apply functions directly to the subset without going through dask.array.

Tong Qiu
  • Can you please provide a full example of code that runs more slowly than you expect, and a bit more detail about what the netCDF file looks like and how slow it is to load? – shoyer Apr 25 '18 at 18:46
  • Of course. Here is the example: the climate data is around 3 TB (I did not know where to upload such big files), and .load() is slow: climate_dir = 'NCEP_data/resample_data/'; tmax = xr.open_mfdataset(climate_dir + '*tmax*.nc', chunks={'time': jobid, 'lat': 36, 'lon': 72}); tmax_pos = tmax.sel(lat=39.9042, lon=116.4074, method='nearest'); tmax_pos_in_memory = tmax_pos.load() – Tong Qiu Apr 25 '18 at 19:04
  • The data is resampled (0.05-degree) CRU-NCEP climate data (3600 × 7200 grid) at a daily temporal scale from 1992 to 2014. It takes half an hour to load the example data into memory. – Tong Qiu Apr 25 '18 at 21:26

2 Answers


You can slice the data before loading the variable into memory.

ds = xr.open_dataset('path/to/file')
in_memory = ds.isel(x=slice(10, 1000)).load()
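Applied to the question's lat/lon setup, the same pattern looks like this. This is a sketch using a small synthetic Dataset in place of the real file (the dimension sizes and slice bounds are illustrative); with a real file you would start from xr.open_dataset, and the selection stays lazy until .load() is called:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the real file; with real data you would use
# ds = xr.open_dataset('path/to/file.nc'), which opens lazily.
ds = xr.Dataset(
    {'tmax': (('time', 'lat', 'lon'), np.random.rand(10, 100, 200))},
    coords={'lat': np.linspace(-90, 90, 100),
            'lon': np.linspace(-180, 180, 200)},
)

# Slice by integer index first, then load only that subset into memory
subset = ds.isel(lat=slice(40, 60), lon=slice(80, 120)).load()
print(subset['tmax'].shape)  # (10, 20, 40)
```

Because the slicing happens before .load(), only the selected lat/lon window is ever read from disk.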
Keisuke FUJII

This issue sounds similar to those discussed in https://github.com/pydata/xarray/issues/1396, but if you're using recent versions of dask that problem should be resolved.

You can potentially improve performance by avoiding explicit chunking until after indexing, e.g., just

tmax = xr.open_mfdataset(terra_climate_dir + '*tmax*.nc')
tmax_pos = tmax.sel(lat=39.9042,lon=116.4074,method='nearest').compute()

If this doesn't help, the issue may be related to your source data. For example, reads can be slow if the data is accessed over a network-mounted drive, or if it is stored in netCDF4 files with in-file chunking/compression (which requires reading full chunks into memory even for small queries).

shoyer