
I have a very big NetCDF file.

I tried to use the dask.array feature of the Python xarray module and specified the chunk size when I opened the data. That worked fine; however, loading the variables into memory with .load() was very slow.

I wonder whether there is any option (in xarray or another Python module) to read in a subset of a NetCDF file by providing indices along the dimensions (lat, lon)? That way I could apply functions directly to the subset without going through dask.array.

Tong Qiu
  • Can you please provide a full example of code that runs more slowly than you expect, and a bit more detail about what the netCDF file looks like and how slow it is to load? – shoyer Apr 25 '18 at 18:46
  • Of course. Here is the example: the climate data is around 3 TB (I did not know where to upload such big files), and .load() is slow: climate_dir = 'NCEP_data/resample_data/'; tmax = xr.open_mfdataset(climate_dir + '*tmax*.nc', chunks={'time': jobid, 'lat': 36, 'lon': 72}); tmax_pos = tmax.sel(lat=39.9042, lon=116.4074, method='nearest'); tmax_pos_in_memory = tmax_pos.load() – Tong Qiu Apr 25 '18 at 19:04
  • The data is resampled (0.05-degree) CRU-NCEP climate data (3600 × 7200 grid) at a daily temporal scale from 1992 to 2014. It takes half an hour to load the example data into memory. – Tong Qiu Apr 25 '18 at 21:26

2 Answers


You can slice the data before loading the variable into memory.

ds = xr.open_dataset('path/to/file')
in_memory = ds.isel(x=slice(10, 1000)).load()
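Applied to the question's lat/lon setup, the same pattern looks like this. This is a sketch using a small synthetic Dataset in place of the real file (the dimension sizes and slice bounds are illustrative); with a real file you would start from xr.open_dataset, and the selection stays lazy until .load() is called:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the real file; with real data you would use
# ds = xr.open_dataset('path/to/file.nc'), which opens lazily.
ds = xr.Dataset(
    {'tmax': (('time', 'lat', 'lon'), np.random.rand(10, 100, 200))},
    coords={'lat': np.linspace(-90, 90, 100),
            'lon': np.linspace(-180, 180, 200)},
)

# Slice by integer index first, then load only that subset into memory
subset = ds.isel(lat=slice(40, 60), lon=slice(80, 120)).load()
print(subset['tmax'].shape)  # (10, 20, 40)
```

Because the slicing happens before .load(), only the selected lat/lon window is ever read from disk.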
Keisuke FUJII

This issue sounds similar to those discussed in https://github.com/pydata/xarray/issues/1396, but if you're using recent versions of dask that problem should be resolved.

You can potentially improve performance by avoiding explicit chunking until after indexing, e.g., just

tmax = xr.open_mfdataset(terra_climate_dir + '*tmax*.nc')
tmax_pos = tmax.sel(lat=39.9042,lon=116.4074,method='nearest').compute()

If this doesn't help, the issue may be related to your source data. For example, reads can be slow if the data is accessed over a network-mounted drive, or if it is stored in netCDF4 files with in-file chunking/compression (which requires reading full chunks into memory even for small queries).

shoyer