
I am trying to set up a performance test that repeatedly reads a netCDF file with different chunking configurations, in order to determine the best chunk size for a particular use case. One issue I have run into is that when I read the file with xarray.open_dataset(), even with cache=False, something is still being cached in memory. I believe this is the case based on two indicators:

  • the read is slow only the first time it is run.
  • with the RAMMap application, I can see that the opened file is still resident in memory even after the dataset is closed.

Here is the code that I ran:

import numpy as np
import xarray as xr

# open without xarray's in-memory variable caching
ds = xr.open_dataset("path/to/netcdf/file", engine='h5netcdf', cache=False)

lat_dim = 2160
lon_dim = 4320
time_dim = 46
read_chunk_size = 2160

# read a (time_dim, read_chunk_size, read_chunk_size) block into a pre-allocated array
data = np.empty((time_dim, lat_dim, lon_dim))
data[0:time_dim, 0:read_chunk_size, 0:read_chunk_size] = \
    ds['value'][0:time_dim, 0:read_chunk_size, 0:read_chunk_size]

ds.close()

It is clear that my understanding of caching in xarray is very limited. Hence, I would be really grateful if someone could explain to me how it actually works and, subsequently, how to account for it in a multi-run performance test.
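
For context, the multi-run timing loop I have in mind is sketched below. The file path, the variable name 'value', and the chunk size are placeholders taken from the snippet above; the number of runs and the timing mechanism are only illustrative.

import time

import numpy as np
import xarray as xr

path = "path/to/netcdf/file"   # placeholder path
read_chunk_size = 2160

for run in range(5):
    t0 = time.perf_counter()
    # re-open the dataset on every run so no xarray objects survive between runs
    with xr.open_dataset(path, engine='h5netcdf', cache=False) as ds:
        block = np.asarray(ds['value'][:, 0:read_chunk_size, 0:read_chunk_size])
    elapsed = time.perf_counter() - t0
    print(f"run {run}: {elapsed:.2f} s, block shape {block.shape}")

Re-opening the dataset each run discards any xarray state, but it does not flush the operating system's file cache, which I suspect is what RAMMap is showing.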

  • `xarray` source code doesn't appear to make any calls to netcdf4-python function [set_var_chunk_cache](https://unidata.github.io/netcdf4-python/#Variable.set_var_chunk_cache). Perhaps it is not possible in xarray. Opening in netCDF4 might be the best option. If you want to use xarray, perhaps spread your reads randomly across the dataset to minimize chances of being able to use the netcdf chunk cache. – Robert Davy Apr 30 '21 at 01:26
  • 1
    Did you ever solve this? – Rob Mar 10 '22 at 10:16
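
Following up on the comment about set_var_chunk_cache: dropping down to netcdf4-python directly allows the HDF5 chunk cache to be controlled per variable. Below is a minimal sketch, assuming the same file and a variable named 'value'; the cache parameters shown are illustrative values, not recommendations.

import netCDF4

nc = netCDF4.Dataset("path/to/netcdf/file", mode="r")
var = nc.variables['value']

# current per-variable chunk cache settings: (size in bytes, nelems, preemption)
print(var.get_var_chunk_cache())

# shrink the chunk cache so repeated reads are less influenced by it
# (1 MiB, 521 slots, preemption 0.75 are illustrative values)
var.set_var_chunk_cache(size=1024 * 1024, nelems=521, preemption=0.75)

data = var[0:46, 0:2160, 0:2160]
nc.close()

As far as I understand, this only affects the chunk cache inside the HDF5/netCDF libraries; it does not touch the operating system's file cache.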

0 Answers