I am trying to set up a performance test that repeatedly reads a netCDF file under different chunking configurations, in order to determine the best chunk size for a particular use case. One issue I have encountered is that xarray.open_dataset() still seems to keep the file cached in memory even when cache is set to False. I believe this is happening based on two indicators:
- the read is always slow the first time it runs and noticeably faster afterwards (see the timing sketch after this list);
- in the RamMap application, I can see that the opened file is still held in memory even after the dataset is closed.
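To make the first indicator concrete, here is a minimal timing sketch of what I observe (the path and the 'value' variable name stand in for my actual file):

import time
import xarray as xr

def timed_read(path):
    # Open the file, read a slab of 'value' with cache=False, close,
    # and return the elapsed wall-clock time.
    start = time.perf_counter()
    with xr.open_dataset(path, engine='h5netcdf', cache=False) as ds:
        _ = ds['value'][0:46, 0:2160, 0:2160].values
    return time.perf_counter() - start

path = "path/to/netcdf/file"
print("first read: ", timed_read(path))
print("second read:", timed_read(path))  # consistently much faster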
Here is the code that I ran:
import numpy as np
import xarray as xr

ds = xr.open_dataset("path/to/netcdf/file", engine='h5netcdf', cache=False)
# Dimensions of the 'value' variable in the file
lat_dim = 2160
lon_dim = 4320
time_dim = 46
read_chunk_size = 2160  # size of the lat/lon window to read
# Preallocate the full-size array, then read a slab of 'value' into it
data = np.empty((time_dim, lat_dim, lon_dim))
data[0:time_dim, 0:read_chunk_size, 0:read_chunk_size] = \
    ds['value'][0:time_dim, 0:read_chunk_size, 0:read_chunk_size]
ds.close()
Clearly my understanding of caching in xarray is very limited, so I would be grateful if someone could explain how it actually works and, subsequently, how to account for it in a multi-run performance test.
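For reference, the multi-run test I am aiming for looks roughly like the sketch below. flush_file_cache() is a placeholder for whatever would reliably evict the file from memory between runs (the equivalent of RamMap's manual "Empty Standby List"), which is exactly the part I am unsure about, and the window sizes are just example values:

import time
import xarray as xr

def flush_file_cache():
    # Placeholder: I do not know how to do this programmatically;
    # without it, every run after the first reads from memory.
    pass

path = "path/to/netcdf/file"
time_dim = 46
for read_chunk_size in (540, 1080, 2160):  # example window sizes
    flush_file_cache()
    start = time.perf_counter()
    with xr.open_dataset(path, engine='h5netcdf', cache=False) as ds:
        data = ds['value'][0:time_dim, 0:read_chunk_size, 0:read_chunk_size].values
    print(read_chunk_size, time.perf_counter() - start)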