Should chunking in xarray / dask behave similarly in the two use cases below?
(a) Opening a dataset from a netCDF file using the chunks option;
(b) Re-chunking an existing dataset using the Dataset.chunk method.
I'm interested in performance when slicing across different dimensions. In my case the performance differs considerably between the two; compare (Case1) and (Case3) below:
(Case1): Open the dataset with a single chunk along the station dimension (fast for slicing one time)
In [1]: import xarray as xr
In [2]: dset = xr.open_dataset(
...: "/tmp/spectra.nc",
...: chunks={"station": None}
...: )
In [3]: dset
Out[3]:
<xarray.Dataset>
Dimensions: (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
* time (time) datetime64[ns] 2017-01-01 ... 2017-02-01
* station (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
* frequency (frequency) float32 0.04118 0.045298003 ... 0.40561208
* direction (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
longitude (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
latitude (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
efth (time, station, frequency, direction) float32 dask.array<chunksize=(249, 14048, 25, 24), meta=np.ndarray>
In [4]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 171 ms, sys: 49.2 ms, total: 220 ms
Wall time: 219 ms
(Case2): Open the dataset with many size=1 chunks along the station dimension (slow for slicing one time, fast for slicing one station)
In [5]: dset = xr.open_dataset(
...: "/tmp/spectra.nc",
...: chunks={"station": 1}
...: )
In [6]: dset
Out[6]:
<xarray.Dataset>
Dimensions: (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
* time (time) datetime64[ns] 2017-01-01 ... 2017-02-01
* station (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
* frequency (frequency) float32 0.04118 0.045298003 ... 0.40561208
* direction (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
longitude (time, station) float32 dask.array<chunksize=(249, 1), meta=np.ndarray>
latitude (time, station) float32 dask.array<chunksize=(249, 1), meta=np.ndarray>
efth (time, station, frequency, direction) float32 dask.array<chunksize=(249, 1, 25, 24), meta=np.ndarray>
In [7]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 13.1 s, sys: 1.94 s, total: 15 s
Wall time: 11.1 s
(Case3): Try rechunking station into a single chunk (still slow to slice one time; should it be faster?)
In [8]: dset = dset.chunk({"station": None})
In [8]: dset
Out[8]:
<xarray.Dataset>
Dimensions: (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
* time (time) datetime64[ns] 2017-01-01 ... 2017-02-01
* station (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
* frequency (frequency) float32 0.04118 0.045298003 ... 0.40561208
* direction (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
longitude (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
latitude (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
efth (time, station, frequency, direction) float32 dask.array<chunksize=(249, 14048, 25, 24), meta=np.ndarray>
In [9]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 9.06 s, sys: 1.13 s, total: 10.2 s
Wall time: 7.7 s
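One way to compare the three cases is to count how many dask blocks the latitude variable is split into under each chunking. The sketch below is plain Python arithmetic on the shapes and chunk sizes printed in the reprs above; n_blocks is a hypothetical helper for illustration, not an xarray or dask function.

```python
import math


def n_blocks(shape, chunks):
    """Number of blocks when an array of `shape` is cut into `chunks`-sized pieces.

    Hypothetical helper: ceil-divide each dimension and multiply the results.
    """
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


latitude_shape = (249, 14048)  # (time, station), taken from the reprs above

case1 = n_blocks(latitude_shape, (249, 14048))  # chunks={"station": None}
case2 = n_blocks(latitude_shape, (249, 1))      # chunks={"station": 1}
case3 = n_blocks(latitude_shape, (249, 14048))  # after .chunk({"station": None})

print(case1, case2, case3)  # 1 14048 1
```

Case1 and Case3 report the same chunk layout, so the block count alone does not account for the timing difference. One possible explanation, not verified here, is that the rechunked array in Case3 is still assembled from the 14048 blocks created in Case2.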
For reference, this dataset is stored on disk as netCDF4 with size=1 chunks along the station dimension:
$ ncdump -hs /tmp/spectra.nc
netcdf spectra {
dimensions:
time = UNLIMITED ; // (249 currently)
station = 14048 ;
frequency = 25 ;
direction = 24 ;
variables:
double time(time) ;
time:long_name = "julian day (UT)" ;
time:standard_name = "time" ;
time:units = "days since 1990-01-01 00:00:00" ;
time:_Storage = "chunked" ;
time:_ChunkSizes = 512 ;
time:_DeflateLevel = 9 ;
time:_Shuffle = "true" ;
time:_Endianness = "little" ;
int station(station) ;
station:long_name = "station id" ;
station:_FillValue = -2147483647 ;
station:_Storage = "chunked" ;
station:_ChunkSizes = 1 ;
station:_DeflateLevel = 9 ;
station:_Shuffle = "true" ;
station:_Endianness = "little" ;
short longitude(time, station) ;
longitude:long_name = "longitude" ;
longitude:standard_name = "longitude" ;
longitude:units = "degree_east" ;
longitude:_FillValue = 9.96921e+36f ;
longitude:scale_factor = -0.00547824f ;
longitude:add_offset = 180.f ;
longitude:_Storage = "chunked" ;
longitude:_ChunkSizes = 249, 1 ;
longitude:_DeflateLevel = 9 ;
longitude:_Shuffle = "true" ;
longitude:_Endianness = "little" ;
short latitude(time, station) ;
latitude:long_name = "latitude" ;
latitude:standard_name = "latitude" ;
latitude:units = "degree_north" ;
latitude:_FillValue = 9.96921e+36f ;
latitude:scale_factor = -0.0006866874f ;
latitude:add_offset = -54.f ;
latitude:_Storage = "chunked" ;
latitude:_ChunkSizes = 249, 1 ;
latitude:_DeflateLevel = 9 ;
latitude:_Shuffle = "true" ;
latitude:_Endianness = "little" ;
float frequency(frequency) ;
frequency:long_name = "frequency of center band" ;
frequency:standard_name = "sea_surface_wave_frequency" ;
frequency:units = "s-1" ;
frequency:scale_factor = 1.f ;
frequency:add_offset = 0.f ;
frequency:_FillValue = 9.96921e+36f ;
frequency:_Storage = "chunked" ;
frequency:_ChunkSizes = 25 ;
frequency:_DeflateLevel = 9 ;
frequency:_Shuffle = "true" ;
frequency:_Endianness = "little" ;
float direction(direction) ;
direction:long_name = "sea surface wave to direction" ;
direction:standard_name = "sea_surface_wave_to_direction" ;
direction:units = "degree" ;
direction:scale_factor = 1.f ;
direction:add_offset = 0.f ;
direction:_FillValue = 9.96921e+36f ;
direction:_Storage = "chunked" ;
direction:_ChunkSizes = 24 ;
direction:_DeflateLevel = 9 ;
direction:_Shuffle = "true" ;
direction:_Endianness = "little" ;
short efth(time, station, frequency, direction) ;
efth:long_name = "sea surface wave directional variance spectral density" ;
efth:standard_name = "sea_surface_wave_directional_variance_spectral_density" ;
efth:units = "m2 s rad-1" ;
efth:_FillValue = 9.96921e+36f ;
efth:scale_factor = -0.004410254f ;
efth:add_offset = 144.5064f ;
efth:_Storage = "chunked" ;
efth:_ChunkSizes = 249, 1, 25, 24 ;
efth:_DeflateLevel = 9 ;
efth:_Shuffle = "true" ;
efth:_Endianness = "little" ;
// global attributes:
:nco_openmp_thread_number = 1 ;
:_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.4" ;
:_SuperblockVersion = 2 ;
:_IsNetcdf4 = 1 ;
:_Format = "netCDF-4" ;
}
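Given the _ChunkSizes reported above, a rough estimate of how much stored data each kind of slice touches can be sketched as follows. This is arithmetic only, under the assumption that a read has to decompress every stored chunk it overlaps; chunks_touched is a hypothetical helper, not part of netCDF4, xarray, or dask.

```python
import math


def chunks_touched(chunk_shape, selection):
    """Count stored chunks overlapped by `selection`.

    `selection` holds one (start, stop) index range per dimension.
    Hypothetical helper for illustration only.
    """
    count = 1
    for size, (start, stop) in zip(chunk_shape, selection):
        first = start // size          # index of first chunk overlapped
        last = (stop - 1) // size      # index of last chunk overlapped
        count *= last - first + 1
    return count


disk_chunks = (249, 1)               # latitude:_ChunkSizes from ncdump

one_time = [(0, 1), (0, 14048)]      # equivalent of isel(time=0)
one_station = [(0, 249), (0, 1)]     # equivalent of isel(station=0)

n_time = chunks_touched(disk_chunks, one_time)
n_station = chunks_touched(disk_chunks, one_station)
values_per_chunk = math.prod(disk_chunks)

print(n_time, n_time * values_per_chunk)        # 14048 3497952
print(n_station, n_station * values_per_chunk)  # 1 249
```

Under that assumption, slicing one time step overlaps all 14048 stored chunks of latitude (about 3.5 million values to return 14048), while slicing one station overlaps a single chunk.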