
Should chunking in xarray / dask behave the same way in the two use cases below?

(a) When opening a dataset from a netCDF file using the chunks option;

(b) When re-chunking an existing dataset using the Dataset.chunk method.

I'm interested in performance when slicing across different dimensions. In my case the performance of the two is quite different; compare (Case1) and (Case3) below:

(Case1): Open the dataset with one single chunk along the station dimension (fast when slicing one time)

In [1]: import xarray as xr

In [2]: dset = xr.open_dataset( 
    ...: "/tmp/spectra.nc", 
    ...: chunks={"station": None}
    ...: )

In [3]: dset
Out[3]: 
<xarray.Dataset>
Dimensions:       (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
  * time          (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station       (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency     (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction     (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude     (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    latitude      (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    efth          (time, station, frequency, direction) float32 dask.array<chunksize=(249, 14048, 25, 24), meta=np.ndarray>

In [4]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 171 ms, sys: 49.2 ms, total: 220 ms
Wall time: 219 ms

(Case2): Open the dataset with many size=1 chunks along the station dimension (slow when slicing one time, fast when slicing one station)

In [5]: dset = xr.open_dataset( 
    ...: "/tmp/spectra.nc", 
    ...: chunks={"station": 1}
    ...: )

In [6]: dset
Out[6]: 
<xarray.Dataset>
Dimensions:       (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
  * time          (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station       (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency     (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction     (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude     (time, station) float32 dask.array<chunksize=(249, 1), meta=np.ndarray>
    latitude      (time, station) float32 dask.array<chunksize=(249, 1), meta=np.ndarray>
    efth          (time, station, frequency, direction) float32 dask.array<chunksize=(249, 1, 25, 24), meta=np.ndarray>

In [7]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 13.1 s, sys: 1.94 s, total: 15 s
Wall time: 11.1 s

(Case3): Rechunk the Case2 dataset so that station is one single chunk (still slow when slicing one time; should it be faster?)

In [8]: dset = dset.chunk({"station": None})

In [8]: dset
Out[8]: 
<xarray.Dataset>
Dimensions:       (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
  * time          (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station       (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency     (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction     (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude     (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    latitude      (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    efth          (time, station, frequency, direction) float32 dask.array<chunksize=(249, 14048, 25, 24), meta=np.ndarray>

In [9]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 9.06 s, sys: 1.13 s, total: 10.2 s
Wall time: 7.7 s

For reference, this dataset is stored on disk as netCDF-4 with size=1 chunks along the station dimension:

$ ncdump -hs /tmp/spectra.nc
netcdf spectra {
dimensions:
        time = UNLIMITED ; // (249 currently)
        station = 14048 ;
        frequency = 25 ;
        direction = 24 ;
variables:
        double time(time) ;
                time:long_name = "julian day (UT)" ;
                time:standard_name = "time" ;
                time:units = "days since 1990-01-01 00:00:00" ;
                time:_Storage = "chunked" ;
                time:_ChunkSizes = 512 ;
                time:_DeflateLevel = 9 ;
                time:_Shuffle = "true" ;
                time:_Endianness = "little" ;
        int station(station) ;
                station:long_name = "station id" ;
                station:_FillValue = -2147483647 ;
                station:_Storage = "chunked" ;
                station:_ChunkSizes = 1 ;
                station:_DeflateLevel = 9 ;
                station:_Shuffle = "true" ;
                station:_Endianness = "little" ;
        short longitude(time, station) ;
                longitude:long_name = "longitude" ;
                longitude:standard_name = "longitude" ;
                longitude:units = "degree_east" ;
                longitude:_FillValue = 9.96921e+36f ;
                longitude:scale_factor = -0.00547824f ;
                longitude:add_offset = 180.f ;
                longitude:_Storage = "chunked" ;
                longitude:_ChunkSizes = 249, 1 ;
                longitude:_DeflateLevel = 9 ;
                longitude:_Shuffle = "true" ;
                longitude:_Endianness = "little" ;
        short latitude(time, station) ;
                latitude:long_name = "latitude" ;
                latitude:standard_name = "latitude" ;
                latitude:units = "degree_north" ;
                latitude:_FillValue = 9.96921e+36f ;
                latitude:scale_factor = -0.0006866874f ;
                latitude:add_offset = -54.f ;
                latitude:_Storage = "chunked" ;
                latitude:_ChunkSizes = 249, 1 ;
                latitude:_DeflateLevel = 9 ;
                latitude:_Shuffle = "true" ;
                latitude:_Endianness = "little" ;
        float frequency(frequency) ;
                frequency:long_name = "frequency of center band" ;
                frequency:standard_name = "sea_surface_wave_frequency" ;
                frequency:units = "s-1" ;
                frequency:scale_factor = 1.f ;
                frequency:add_offset = 0.f ;
                frequency:_FillValue = 9.96921e+36f ;
                frequency:_Storage = "chunked" ;
                frequency:_ChunkSizes = 25 ;
                frequency:_DeflateLevel = 9 ;
                frequency:_Shuffle = "true" ;
                frequency:_Endianness = "little" ;
        float direction(direction) ;
                direction:long_name = "sea surface wave to direction" ;
                direction:standard_name = "sea_surface_wave_to_direction" ;
                direction:units = "degree" ;
                direction:scale_factor = 1.f ;
                direction:add_offset = 0.f ;
                direction:_FillValue = 9.96921e+36f ;
                direction:_Storage = "chunked" ;
                direction:_ChunkSizes = 24 ;
                direction:_DeflateLevel = 9 ;
                direction:_Shuffle = "true" ;
                direction:_Endianness = "little" ;
        short efth(time, station, frequency, direction) ;
                efth:long_name = "sea surface wave directional variance spectral density" ;
                efth:standard_name = "sea_surface_wave_directional_variance_spectral_density" ;
                efth:units = "m2 s rad-1" ;
                efth:_FillValue = 9.96921e+36f ;
                efth:scale_factor = -0.004410254f ;
                efth:add_offset = 144.5064f ;
                efth:_Storage = "chunked" ;
                efth:_ChunkSizes = 249, 1, 25, 24 ;
                efth:_DeflateLevel = 9 ;
                efth:_Shuffle = "true" ;
                efth:_Endianness = "little" ;

// global attributes:
                :nco_openmp_thread_number = 1 ;
                :_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.4" ;
                :_SuperblockVersion = 2 ;
                :_IsNetcdf4 = 1 ;
                :_Format = "netCDF-4" ;
}
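
A minimal sketch of the chunk arithmetic behind the three cases, in plain Python. The dimension sizes are copied from the dataset repr and the ncdump output above; the helper names (`n_chunks`, `chunks_touched`) and the layout dictionaries are hypothetical and exist only for illustration. It counts chunks and nothing else, so it does not by itself explain the timings.

```python
from math import ceil, prod

# Dimension sizes taken from the dataset repr in the question.
SHAPE = {"time": 249, "station": 14048, "frequency": 25, "direction": 24}

# Hypothetical layout descriptions: chunk length along each dimension.
ONE_CHUNK = dict(SHAPE)                  # Case1 / Case3 dask layout
PER_STATION = {**SHAPE, "station": 1}    # Case2 dask layout and on-disk layout

LAT_DIMS = ("time", "station")
EFTH_DIMS = ("time", "station", "frequency", "direction")


def n_chunks(dims, chunk_sizes):
    """Total number of chunks for a variable with the given dimensions."""
    return prod(ceil(SHAPE[d] / chunk_sizes[d]) for d in dims)


def chunks_touched(dims, chunk_sizes, selection):
    """Number of chunks overlapping an integer selection.

    `selection` maps a dimension name to a single integer index;
    dimensions that are not selected are read in full.
    """
    return prod(
        1 if d in selection else ceil(SHAPE[d] / chunk_sizes[d])
        for d in dims
    )


if __name__ == "__main__":
    for name, layout in [("one chunk", ONE_CHUNK), ("per station", PER_STATION)]:
        print(
            name,
            "| latitude chunks:", n_chunks(LAT_DIMS, layout),
            "| touched by time=0:", chunks_touched(LAT_DIMS, layout, {"time": 0}),
            "| touched by station=0:", chunks_touched(LAT_DIMS, layout, {"station": 0}),
        )
```

Neither the chunks option nor Dataset.chunk changes the on-disk layout, so the file keeps its (249, 1) chunks in all three cases; only the dask-side layout differs. Whether that difference accounts for the Case3 timing is not established here.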
rafa
I have the same problem. Chunking when calling open_mfdataset has very different performance than rechunking with the .chunk method after calling open_mfdataset without chunking. – susopeiz Jul 16 '20 at 10:30
