I have a tiny dataset like this:
<xarray.Dataset>
Dimensions: (time: 24)
Coordinates:
* time (time) datetime64[ns] 2022-09-28 ... 2022-09-28T23:00:00
spatial_ref int64 0
Data variables:
CO (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
NO2 (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
O3 (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
PM10 (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
PM2.5 (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
SO2 (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
This dataset is obtained after some ds.where(), ds.rio.clip() and a final ds.mean(dim=['latitude', 'longitude']) applied to an original large Zarr dataset hosted on an S3 server.
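For context, the pipeline that produces this small dataset looks roughly like the sketch below. The bucket URL, the mask condition, the clipping geometry and the one-day time selection are placeholders, not my exact code:

import fsspec
import geopandas as gpd
import rioxarray  # noqa: F401  (registers the .rio accessor)
import xarray as xr

# Open the large Zarr store on S3 lazily (placeholder bucket/key)
store = fsspec.get_mapper("s3://my-bucket/air-quality.zarr", anon=True)
ds = xr.open_zarr(store, consolidated=True)

# Mask invalid values, clip to a region of interest, keep one day,
# then average over the spatial dimensions
region = gpd.read_file("region.geojson")   # placeholder geometry
ds = ds.where(ds["PM10"] >= 0)             # placeholder condition
ds = ds.rio.write_crs("EPSG:4326")
ds = ds.rio.clip(region.geometry, region.crs, drop=True)
ds = ds.sel(time="2022-09-28")
ds = ds.mean(dim=["latitude", "longitude"])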
Then I want to access each individual value. I see that
ds['CO'].sel(time=timeToGet).data
returns at normal speed, but
ds['CO'].sel(time=timeToGet).values
and
float(ds['CO'].sel(time=timeToGet).data)
both take 1min15sec!
Why is that?
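To show the difference, here is roughly how I time it (timeToGet below is just a placeholder for one of the 24 timestamps):

import time

timeToGet = "2022-09-28T12:00:00"   # placeholder timestamp

t0 = time.perf_counter()
lazy = ds['CO'].sel(time=timeToGet).data     # .data returns the dask array object itself
print(type(lazy), time.perf_counter() - t0)  # effectively instant

t0 = time.perf_counter()
val = ds['CO'].sel(time=timeToGet).values    # .values converts to a numpy array
print(val, time.perf_counter() - t0)         # this is the call that takes ~1min15s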
I tried these before:
ds = ds.chunk(chunks={"time": 1})
ds = ds.chunk(chunks='auto')
ds = ds.copy(deep=True)
but with no success.
The ds.where() call on the bigger dataset was slow too, and I resolved that with ds.chunk('auto'). I realized it was slow in my dockerized app but not when tested locally on my desktop, so maybe Docker has an impact. Actually, I don't understand: is my small dataset still on the server, or already in my computer's memory?
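In case it matters, this is how I've been trying to check where the data actually lives (just a diagnostic sketch, inspecting the type of the underlying array):

import dask.array as da
import numpy as np

arr = ds['CO'].data
print(type(arr))                    # prints a dask Array class here, i.e. still lazy
print(isinstance(arr, da.Array))    # True -> values not materialised in memory yet
print(isinstance(arr, np.ndarray))  # True would mean the values are already in RAM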