
I have a tiny dataset like this:

<xarray.Dataset>
Dimensions:      (time: 24)
Coordinates:
  * time         (time) datetime64[ns] 2022-09-28 ... 2022-09-28T23:00:00
    spatial_ref  int64 0
Data variables:
    CO           (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
    NO2          (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
    O3           (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
    PM10         (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
    PM2.5        (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>
    SO2          (time) float32 dask.array<chunksize=(24,), meta=np.ndarray>

This dataset was obtained by applying ds.where(), ds.rio.clip(), and a final ds.mean(dim=['latitude', 'longitude']) to a large original zarr dataset hosted on an S3 server.
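Roughly, the pipeline looks like this (the bucket path, dates and clip geometry below are placeholders, not my real ones):

import numpy as np
import xarray as xr
import rioxarray  # noqa: F401  (registers the .rio accessor)

start_date_dt64 = np.datetime64("2022-09-28")   # placeholder time window
end_date_dt64 = np.datetime64("2022-09-29")

ds = xr.open_zarr("s3://my-bucket/air-quality.zarr")  # lazy, dask-backed variables

ds = ds.where((ds.time >= start_date_dt64) & (ds.time < end_date_dt64), drop=True)
ds = ds.rio.clip(area_of_interest, crs="EPSG:4326")   # area_of_interest: placeholder geometries
ds = ds.mean(dim=["latitude", "longitude"])           # still lazy: nothing downloaded yet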

Then I want to access each individual value. I see that

ds['CO'].sel(time=timeToGet).data returns at a normal speed, but

ds['CO'].sel(time=timeToGet).values and

float(ds['CO'].sel(time=timeToGet).data) both take 1 min 15 sec! Why is that?

I tried these before:

ds = ds.chunk(chunks={"time": 1})
ds = ds.chunk(chunks='auto')
ds = ds.copy(deep=True)

but no success.

The ds.where() call on the bigger dataset was slow too, and I resolved that with ds.chunk('auto'). I noticed it was slow in my dockerized app but not when tested locally on my desktop, so maybe Docker has an impact. Actually, I don't understand whether my small dataset is still on the server or already in my computer's memory.
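One way to check, as far as I understand, is to look at the type of the backing array: a dask array means it is still lazy, a numpy array means it is already in memory:

import dask.array

print(type(ds["CO"].data))                          # dask.array.core.Array -> still lazy, not yet pulled locally
print(isinstance(ds["CO"].data, dask.array.Array))  # would be False for a plain numpy array in memory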

PierreL
  • Note that `da.data` returns a dask.array, which has not done any of the work yet, whereas `da.values` returns a numpy array and requires executing all read/compute operations which the variable depends on. So it makes sense that the former is always much, much faster for a dask array – Michael Delgado Oct 13 '22 at 14:35
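A small illustration of the comment above, timing the two attribute accesses on the question's `ds` and `timeToGet` (names taken from the question):

import time

sel = ds["CO"].sel(time=timeToGet)

t0 = time.perf_counter()
lazy = sel.data                     # just returns the dask array object: no I/O, effectively instant
print(type(lazy), time.perf_counter() - t0)

t0 = time.perf_counter()
eager = sel.values                  # executes the whole read/compute graph behind the variable
print(type(eager), time.perf_counter() - t0)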

1 Answer


These variables are dask.arrays, not numpy, and therefore have not been loaded into memory. I’m not sure how you prepared this dataset, but computing the data could involve everything from loading from disk to streaming over the internet to computing a large scheduled graph.

Data this small will certainly fit into memory, so you can improve performance for repeated access by computing all variables once and then working with the local copy:

ds = ds.compute()
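For example, once ds.compute() has run, per-timestamp access is just local numpy indexing and should be effectively instant (`timeToGet` as in the question):

co_value = float(ds["CO"].sel(time=timeToGet))   # no dask graph left to execute

for t in ds.time.values:                         # iterate over all 24 hourly values
    print(t, float(ds["CO"].sel(time=t)))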

If you have read this data from disk and it was also small enough to fit comfortably into memory at the time, you can load the dataset without dask by specifying chunks=None, e.g.:

ds = xr.open_zarr(fp, chunks=None)

Also, dask works fine in containers, but it does require resources. To schedule tasks and execute them in parallel, dask spins up multiple threads or processes (depending on your configuration). If the resources granted to your container are too limited, dask may slow to a crawl because it must spill data to disk or share processors with the main thread. So if you are going to keep using dask, it would be a good idea to keep an eye on your machine's resources and check out the dask dashboard.
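If you do keep dask, one way to make the resource usage explicit and get at the dashboard is to start a distributed client yourself (a sketch; the worker count and memory limit are placeholders to be matched to your container's limits):

from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=2, memory_limit="1GB")
print(client.dashboard_link)   # open this URL to watch task progress, memory use and spilling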

Michael Delgado
  • Thank you Michael. There was indeed a resource limitation; I created a .wslconfig to raise the limits. My original dataset's shape is (time: 6864, level: 1, latitude: 400, longitude: 700). I first do a where() on "time" to keep only the last 2 days, then crop in space to keep only some cells. With ds.compute() (which I do after all of that) it now takes 48 sec, which is still long but becomes usable. – PierreL Oct 14 '22 at 09:07
  • Without ds.compute() it's much longer. Docker still downloads 480 MB of data (my total dataset shows as 38 GB in the S3 browser). Would there be a way to do the `ds.where((ds.time >= start_date_dt64) & (ds.time < end_date_dt64), drop=True)` on the server, so that I only download the last 2 days? – PierreL Oct 14 '22 at 09:07
  • Yeah for sure - definitely subset the data before pulling it into memory. Using ds.sel will have better performance than where though, e.g. `ds = ds.sel(time=((ds.time >= start_date_dt64) & (ds.time < end_date_dt64)))` – Michael Delgado Oct 14 '22 at 18:02
  • That said, the smallest amount of data you can pull is limited by the chunk size, so if the data is chunked in blocks that don't align with your slicing you may still end up downloading more than ideal. The only option there is to change the chunk layout on disk. – Michael Delgado Oct 14 '22 at 18:05
  • Thanks again. It is indeed a bit faster with sel(), and I finally gained a lot by removing the `ds.chunk('auto')`. – PierreL Oct 17 '22 at 14:01
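Putting the suggestions from these comments together, a minimal sketch of the reworked flow (same placeholder names as in the sketch in the question; subset first so only the chunks overlapping the last two days are downloaded, then compute once):

ds = xr.open_zarr("s3://my-bucket/air-quality.zarr")   # lazy

ds = ds.sel(time=((ds.time >= start_date_dt64) & (ds.time < end_date_dt64)))  # subset before any download
ds = ds.rio.clip(area_of_interest, crs="EPSG:4326")
ds = ds.mean(dim=["latitude", "longitude"])

ds = ds.compute()   # single pass: download and compute only the ~2 days of data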