
I'm trying to upload an xarray dataset to GCP using the function ds.to_zarr(store=store), and it works perfectly. However, I would like to show the progress for big datasets. Is there any way to chunk my dataset so that I can use tqdm or something similar to log the upload progress?

Here is the code that I currently have:

import os

import xarray as xr
import numpy as np
import gcsfs
from dask.diagnostics import ProgressBar

if __name__ == '__main__':
    # for testing
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account.json"

    # create xarray
    data_arr = np.random.rand(5000, 100, 100)
    data_xarr = xr.DataArray(data_arr,
                             dims=["x", "y", "z"])

    # define store
    gcp_blob_uri = "gs://gprlib/test.zarr"
    gcs = gcsfs.GCSFileSystem()
    store = gcs.get_mapper(gcp_blob_uri)

    # delayed to_zarr computation -> this does not seem to work
    write_job = data_xarr\
        .to_dataset(name="data")\
        .to_zarr(store, mode="w", compute=False)

    print(write_job)
Henry Ruiz
  • Welcome to Stack Overflow! Thanks for the question. As a tip - in general, please try to make your code more generally applicable. It's best to create a totally new example for the question - ideally a [mre]. In this case, since we don't have the rest of your code, references to `self` and the file path/GCP management are confusing to your central question. It would be best to strip these out and just ask about the write. – Michael Delgado Feb 16 '23 at 21:08

1 Answer


`xarray.Dataset.to_zarr` has an optional argument `compute`, which is `True` by default:

compute (bool, optional) – If True write array data immediately, otherwise return a dask.delayed.Delayed object that can be computed to write array data later. Metadata is always updated eagerly.

Using this, you can track progress with dask's own `dask.distributed.progress` bar:

from dask import distributed

# assumes a dask.distributed Client is active, so persist() schedules the
# write as futures that the progress bar can track
write_job = ds.to_zarr(store, compute=False)
write_job = write_job.persist()

# this will return an interactive (non-blocking) widget if in a notebook
# environment. To force the widget to block, provide notebook=False.
distributed.progress(write_job, notebook=False)
[##############                          ] | 35% Completed |  4.5s

Note that for this to work, the dataset must consist of chunked dask arrays. If the data is in memory, you could use a single chunk per array with ds.chunk().to_zarr.
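Putting this together with the example from the question, here is a minimal end-to-end sketch. Assumptions not in the original: a local dask.distributed Client, and an arbitrary chunk size of 1000 along x (the gs://gprlib/test.zarr store is the one from the question).

import numpy as np
import xarray as xr
import gcsfs
from dask.distributed import Client, progress

if __name__ == "__main__":
    # start a local cluster; progress() tracks the futures created by persist()
    client = Client()

    # build the array and chunk it so the write is split into several dask tasks
    data_xarr = xr.DataArray(
        np.random.rand(5000, 100, 100),
        dims=["x", "y", "z"],
    ).chunk({"x": 1000})

    # same store as in the question
    gcs = gcsfs.GCSFileSystem()
    store = gcs.get_mapper("gs://gprlib/test.zarr")

    # schedule the write lazily, then kick it off in the background
    write_job = (
        data_xarr
        .to_dataset(name="data")
        .to_zarr(store, mode="w", compute=False)
        .persist()
    )

    # blocking text progress bar; omit notebook=False to get the widget in a notebook
    progress(write_job, notebook=False)

Since the question already imports dask.diagnostics.ProgressBar, another option that needs no distributed cluster is to skip persist() and run the blocking compute inside that context manager, e.g. `with ProgressBar(): write_job.compute()`.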

Michael Delgado
  • Thanks, @Michael Delgado, for your suggestions and reply. I appreciate it. I have updated the code in the question so it can be easily reproduced. After testing the `compute` option in the `to_zarr` function, it seems that it is ignored when the store is a `gcsfs.GCSFileSystem` path, since the function executes even when `compute` is set to `False`. This is the output when I print the delayed task -> `Delayed('_finalize_store-47cbbaef-68e6-4303-85ec-71e676a2a46a')`. Also, the GCP blob has been created. – Henry Ruiz Feb 17 '23 at 02:20
  • I noticed that my xarray needs to be initialized as a dask array. Now it is working! `data_arr = np.random.rand(5000, 100, 100); dask_arr = da.from_array(data_arr, chunks=(1000, 100, 100)); data_xarr = xr.DataArray(dask_arr, dims=["x", "y", "z"])` – Henry Ruiz Feb 17 '23 at 03:34
  • Oh! Yes that’s true - thanks for pointing that out - I’ll add that to my answer. – Michael Delgado Feb 17 '23 at 04:00