I'm having trouble converting a zarr file to a dask array. This is what I get when I type `arr = da.from_zarr('gros.zarr/time')`: (screenshot not included)

But when I try it on a single coordinate such as time, it works: (screenshot not included)

Any ideas how to solve this?

Severus
  • do you know how it was written? can you open it with [`xarray`](https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html), as in `xr.open_zarr('gros.zarr')`? – Michael Delgado Jun 22 '22 at 07:53
  • also I think zarr arrays need to be named. there should be a group for your array inside the directory, e.g. `gros.zarr/gros`. Can you just `ls` the directory to find the name of the array? – Michael Delgado Jun 22 '22 at 07:55
  • Yes, I can open it with xarray. – Severus Jun 22 '22 at 08:23
  • Oh great! Does that answer your question then? – Michael Delgado Jun 22 '22 at 08:41
  • But I want to open it as a dask array. Xarray reads it, but then you need to convert it to a Dask DataFrame and then a Dask array, and unfortunately you lose all the metadata when doing so – Severus Jun 22 '22 at 08:55
  • Oh no not at all! da.data goes straight to a dask.Array if the array is chunked. In most cases, an xarray dataset or DataArray created from a zarr array is already a dask array, just wrapped in an xarray container. See https://docs.xarray.dev/en/stable/user-guide/dask.html and also https://examples.dask.org/xarray.html – Michael Delgado Jun 22 '22 at 09:01

2 Answers

As noted in the comments by @Michael Delgado, if xarray works, then that's probably the best option.

However, if for some reason you do want to open it with dask.array, you can select the array of interest inside the store with the `component` keyword argument:

from dask.array import from_zarr

x = from_zarr("gros.zarr", component="time")

For some reproducible examples, see this blog post.

SultanOrazbayev
  • Well, I found a solution. I opened the zarr with xarray, then I converted it to a Dask DataFrame and then a Dask array. I didn't lose any metadata either. – Severus Jun 22 '22 at 13:17
  • That's interesting, but sounds like a hack... so it might be worth exploring further to get a more robust solution. – SultanOrazbayev Jun 22 '22 at 13:19

When you read a zarr array in xarray, dask will be enabled by default, unless you specify chunks=None. You absolutely do not have to go through dask.dataframe - you can go straight from xarray.DataArray to dask.Array. In fact, there's not even a copy required - all you need to do is access the .data attribute underlying the DataArray.

Here's an example from a file I have lying around:

In [3]: import xarray as xr
   ...: import os
   ...:
   ...: fp = os.path.join(
   ...:     ROOT_DIR,
   ...:     'ScenarioMIP/INM/INM-CM5-0/ssp370/r1i1p1f1/day/tasmax/v1.1.zarr'
   ...: )
   ...: 
   ...: ds = xr.open_zarr(fp)
   ...: ds
Out[3]:
<xarray.Dataset>
Dimensions:  (lat: 720, lon: 1440, time: 31390)
Coordinates:
  * lat      (lat) float64 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
  * time     (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
Data variables:
    tasmax   (time, lat, lon) float32 dask.array<chunksize=(365, 360, 360), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                  CF-1.7 CMIP-6.2
    activity_id:                  ScenarioMIP AerChemMIP
    contact:                      climatesci@rhg.com
    creation_date:                2019-06-17T08:27:21Z
    data_specs_version:           01.00.29
    dc6_bias_correction_method:   Quantile Delta Method (QDM)
    ...                           ...
    sub_experiment_id:            none
    table_id:                     day
    tracking_id:                  hdl:21.14100/da7e759e-3979-42e4-b92f-02e7e2...
    variable_id:                  tasmax
    variant_label:                r1i1p1f1
    version_id:                   v20190618

You can think of xarray Datasets as fancy dictionaries holding DataArrays as objects. DataArrays themselves are just N-dimensional arrays with labeled indices. The data contained in a DataArray is provided by an array "backend", which is usually numpy or dask.Array. When you read in a zarr dataset, the result will be a dask.Array with a bit of extra xarray index & metadata handling on top. We can see that the values in this array are a dask array by inspecting the array preview at the top:

In [4]: ds.tasmax
Out[4]:
<xarray.DataArray 'tasmax' (time: 31390, lat: 720, lon: 1440)>
dask.array<open_dataset-51b28ad08603ab401a85808d9fa3d6d7tasmax, shape=(31390, 720, 1440), dtype=float32, chunksize=(365, 360, 360), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
  * time     (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
Attributes:
    cell_measures:  area: areacella
    cell_methods:   area: mean time: maximum (interval: 1 day)
    comment:        maximum near-surface (usually, 2 meter) air temperature (...
    coordinates:    height
    history:        2019-06-17T08:27:21Z altered by CMOR: Treated scalar dime...
    long_name:      Daily Maximum Near-Surface Air Temperature
    original_name:  tasmax
    standard_name:  air_temperature
    units:          K

Xarray is a great library which allows you to use pandas-style indexing in an N-dimensional space. But if you want to work with the dask.array directly, you can simply access the .data attribute on a dask-backed xarray DataArray:

In [5]: ds.tasmax.data
Out[5]: dask.array<open_dataset-51b28ad08603ab401a85808d9fa3d6d7tasmax, shape=(31390, 720, 1440), dtype=float32, chunksize=(365, 360, 360), chunktype=numpy.ndarray>
Michael Delgado
  • What if you have multiple data variables? How would you extract the dask arrays? – Severus Jun 28 '22 at 09:11
  • I think I've found it. You convert the Dataset to a DataArray, then use [xarray.DataArray.chunk()](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.chunk.html). Then recover the dask array with the .data attribute – Severus Jun 28 '22 at 09:42
  • Or you could just access each one’s .data attribute separately! No need to concat unless you actually want the data concatenated. – Michael Delgado Jun 28 '22 at 14:34
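
The per-variable approach from the last comment can be sketched roughly as follows. This is only an illustration: the dataset is built in memory and chunked with `.chunk()` rather than read from zarr, and the variable names are hypothetical.

```python
import dask.array as da
import numpy as np
import xarray as xr

# A small dask-backed dataset with two variables (hypothetical names).
ds = xr.Dataset(
    {
        "tasmax": (("time", "lat"), np.ones((4, 3))),
        "tasmin": (("time", "lat"), np.zeros((4, 3))),
    }
).chunk({"time": 2})

# One dask array per data variable, keyed by variable name.
arrays = {name: var.data for name, var in ds.data_vars.items()}

# Optional: combine same-shaped variables along a new leading axis.
stacked = da.stack(list(arrays.values()))

print(arrays)
print(stacked)
```

Stacking is only needed if a single combined array is actually wanted; otherwise the dictionary of separate arrays is enough.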