I would like to initialize a very large XArray dataset (ideally as an on-disk Zarr store) for later processing; various parts (spatial subsets) of the dataset will be populated by a different script.
The naive approach below won't work, because the full array obviously doesn't fit into memory:
import numpy as np
import xarray as xr
xr_lons = xr.DataArray(np.arange(-180, 180, 0.001), dims=['x'], name='lons')
xr_lats = xr.DataArray(np.arange(90, -90, -0.001), dims=['y'], name='lats')
xr_da = xr.DataArray(0, dims=['y', 'x'], coords=[xr_lats, xr_lons])
xr_ds = xr.Dataset({"test": xr_da})
xr_ds.to_zarr("test.zarr", mode="w")
MemoryError: Unable to allocate 483. GiB for an array with shape (180000, 360000) and data type int64
What would be a good alternative?
Using plain zarr, I could do something like this:
import zarr
nrows, ncols = 180000, 360000  # same grid as above
root = zarr.open('example.zarr', mode='w')
mosaics = root.create_group('mosaics')
dsm = mosaics.create_dataset('dsm', shape=(nrows, ncols), chunks=(1024, 1024), dtype='i4')
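For context, a separate script would then fill in one spatial window of that store, e.g. like this (a sketch; the tile size and slice bounds are made up):

import numpy as np
import zarr
# reopen the existing store read/write and write a single 1024 x 1024 tile
root = zarr.open('example.zarr', mode='r+')
tile = np.ones((1024, 1024), dtype='i4')  # stand-in for real data
root['mosaics/dsm'][0:1024, 0:1024] = tile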
That dataset, however, doesn't conform to the normal XArray structure and metadata, so I'm looking for a solution directly using XArray (or Dask?).
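Something along these lines is roughly what I'm hoping for: a lazy dask array wrapped in XArray, written out with compute=False so that only the store layout, metadata and coordinates are created (a sketch, assuming compute=False skips writing the dask-backed data, which is my reading of the docs):

import dask.array as da
import numpy as np
import xarray as xr

xr_lons = xr.DataArray(np.arange(-180, 180, 0.001), dims=['x'], name='lons')
xr_lats = xr.DataArray(np.arange(90, -90, -0.001), dims=['y'], name='lats')
# lazy, chunked zeros instead of a dense in-memory array
data = da.zeros((xr_lats.size, xr_lons.size), chunks=(1024, 1024), dtype='i4')
xr_da = xr.DataArray(data, dims=['y', 'x'], coords=[xr_lats, xr_lons])
xr_ds = xr.Dataset({"test": xr_da})
# write only the metadata and coordinates; the chunk data stays unwritten
xr_ds.to_zarr("test.zarr", mode="w", compute=False)

The other scripts would then presumably write their subsets with to_zarr(..., region={'y': ..., 'x': ...}), but I'm not sure this is the recommended pattern, or whether there is a better way.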