
I have a zarr store that I'd like to convert to a NetCDF file. The dataset is too large to fit in memory, but my computer has 32GB of RAM, so writing it out in ~5.5GB chunks shouldn't be a problem. However, within seconds of running the script below, memory usage climbs to the available ~20GB and the script fails.

Data: Dropbox link to a zarr store containing radar rainfall data for 6/28/2014 over the United States, around 1.8GB in total.

Code:

import xarray as xr
import zarr

fpath_zarr = "out_zarr_20140628.zarr"

# Open the zarr store lazily, requesting chunks of 30 x 3500 x 7000 float64 values (~5.5GB each)
ds_from_zarr = xr.open_zarr(store=fpath_zarr, chunks={'outlat': 3500, 'outlon': 7000, 'time': 30})

# Write out to NetCDF, compressing the rainrate variable with zlib
ds_from_zarr.to_netcdf("ds_zarr_to_nc.nc", encoding={"rainrate": {"zlib": True}})

Output:

MemoryError: Unable to allocate 5.48 GiB for an array with shape (30, 3500, 7000) and data type float64
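(For reference, the 5.48 GiB in the error message is exactly the size of one requested chunk of float64 values:)

chunk_bytes = 30 * 3500 * 7000 * 8     # one requested chunk of float64 values
print(chunk_bytes / 2**30)             # ~5.48 GiB, matching the MemoryError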

Package versions:

dask      2022.7.0
xarray    2022.3.0
zarr      2.8.1
  • note that zarr data is compressed by default and the total size in memory is likely much larger than 1.8GB. Can you just try using smaller chunks and see if that resolves the problem? – Michael Delgado Aug 17 '22 at 21:42
  • I see that I could have been more clear - yes, the zarr store is 1.8 GB compressed. Uncompressed, the dataset would be 131GB, which is too large to fit into memory, hence the need for chunks. The chunks I define in the `open_zarr` call above are 5.48GB each. I experimented with smaller chunk sizes and am actually able to get it to work eventually, but I'm still confused by the memory requirement seeming to FAR exceed the chunk size. – Daniel Lassiter Aug 17 '22 at 22:49
  • oh - but how many workers are you using? – Michael Delgado Aug 17 '22 at 23:54

1 Answer


See the dask docs on Best Practices with Dask Arrays. The section on "Select a good chunk size" reads:

A common performance problem among Dask Array users is that they have chosen a chunk size that is either too small (leading to lots of overhead) or poorly aligned with their data (leading to inefficient reading).

While optimal sizes and shapes are highly problem specific, it is rare to see chunk sizes below 100 MB in size. If you are dealing with float64 data then this is around (4000, 4000) in size for a 2D array or (100, 400, 400) for a 3D array.

You want to choose a chunk size that is large in order to reduce the number of chunks that Dask has to think about (which affects overhead) but also small enough so that many of them can fit in memory at once. Dask will often have as many chunks in memory as twice the number of active threads.

I imagine the issue here is that 5.48 GB * n_workers * 2 far exceeds your available 32GB of RAM, so at any given point in time one of your workers fails to allocate its chunk, and dask then treats the whole job as failed.
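Putting rough numbers on that (the thread count below is an assumption; by default dask's threaded scheduler uses one thread per CPU core):

per_chunk_gib = 30 * 3500 * 7000 * 8 / 2**30   # ~5.48 GiB per requested chunk
n_threads = 8                                  # assumed core count; adjust for your machine
print(per_chunk_gib * n_threads * 2)           # ~88 GiB of potential demand vs. 32 GiB of RAM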

The best way to get around this is to reduce your chunk size. As the docs note, the best chunking strategy depends on your workflow, cluster setup, and hardware; that said, in my experience it's best to keep your number of tasks under 1 million and your chunk size in the ~150 MB to 1 GB range.
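For example, a minimal sketch of the same conversion with smaller chunks (the sizes below are illustrative, roughly 235 MB each, not tuned to this dataset's on-disk layout):

import xarray as xr

fpath_zarr = "out_zarr_20140628.zarr"

# 10 x 1750 x 1750 float64 values per chunk is ~235 MB, well inside the 150 MB - 1 GB guideline
ds = xr.open_zarr(store=fpath_zarr, chunks={'outlat': 1750, 'outlon': 1750, 'time': 10})

ds.to_netcdf("ds_zarr_to_nc.nc", encoding={"rainrate": {"zlib": True}})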

Michael Delgado
  • Ah, thank you for the insight, that is exactly what's going on. When I added the line `dask.config.set(scheduler='single-threaded')`, I got the expected behavior. I actually didn't even realize that dask was trying to perform the computations in parallel, since netcdfs have to be written serially (https://github.com/pydata/xarray/issues/6920). It also seems that forcing single-threaded computation is faster than reducing chunk sizes, but I'd have to run more experiments to see if that's true across the board. – Daniel Lassiter Aug 18 '22 at 15:31
  • zarrs can be read in parallel, so it's possible it would be more efficient to query multiple blocks at a time and then queue whichever comes back first for writing (the multiple-workers model), but it's also possible that speedup isn't worth it compared to the benefit of reading larger contiguous blocks. The single-threaded write constraint for netcdfs doesn't change the way dask works; it just places a 1-worker bottleneck at the end of the pipeline. Glad you figured it out! – Michael Delgado Aug 18 '22 at 15:41
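
A minimal sketch of the single-threaded workaround described in the comments above, keeping the question's original chunk sizes:

import dask
import xarray as xr

fpath_zarr = "out_zarr_20140628.zarr"

# Process one chunk at a time instead of one per thread
dask.config.set(scheduler='single-threaded')

ds = xr.open_zarr(store=fpath_zarr, chunks={'outlat': 3500, 'outlon': 7000, 'time': 30})
ds.to_netcdf("ds_zarr_to_nc.nc", encoding={"rainrate": {"zlib": True}})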