
I need to apply a function along the time dimension of a dask-backed xarray DataArray with this shape:

<xarray.DataArray 'tasmax' (time: 14395, lat: 1801, lon: 3600)>
dask.array<rechunk-merge, shape=(14395, 1801, 3600), dtype=float32, chunksize=(14395, 1801, 600), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 90.0 89.9 89.8 89.7 89.6 ... -89.7 -89.8 -89.9 -90.0
  * time     (time) datetime64[ns] 1981-01-01 1981-01-02 ... 2020-05-31
  * lon      (lon) float64 -180.0 -179.9 -179.8 -179.7 ... 179.7 179.8 179.9

The output of the process will be a much smaller array of shape (time=365, lat=1801, lon=3600), but the input array, as you can see above, is around 360 GB in memory. My machine has 16 CPU cores and 126 GB of RAM. I am trying to optimise the process with `apply_ufunc` and the `dask='parallelized'` argument, but it leads to a memory error: all 126 GB of RAM get used. I could avoid parallelization, but then the process would take ages. Is there any way to control the memory usage of `apply_ufunc` so that the process stays within my available RAM?
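
For reference, my call looks roughly like the sketch below. `doy_reduce` is only a stand-in for my real function (which collapses the full time series to 365 day-of-year values per pixel); on older xarray versions `output_sizes` is passed directly to `apply_ufunc` instead of via `dask_gufunc_kwargs`:

    import numpy as np
    import xarray as xr

    def doy_reduce(arr):
        # stand-in for the real function: reduce the trailing time axis
        # (length 14395) to 365 day-of-year values per pixel
        return np.zeros(arr.shape[:-1] + (365,), dtype=arr.dtype)

    result = xr.apply_ufunc(
        doy_reduce,
        tasmax,                           # the ~360 GB dask-backed DataArray
        input_core_dims=[["time"]],       # core dim must be a single chunk
        output_core_dims=[["dayofyear"]],
        dask="parallelized",
        output_dtypes=[tasmax.dtype],
        dask_gufunc_kwargs={"output_sizes": {"dayofyear": 365}},
    )
    result.compute()                      # memory blows up here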

Monobakht
  • Have you tried changing `chunksize`? (14395, 1801, 600) seems a bit too large: each chunk is about 60 GB, and during computation you can still exceed your 126 GB of RAM (depending on the function you apply). For example, you could use `DataArray.chunk({'time': 14395, 'lat': 180, 'lon': 360})` before calling `apply_ufunc`. http://xarray.pydata.org/en/stable/generated/xarray.DataArray.chunk.html – hyperpiano Oct 31 '20 at 13:07
  • Yes, I have tried different chunk sizes, small and big, on both lat and lon. It still exceeds my memory. :( – Monobakht Oct 31 '20 at 14:36
  • `chunksize=(14395, 1801, 600)` is definitely too big, as the dask workers will use multiple chunks at a time. As a rule of thumb, chunks should be ~100 MB in size. Also, look at the dask dashboard to see what's going on. – Val Nov 11 '20 at 17:34

1 Answer


The chunksize you're using is too big. As Val mentioned in the comments, ~100 MB is the recommended size for each chunk.

You can change the lat chunk size from 1801 to something between 1 and 4 to keep memory under control; with lat=4, each chunk holds 14395 × 4 × 600 float32 values, about 138 MB:

chunksize=(14395, 4, 600)
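
In xarray terms, that would look something like the sketch below (assuming the DataArray is named `tasmax` as in the question; `'time': -1` means "one chunk along time", which `apply_ufunc` requires for a core dimension):

    # One chunk along time (required because time is the core dim),
    # small chunks along lat: 14395 * 4 * 600 * 4 bytes ≈ 138 MB each.
    tasmax = tasmax.chunk({"time": -1, "lat": 4, "lon": 600})

With chunks of this size, each worker only needs a few hundred MB at a time (plus whatever your function allocates internally), so 16 parallel workers should stay well within 126 GB.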

pavithraes