I need to apply a function along the time dimension of an xarray dask array of this shape:
<xarray.DataArray 'tasmax' (time: 14395, lat: 1801, lon: 3600)>
dask.array<rechunk-merge, shape=(14395, 1801, 3600), dtype=float32, chunksize=(14395, 1801, 600), chunktype=numpy.ndarray>
Coordinates:
* lat (lat) float64 90.0 89.9 89.8 89.7 89.6 ... -89.7 -89.8 -89.9 -90.0
* time (time) datetime64[ns] 1981-01-01 1981-01-02 ... 2020-05-31
* lon (lon) float64 -180.0 -179.9 -179.8 -179.7 ... 179.7 179.8 179.9
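If I am reading the chunking right, each block spans the entire time dimension, so a single float32 chunk is already enormous on its own:

# back-of-the-envelope size of one float32 chunk of shape (14395, 1801, 600)
14395 * 1801 * 600 * 4 / 1e9  # ≈ 62 GB per chunk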
The output of the process will be a much smaller array with sizes (time=365, lat=1801, lon=3600), but the input array, as you can see above, is around 360 GB in memory. I have a machine with 16 CPU cores and 126 GB RAM. I am trying to optimise the process by using apply_ufunc with the dask='parallelized' argument, but it leads to a memory error: all 126 GB of RAM get used. I could avoid parallelisation, but then the process takes ages to finish. Is there any way to control the memory usage of apply_ufunc so that the process stays within my available RAM?
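For reference, my call looks roughly like the sketch below. reduce_time is a stand-in for the real function (which maps the 14395 time steps down to 365), and the file path, the dayofyear dimension name, and the output_sizes entry are just how I have set things up, not anything special:

import numpy as np
import xarray as xr

# hypothetical path; chunks matches the chunking shown above
ds = xr.open_dataset("tasmax.nc", chunks={"lon": 600})

def reduce_time(x):
    # stand-in reduction: apply_ufunc moves the "time" core dim to the
    # last axis, so x arrives with shape (lat, lon, time); the real
    # function returns 365 values along that last axis
    return x[..., :365]

result = xr.apply_ufunc(
    reduce_time,
    ds["tasmax"],
    input_core_dims=[["time"]],
    output_core_dims=[["dayofyear"]],
    dask="parallelized",
    output_dtypes=[np.float32],
    dask_gufunc_kwargs={"output_sizes": {"dayofyear": 365}},
)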