
I am trying to use xarray.apply_ufunc on a GroupBy object with dask parallelization, but I am getting an error.

The dataset contains 30 years of daily temperature data over a certain region on a 1 km² grid, so the data shape is 10950 × 1450 × 900 (days, Y axis and X axis respectively).

The main goal is to sort the values for each location within each year. More importantly, the algorithm must be memory efficient.

Since the data is huge (~120 GB) and does not fit into memory, I am trying to sort it using dask, but from my research I see that there is no simple solution for this with dask, nor with any other well-known library (xarray, numpy, ...), if there is one at all.
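For reference, here is a small toy stand-in for my setup, opened lazily with dask-backed chunks (the variable name 'temperature' and the tiny grid sizes are made up; the real data has the shape described above):

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy stand-in: 2 years of daily data on a tiny grid, chunked so that
# roughly one year of data is loaded at a time
time = pd.date_range('2000-01-01', '2001-12-31', freq='D')
ds = xr.Dataset(
    {'temperature': (('time', 'y', 'x'), np.random.rand(len(time), 4, 5))},
    coords={'time': time},
).chunk({'time': 365})  # lazy dask-backed blocks along the time axis
```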

dask does not implement a general sorting algorithm, since sorting is hard to parallelize efficiently.

The only related function dask provides is the topk method, which returns the k largest (or smallest) elements along an axis. When I applied it to the whole dataset, memory usage hit 100%.
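For illustration, this is how topk behaves on a small array (the only dask primitive I found that comes close to sorting):

```python
import numpy as np
import dask.array as da

# topk returns the k largest elements along an axis, sorted descending;
# a negative k returns the k smallest, sorted ascending
arr = da.from_array(np.array([[3.0, 1.0, 2.0],
                              [9.0, 7.0, 8.0]]), chunks=(1, 2))
top2 = arr.topk(2, axis=1).compute()      # two largest per row
bottom1 = arr.topk(-1, axis=1).compute()  # single smallest per row
```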

So now I am trying to run numpy.sort with dask parallelization enabled to see if that could help. But I cannot even test it, since it throws an error.

Code used:

```python
xarray.apply_ufunc(
    numpy.sort,
    dataset.groupby('time.year'),
    kwargs={'axis': 0},
    dask='parallelized',
    output_dtypes=[numpy.float64],
)
```

Error:

```
ValueError: output dtypes (output_dtypes) must be supplied to apply_func when using dask='parallelized'
```

Am I doing something wrong, or does apply_ufunc not support GroupBy objects?

From xarray's docs, the args can be GroupBy objects too:

*args (Dataset, DataArray, GroupBy, Variable, numpy.ndarray, dask.array.Array or scalar) – Mix of labeled and/or unlabeled arrays to which to apply the function.

I am confused about how to use it correctly.
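One pattern I have seen suggested (not verified at full scale; the 'temperature' name and toy sizes here are mine) is to call apply_ufunc on each group inside groupby(...).map, rather than on the GroupBy object itself, making 'time' a core dimension:

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy stand-in for the real 30-year dataset
time = pd.date_range('2000-01-01', '2001-12-31', freq='D')
ds = xr.Dataset(
    {'temperature': (('time', 'y', 'x'), np.random.rand(len(time), 3, 4))},
    coords={'time': time},
).chunk({'time': 365})

def sort_along_time(group):
    # apply_ufunc is called on each group (a plain DataArray), not on
    # the GroupBy object; 'time' becomes a core dimension, which
    # apply_ufunc moves to the last axis, hence axis=-1 for numpy.sort
    return xr.apply_ufunc(
        np.sort,
        group.chunk({'time': -1}),   # a core dim must be a single chunk
        input_core_dims=[['time']],
        output_core_dims=[['time']],
        kwargs={'axis': -1},
        dask='parallelized',
        output_dtypes=[group.dtype],
    )

result = ds['temperature'].groupby('time.year').map(sort_along_time)
```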

Anyway, I would be grateful if you could suggest any working approach.

wol

0 Answers