dask.array
has topk
and argtopk
methods which you can use to find the largest (or smallest) k values across a chunked array. You could adapt this to xarray using the following:
In [52]: def topk_xr(da, n, dim):
...: """get the largest n (or smallest if n is negative) along dim"""
...: axis = da.get_axis_num(dim)
...: largest = da.data.topk(n, axis=axis)
...: dims = [d for d in da.dims if d != dim]
...: dims.insert(axis, 'rank')
...: res = xr.DataArray(
...: largest,
...: dims=dims,
...: coords={
...: 'rank': range(0, abs(n)),
...: **{d: da.coords[d] for d in da.dims if d != dim}
...: },
...: )
...:
...: return res
...:
You can then call this on a DataArray to get the top k values along whichever dimension you'd like:
In [54]: topk_xr(ds['myvar'], 20, dim='time')
Out[54]:
<xarray.DataArray 'topk_aggregate-aggregate-858fdf' (rank: 20, y: 10, x: 10)>
dask.array<topk_aggregate-aggregate, shape=(20, 10, 10), dtype=float64, chunksize=(20, 10, 10), chunktype=numpy.ndarray>
Coordinates:
* rank (rank) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
* y (y) int64 20 21 22 23 24 25 26 27 28 29
* x (x) int64 -110 -109 -108 -107 -106 -105 -104 -103 -102 -101
Similarly, you can map it across all arrays in a Dataset, assuming they are similarly shaped:
In [57]: ds.map(topk_xr, n=20, dim='time')
Out[57]:
<xarray.Dataset>
Dimensions: (rank: 20, y: 10, x: 10)
Coordinates:
* rank (rank) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
* y (y) int64 20 21 22 23 24 25 26 27 28 29
* x (x) int64 -110 -109 -108 -107 -106 -105 -104 -103 -102 -101
Data variables:
myarr (rank, y, x) float64 dask.array<chunksize=(20, 10, 10), meta=np.ndarray>
If you wanted to find the positional indices of these maxima/minima, you could use argtopk
instead of topk
in the function.