
I have started working with large datasets from the Copernicus Marine Service. I download the NetCDF files through motuclient and then process the data with xarray to calculate the mean value for each position of the grid. I would now like to calculate the average of the 20 highest values (the extremes) instead. How can I accomplish that? Can I use xarray, or should I look for something else? My code for calculating the average of all values is:

import xarray as xr

ds = xr.open_mfdataset(file, engine="rasterio")
yearly_data = ds.mean("time")
Vassilis
  • Can you paste the result of `print(ds)` and include detail of how the data is chunked? This would be easier for us to answer with a [mre] but at minimum it would help to have a sense of the dims, shape, and chunking scheme. – Michael Delgado May 24 '22 at 15:01
  • Also welcome to stack overflow! Thanks for the question - it’s a chewy one :) xarray can definitely do this but it’ll take getting into the dask weeds a bit - finding cross-chunk summary stats like this can be a bit tough depending on the stat. – Michael Delgado May 24 '22 at 15:02

1 Answer


dask.array has topk and argtopk methods which you can use to find the largest (or smallest) k values across a chunked array. You could adapt this to xarray using the following:


def topk_xr(da, n, dim):
    """Get the largest n (or smallest, if n is negative) values along dim."""
    axis = da.get_axis_num(dim)
    # dask.array method: the k largest values along the given axis
    largest = da.data.topk(n, axis=axis)
    # swap the reduced dimension for a new 'rank' dimension
    dims = [d for d in da.dims if d != dim]
    dims.insert(axis, 'rank')
    return xr.DataArray(
        largest,
        dims=dims,
        coords={
            'rank': range(abs(n)),
            **{d: da.coords[d] for d in da.dims if d != dim},
        },
    )

You can then call this on a DataArray to get the top k values along whichever dimension you'd like:

In [54]: topk_xr(ds['myvar'], 20, dim='time')
Out[54]:
<xarray.DataArray 'topk_aggregate-aggregate-858fdf' (rank: 20, y: 10, x: 10)>
dask.array<topk_aggregate-aggregate, shape=(20, 10, 10), dtype=float64, chunksize=(20, 10, 10), chunktype=numpy.ndarray>
Coordinates:
  * rank     (rank) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y        (y) int64 20 21 22 23 24 25 26 27 28 29
  * x        (x) int64 -110 -109 -108 -107 -106 -105 -104 -103 -102 -101
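
To get the average of the 20 highest values at each grid point (the original goal), take the mean over the new rank dimension. A minimal sketch, assuming the variable of interest is named myvar as above:

# average of the 20 largest values along time, per grid cell (still lazy/dask-backed)
top20_mean = topk_xr(ds['myvar'], 20, dim='time').mean('rank')

Nothing is computed until you call .compute() (or write the result out), so this stays chunk-friendly.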

Similarly, you can map it across all arrays in a Dataset, assuming they are similarly shaped:

In [57]: ds.map(topk_xr, n=20, dim='time')
Out[57]:
<xarray.Dataset>
Dimensions:  (rank: 20, y: 10, x: 10)
Coordinates:
  * rank     (rank) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y        (y) int64 20 21 22 23 24 25 26 27 28 29
  * x        (x) int64 -110 -109 -108 -107 -106 -105 -104 -103 -102 -101
Data variables:
    myvar    (rank, y, x) float64 dask.array<chunksize=(20, 10, 10), meta=np.ndarray>

If you wanted to find the positional indices of these maxima/minima, you could use argtopk instead of topk in the function.
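
For reference, here is a sketch of that variant (untested, but mirroring topk_xr above); it returns integer positions along dim, e.g. which time steps the extremes came from:

def argtopk_xr(da, n, dim):
    """Get positional indices of the largest n (or smallest, if n is negative) values along dim."""
    axis = da.get_axis_num(dim)
    # dask.array method: indices of the k largest values along the given axis
    indices = da.data.argtopk(n, axis=axis)
    dims = [d for d in da.dims if d != dim]
    dims.insert(axis, 'rank')
    return xr.DataArray(
        indices,
        dims=dims,
        coords={
            'rank': range(abs(n)),
            **{d: da.coords[d] for d in da.dims if d != dim},
        },
    )

You could then use those positions (after computing them) with .isel, or with da[dim].values, to look up, for example, the timestamps of the extremes.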

Michael Delgado