I have a large (>100 GB) xarray Dataset of weather forecast data, with dimensions time, forecast step, latitude and longitude, and dask chunks over the time, latitude and longitude dimensions. I want to work out the average weather (for each time point) over an irregularly shaped region, defined by a binary mask array with dimensions latitude and longitude. The naive way of doing this is:
average_weather = weather.where(mask).mean(dim=('latitude', 'longitude'))
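For concreteness, a toy version of the setup might look like this (the variable name temperature and all shapes/chunk sizes are made up for illustration; the real Dataset is >100 GB):

import numpy as np
import xarray as xr

# Toy stand-in for the real Dataset; names, shapes and chunk sizes are
# illustrative only.
weather = xr.Dataset(
    {
        "temperature": (
            ("time", "step", "latitude", "longitude"),
            np.zeros((8, 4, 360, 720), dtype="float32"),
        )
    }
).chunk({"time": 1, "latitude": 90, "longitude": 180})

# Binary mask over (latitude, longitude) selecting the irregular region.
mask = xr.DataArray(
    np.zeros((360, 720), dtype=bool),
    dims=("latitude", "longitude"),
)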
However, for most of the (latitude, longitude) chunks, the mask values covering that chunk are all zero, so there is no need to load the chunk at all. So far as I can tell from a brief look at the xarray and dask source, there is no optimization that checks whether all the mask values for a chunk are zero before loading it, so the naive command will incur a great deal of unnecessary data transfer and CPU time.
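As a rough way to quantify this (assuming mask is a NumPy-backed boolean DataArray aligned with weather's latitude/longitude coordinates), I can count the chunks that contain any mask points:

import numpy as np

# Chunk boundaries along latitude and longitude, from the Dataset's chunking.
lat_edges = np.cumsum((0,) + weather.chunksizes["latitude"])
lon_edges = np.cumsum((0,) + weather.chunksizes["longitude"])

# Count (latitude, longitude) chunks whose slice of the mask has any True values.
mask2d = mask.transpose("latitude", "longitude").values
touched = sum(
    bool(mask2d[i0:i1, j0:j1].any())
    for i0, i1 in zip(lat_edges[:-1], lat_edges[1:])
    for j0, j1 in zip(lon_edges[:-1], lon_edges[1:])
)
total = (len(lat_edges) - 1) * (len(lon_edges) - 1)
print(f"{touched} of {total} lat/lon chunks intersect the mask")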
I did see that it is possible to pass drop=True to the where call to limit the computation to the bounding box of the mask (see the sketch below), but is it possible to do better than this?
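For reference, my understanding is that the drop=True variant looks something like:

# Crop to the mask's bounding box first; drop=True discards latitude
# rows and longitude columns where the mask is entirely False.
average_weather = (
    weather.where(mask, drop=True)
           .mean(dim=("latitude", "longitude"))
)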