
I have a large (>100 GB) xarray Dataset holding weather forecast data (dimensions time, forecast step, latitude, longitude, with dask chunks over the time, latitude and longitude dimensions) and want to work out the average weather (for each time point) over an irregularly shaped region (defined by a binary mask array with dimensions latitude and longitude). The naive way of doing this is:

average_weather = weather.where(mask).mean(dim=('latitude', 'longitude'))

However, for most of the (latitude, longitude) chunks, the mask values covering that chunk are all zero, so there is no need to load the chunk at all. As far as I can tell from a brief look at the xarray and dask source, there is no optimization that checks whether all the mask values for a chunk are zero before loading it, so the naive command incurs a great deal of unnecessary data transfer and CPU time.

I did see that it is possible to use drop=True in the where command to limit the computation to the bounding box of the mask, but is it possible to do better than this?

user7813790
  • I also looked at [Masking in Dask](https://stackoverflow.com/questions/51195138/masking-in-dask) but, so far as I can tell, `dask.ma` and `numpy.ma` require the mask array to have the same shape as the data. – user7813790 Jul 29 '19 at 15:11

1 Answer


By default, when using where, the values where the mask is False are replaced by NaN. If you use the drop=True keyword, coordinate labels whose values are all masked are dropped as well, so the result is cropped to the bounding box of the mask. Note that this changes the shape of your data, and points inside the bounding box but outside the mask are still NaN.

e.g.

average_weather = weather.where(mask, drop=True).mean(dim=('latitude', 'longitude'))
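As a minimal sketch of what this does (toy synthetic data; the array sizes, chunk sizes, and variable names are assumptions, and a small DataArray stands in for the >100 GB Dataset):

```python
# Minimal sketch: where(mask, drop=True) crops the data to the bounding
# box of the mask before averaging. Sizes and names here are made up.
import numpy as np
import xarray as xr

# Toy "weather" DataArray, dask-chunked over latitude and longitude.
weather = xr.DataArray(
    np.random.rand(4, 10, 10),
    dims=("time", "latitude", "longitude"),
    coords={
        "time": np.arange(4),
        "latitude": np.arange(10),
        "longitude": np.arange(10),
    },
    name="temperature",
).chunk({"latitude": 5, "longitude": 5})

# Boolean mask over an irregular region of the (latitude, longitude) grid.
mask_values = np.zeros((10, 10), dtype=bool)
mask_values[2:4, 2:5] = True
mask_values[3, 5] = True  # make the region irregular, not a plain box
mask = xr.DataArray(
    mask_values,
    dims=("latitude", "longitude"),
    coords={"latitude": np.arange(10), "longitude": np.arange(10)},
)

# drop=True removes latitudes/longitudes where the mask is all False,
# i.e. it crops to the mask's bounding box; points inside the box but
# outside the mask are still NaN and are skipped by mean().
cropped = weather.where(mask, drop=True)
print(cropped.shape)  # (4, 2, 4): 2 latitudes x 4 longitudes remain

average_weather = cropped.mean(dim=("latitude", "longitude"))
print(average_weather.compute())
```

Note that this still loads every chunk overlapping the mask's bounding box, so it helps most when the masked region is compact relative to the full grid.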

Charles