
I have a zarr store of weather data at a 1-hour time interval for the year 2022, so 8760 chunks along time. Data exists only for random days, though. How do I check for which of the 8760 hours data is available? The store is defined with "fill_value": "NaN".

I am iterating over each hour and checking whether the slice is all NaN, as below (using xarray), to identify whether there is data or not. But it's a very time-consuming process.

hours = 8760
for hour in range(hours):
    if not np.isnan(np.array(xarrds['temperature'][hour])).all():
        print(f"data available in hour: {hour}")

Is there a better way to check the data availability?

sjd

1 Answer


Don't use an outer Python loop. Express the check as a single reduction over the non-time dimensions and let dask execute it in parallel:

# assuming your data is already chunked along time, i.e. .chunk({'time': 1})
da = xarrds['temperature']

# get the names of non-time dims to reduce over
non_time_dims = [d for d in da.dims if d != 'time']

# create boolean DataArray indexed by time giving where array is all NaN
all_null_by_hour = da.isnull().all(dim=non_time_dims)

# compute the array
all_null_by_hour = all_null_by_hour.compute()
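
To get from that boolean mask to the hours that actually hold data, which is what the question asks for, something like the following sketch might work. It uses a small synthetic dataset instead of the real store, so the `lat`/`lon` dimension names, the grid size, and the variable names `has_data`, `available_hours`, and `available_times` are all assumptions made for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the real store (assumption): 48 hourly steps
# on a 3x4 grid, initially all NaN, like an unwritten zarr chunk.
times = np.arange("2022-01-01T00", "2022-01-03T00", dtype="datetime64[h]")
data = np.full((times.size, 3, 4), np.nan)
data[[5, 6, 30]] = 280.0  # pretend only these hours were ever written
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), data)},
    coords={"time": times},
)

da = ds["temperature"]
non_time_dims = [d for d in da.dims if d != "time"]

# True where at least one grid cell is not NaN for that hour
has_data = da.notnull().any(dim=non_time_dims)

# Integer positions along time, and the matching timestamps
available_hours = np.flatnonzero(has_data.values)
available_times = da["time"].values[has_data.values]

print(available_hours.tolist())  # [5, 6, 30]
```

On a dask-backed dataset, accessing `.values` triggers the computation, so the reduction still runs chunk by chunk rather than in a Python loop.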
Michael Delgado