I have some ensemble files in GRIB format that I would like to lazy-load in Python using dask and xarray. Based on https://climate-cms.org/2018/09/14/dask-era-interim.html, I managed to lazy-load the files as intended, but now I want to slice and select the dimensions to plot the data for a given time and level.
UPDATE: I recently came back to this issue and finally figured out that instead of using da.concatenate, I should use da.stack. This simple change solved my problem. The question is updated accordingly, in case anyone needs an example of how to build an ensemble of grib files in Python (with dask arrays for lazy loading), to load and plot data in the same fashion as one would with software like GrADS.
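The difference matters because each file holds a single (lev, lat, lon) field: da.concatenate joins arrays along an existing axis, merging the levels of consecutive files, while da.stack adds the new time axis I actually needed. A minimal sketch with dummy shapes (the 73x144 grid is just an example):

import dask.array as da
import numpy as np

a = da.from_array(np.zeros((9, 73, 144)))    # one forecast time: (lev, lat, lon)
b = da.from_array(np.ones((9, 73, 144)))     # the next forecast time
print(da.concatenate([a, b], axis=0).shape)  # (18, 73, 144): levels run together
print(da.stack([a, b], axis=0).shape)        # (2, 9, 73, 144): new leading time axis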
My program looks like:
import dask
import dask.array as da
import xarray as xr
import pandas as pd
import numpy as np
from glob import glob
from datetime import datetime
import matplotlib.pyplot as plt
bpath = '/some/path/to/my/data'
# pressure levels
levels = ['1000', '925', '850', '700', '500', '300', '250', '200', '50']
# ensemble member names
ensm = ['M01', 'M02', 'M03', 'M04', 'M05']
@dask.delayed
def open_file_delayed(file, vname):
    # open a single grib file; as a delayed function, nothing is read
    # from disk until the graph is computed
    ds = xr.open_dataset(file, decode_cf=False, engine='pynio')
    return ds

def open_file(file, vname, nlevs, nlats, nlons, ftype):
    # pick the variable out of the delayed dataset and wrap it in a
    # dask array with a known shape and dtype
    file_data = open_file_delayed(file, vname)[vname].data
    return da.from_delayed(file_data, (nlevs, nlats, nlons), ftype)
def open_ensemble(date, member, vname):
    # list of files to open (sorted by date)
    # filename mask: PREFIXMEMYYYYiMMiDDiHHiYYYYfMMfDDfHHf.grb
    # MEM: member name (see the ensm list)
    # YYYYiMMiDDiHHi: analysis date (the 'date' argument)
    # YYYYfMMfDDfHHf: forecast date
    files = sorted(glob(bpath + '/%(dateanl)s/%(mem)s/PREFIX%(mem)s%(dateanl)s*.grb' %
                        {'dateanl': date, 'mem': member}))
    ntime = len(files)
    # open the first file in the list to get dimensions and coordinates
    ds0 = xr.open_dataset(files[0], decode_cf=False, engine='pynio')
    var0 = ds0[vname]
    levs = ds0.lv_ISBL2.data
    lats = ds0.g4_lat_0.data
    lons = ds0.g4_lon_1.data
    nlevs = ds0.lv_ISBL2.size
    nlats = ds0.g4_lat_0.size
    nlons = ds0.g4_lon_1.size
    ftype = var0.dtype
    ds0.close()
    # calculate the date range of the forecasts; in my case ntime = 61
    # (one 6-hourly forecast time per grib file, each file with 9 levels)
    date_fmt = pd.date_range(start=datetime.strptime(date, "%Y%m%d%H"), freq="6H", periods=ntime)
    # call 'open_file' for every file in 'files' and stack the results
    # along a new leading time axis
    dask_var = da.stack([open_file(file, vname, nlevs, nlats, nlons, ftype) for file in files], axis=0)
    # xda is the data array with all files: (time, lev, lat, lon)
    xda = xr.DataArray(dask_var, dims=['time', 'lev', 'lat', 'lon'])
    # set coordinate values
    xda.coords['time'] = ('time', date_fmt)
    xda.coords['lev'] = ('lev', levs)
    xda.coords['lat'] = ('lat', lats)
    xda.coords['lon'] = ('lon', lons)
    return xda
To use this code, I do (for a single analysis date - 2020053000, i.e. May 30, 2020 at 00Z - and a variable called ZGEO):
Note: this part is very fast (it takes milliseconds), as we are just creating a map structure to the actual data, similar to a GrADS control file.
lens_zgeo = [open_ensemble('2020053000', ens, 'ZGEO') for ens in ensm]
dens_zgeo = xr.concat(lens_zgeo, dim='ens')
dens_zgeo.coords['ens'] = ('ens', ensm)
dens_zgeo is a data array with dimensions (ens, time, lev, lat, lon), holding the full ensemble.
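At this point nothing has been read from disk yet; the values live in a dask task graph. A quick sanity check (the printed names match the dimensions defined above):

# still lazy: the underlying values are a dask array, not in memory
print(type(dens_zgeo.data))  # <class 'dask.array.core.Array'>
print(dens_zgeo.dims)        # ('ens', 'time', 'lev', 'lat', 'lon')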
From this point, I can slice and select the dimensions of the data array and plot, which is what I originally intended:
Note: this part takes longer because the data needs to be read from disk.
dens_zgeo.isel(ens=0,time=0,lev=0).plot()
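Since the coordinates are set, label-based selection works too. A short sketch (it assumes the lv_ISBL2 levels are in hPa with 500 among them, and picks one of the 6-hourly forecast times):

# select by label instead of position; the values here are assumptions
zgeo_500 = dens_zgeo.sel(lev=500, time='2020-05-31T00')
# ensemble mean map at that time/level; this triggers the disk read
zgeo_500.mean(dim='ens').plot()
plt.show()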
BOOM, case closed. Thanks!