Select dimensions by name from a dask chunk

Question

I have some ensemble files in grib format that I would like to lazy load in Python using dask and xarray. Based in https://climate-cms.org/2018/09/14/dask-era-interim.html, I managed to lazy load the files as intended, but now I want to slice and select the dimensions to plot the data for some time and level.

UPDATE: I've recently came back to this issue and I finally figured out that instead of using da.concatenate, I should use da.stack. This simple change solved my problem. This issue is updated accordingly, in case anyone need an example on how to create an ensemble of grib files using python (with dask arrays for lazy load), to load and plot data in the same fashion as one would do using softwares like GrADS.

My program looks like:

import dask
import dask.array as da
import xarray as xr
import pandas as pd
import numpy as np
from glob import glob
from datetime import date, datetime, timedelta

import matplotlib.pyplot as plt

bpath = '/some/path/to/my/data'

# pressure levels
levels =['1000', '925', '850', '700', '500', '300', '250', '200', '50']

# ensemble member names
ensm = ['M01', 'M02', 'M03', 'M04', 'M05'] 

@dask.delayed
def open_file_delayed(file, vname):
    ds = xr.open_dataset(file, decode_cf='False', engine='pynio')
    return ds

def open_file(file, vname, nlevs, nlats, nlons, ftype):
    file_data = open_file_delayed(file, vname)[vname].data
    return da.from_delayed(file_data, (nlevs, nlats, nlons), ftype)

    # list of files to open (sorted by date)
    # filename mask: PREFIXMEMYYYYiMMiDDiHHiYYYYfMMfDDfHHf.grb
    # MEM: member name (see the levels list)
    # YYYYiMMiDDiHHi: analysis date (passed as an argument to the open_file function)
    # YYYYfMMfDDfHHf: forecast date
    files = sorted(glob(bpath + '/%(dateanl)s/%(mem)s/PREFIX%(mem)s%(dateanl)s*.grb'%
                        {'dateanl': date, 'mem': member}))
   
    ntime = len(files)

    # open the first file in the list to get dimensions and coordinates
    ds0 = xr.open_dataset(files[0], decode_cf='False', engine='pynio')
   
    var0 = ds0[vname]
   
    levs = ds0.lv_ISBL2.data
    lats = ds0.g4_lat_0.data
    lons = ds0.g4_lon_1.data
    
    nlevs = ds0.lv_ISBL2.size
    nlats = ds0.g4_lat_0.size
    nlons = ds0.g4_lon_1.size
    ftype = var0.dtype
   
    ds0.close()
    
    # calculate the date range of the forecasts, in my case len(date_fmt) = 61 (every grib file has 61 times and 9 levels)
    date_fmt = pd.date_range(start=datetime.strptime(date, "%Y%m%d%H"), freq="6H", periods=ntime)
        
    # call the function 'open_file' for all files contained in the list 'files' and stack them up
    dask_var = da.stack([open_file(file, vname, nlevs, nlats, nlons, ftype) for file in files], axis=0)
        
    # xda is the data array with all files
    xda = xr.DataArray(dask_var, dims=['tlev', 'lat', 'lon']) 
            
    # set coordinates values
    xda.coords['time'] = ('time', date_fmt)
    xda.coords['lat'] = ('lat', lats)
    xda.coords['lon'] = ('lon', lons)
    
    return xda

To use this code, I do (for a single analysis date - 202005300 - May 30, 2020, and a variable called ZGEO):

Note: this part is very fast (it takes miliseconds), as we are just creating a map structure to the actual data, similar to a GrADS control file.

lens_zgeo = [open_ensemble('2020053000', ens, 'ZGEO') for ens in ensm]
dens_zgeo = xr.concat(lens_zgeo, dim='ens')
dens_zgeo.coords['ens'] = ('ens', ensm)

dens_zgeo is a data array with the following sctructure:

data array structure

From this point, I can slice the dimensions of the data array and plot (which was what I've intented originally):

Note: this part takes longer because the data needs to be read from the disk.

dens_zgeo.isel(ens=0,time=0,lev=0).plot()

BOOM, case closed. Thanks!

score 0 · Answer 1 · answered Oct 28 '21 at 18:46

0

I've edited the question with the modifications I needed in order to get the result I wanted. For this case, the main point is the use of da.stack instead of da.concatenate. By doing so, I've got the resulting data array to get concatenated in the ensemble dimension I needed.

answered Oct 28 '21 at 18:46

OOM

1
2

1

I suggest that you only put the answer _in the answer_. Don't edit the question so that the answer is in the question, as that defeats the purpose of the question. – Sylvester Kruin Oct 28 '21 at 20:20

Select dimensions by name from a dask chunk

1 Answers1