A series of about 90 netCDF files, each around 27 MB, opened with xarray's open_mfdataset, takes a long time to load a small space-time selection.

Chunking the dimensions yields only a marginal gain. Setting decode_cf=True, either inside open_mfdataset or as a separate step, makes no difference either. Another suggestion, at https://groups.google.com/forum/#!topic/xarray/11lDGSeza78, had me save the selection as a separate netCDF file and reload it.

It seems to bottleneck when the dask portion has to do some work (loading, computing, converting to a pandas dataframe).

Generating a graph with dask.visualize produces a huge image. It may be telling us something, but I'm not sure how to interpret it.

import xarray as xr

wind = xr.open_mfdataset(testCCMPPath,
                         decode_cf=True,
                         chunks={'time': 100,
                                 'latitude': 100,
                                 'longitude': 100})
%timeit wind.sel(latitude=latRange, longitude=windLonRange, time=windDateQueryRange).load()

wxr = wind.sel(latitude=latRange, longitude=windLonRange, time=windDateQueryRange)
df = wxr.to_dataframe()
print(df.shape)

timeit output shows

1.93 s ± 29.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

df.shape output is only 164x3.

I have a similar sel on another xarray dataset that takes about 0.05 seconds; however, that dataset has a lot of sparse points, whereas the wind dataset has few empty spaces.

1 Answer

It turns out that the number of files proved too much for dask to handle efficiently.

These files have latitude, longitude, and time dimensions, and time has a granularity of 3 hours. Over the time span I'm working with, that adds up to ~35000 files, which is too many for dask to handle. I got around this by merging the files by year, reducing the number of .nc files to 12.
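As a rough check on that file count, here is a small sketch of the arithmetic. It assumes one file per 3-hour time step and a hypothetical 12-year span of 2004 through 2015; the actual years are not stated here, so treat the dates as placeholders.

```python
from datetime import datetime, timedelta


def count_steps(start, end, step):
    """Number of time steps of length `step` in the half-open interval [start, end)."""
    return (end - start) // step


# Hypothetical span: 12 full years, one file per 3-hour step (both are assumptions).
n_files = count_steps(datetime(2004, 1, 1), datetime(2016, 1, 1), timedelta(hours=3))
print(n_files)  # 35064, i.e. roughly 35000 files

# After merging by year there is one file per year instead.
n_merged = 2016 - 2004
print(n_merged)  # 12
```

Eight files per day over twelve years is what pushes the count into the tens of thousands, while one file per year keeps it at 12.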

CDO (Climate Data Operators) is a utility that lets us merge files quickly. See https://www.unidata.ucar.edu/software/netcdf/software.html#CDO for more details.

As an example of how I used cdo: for the set of files in the directory ./precip/2004, I ran the following shell command to create a concatenated netCDF file, 2004.nc:

cdo cat ./precip/2004/*.nc4 2004.nc

From there, xr.open_mfdataset() performs much better.
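To repeat the cdo step for every year without typing each command by hand, something like the following sketch could generate the commands. It assumes the layout used above (one subdirectory per year, as in ./precip/2004) and that sorting the file names gives chronological order. The helper names group_by_year and cdo_cat_commands are made up for illustration, and the file names in the usage lines are hypothetical.

```python
from collections import defaultdict
from pathlib import Path
import shlex


def group_by_year(paths):
    """Group file paths by the name of their parent directory (assumed to be the year)."""
    groups = defaultdict(list)
    for p in map(Path, paths):
        groups[p.parent.name].append(p)
    # Sort years and files by name; name order is assumed to be chronological.
    return {year: sorted(files) for year, files in sorted(groups.items())}


def cdo_cat_commands(paths, out_dir="."):
    """Build one `cdo cat <inputs...> <year>.nc` argument list per year."""
    commands = []
    for year, files in group_by_year(paths).items():
        out_file = Path(out_dir) / f"{year}.nc"
        commands.append(["cdo", "cat", *map(str, files), str(out_file)])
    return commands


# Hypothetical file names, only to show the shape of the generated commands.
example = ["./precip/2004/a.nc4", "./precip/2004/b.nc4", "./precip/2005/a.nc4"]
for cmd in cdo_cat_commands(example):
    # Print only; pass cmd to subprocess.run(cmd, check=True) to actually run cdo.
    print(shlex.join(cmd))
```

Building argument lists instead of one shell string sidesteps quoting problems with odd file names. With real data, the input list could come from something like Path("./precip").glob("*/*.nc4").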