
I am reading NetCDF files using xarray. Each variable has 4 dimensions (Times, lev, y, x). After reading the variables, I calculate the mean of the variable QVAPOR along the (Times, lev) dimensions. This gives me QVAPOR_mean, a 2D variable with shape (y: 699, x: 639).

Xarray took only 10 microseconds to read the data with shape (Times: 2918, lev: 36, y: 699, x: 639), but took more than 60 minutes to plot the filled contour of the data of shape (y: 699, x: 639).

I am wondering how Xarray can take such an extremely long time (more than 60 minutes) to plot the contourf of an array of size (y: 699, x: 639).

I use the following code to read the files and perform the computation:

flnm=xr.open_mfdataset('./WRF_3D_2007_*.nc',chunks={'Times': 100})
QVAPOR_mean=flnm.QVAPOR.mean(dim=('Times','lev'))
QVAPOR_mean.plot.imshow()

The last command takes more than 60 mins to complete. Help is appreciated. Thank You

Sopan Kurkute
  • How big is your total dataset, in GB? `2918 * 36 * 699 * 639 * 8 / 2**30 = 350GB` ? You could play with the chunk sizes, but I'm not sure how much better you could hope for. – mdurant Mar 14 '18 at 14:27
  • Yes, it is around ~350 GB. I'd already chunked the data along the `Times` dimension. The computation is very fast; the only problem is plotting. Python should not take more than a few seconds to plot data with shape `(y: 699, x: 639)`. I'm wondering what is happening? – Sopan Kurkute Mar 14 '18 at 16:28

1 Answer


When you open your dataset and provide the chunks argument, xarray returns a Dataset made up of dask arrays. These arrays are evaluated lazily (see the xarray/dask documentation): the computation is not triggered until you plot your data. To illustrate this, you can explicitly load your data after taking the mean:

flnm=xr.open_mfdataset('./WRF_3D_2007_*.nc',chunks={'Times': 100})
QVAPOR_mean=flnm.QVAPOR.mean(dim=('Times','lev')).load()

Now your QVAPOR_mean variable is backed by a numpy array instead of a dask array. Plotting this array will likely be much faster.
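The lazy-versus-eager distinction can be seen directly with a small synthetic stand-in (array names and sizes here are made up for illustration; `.compute()` is used instead of `.load()` so the lazy original is left untouched, since `.load()` loads in place):

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for the WRF variable
da = xr.DataArray(
    np.random.rand(6, 3, 4, 5),
    dims=("Times", "lev", "y", "x"),
    name="QVAPOR",
).chunk({"Times": 3})

mean_lazy = da.mean(dim=("Times", "lev"))   # still a lazy dask array
mean_eager = mean_lazy.compute()            # computation happens here

print(type(mean_lazy.data))    # a dask array class
print(type(mean_eager.data))   # numpy.ndarray
```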

However, computing the mean is still likely to take quite a long time. There are ways to improve the throughput here as well.

  • Try using a larger chunk size. I often find that chunk sizes in the 10-100 MB range perform best.

  • Try a different scheduler. By default you are using dask's threaded scheduler. Because of limitations with netCDF/HDF, this does not allow parallel reads from disk. We have found that the distributed scheduler works well for these applications.
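Both suggestions can be sketched together on synthetic data (tiny sizes so the example runs; the real array would come from `xr.open_mfdataset('./WRF_3D_2007_*.nc', chunks=...)`, and the chunk numbers here are illustrative, not tuned):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the WRF data
da = xr.DataArray(
    np.random.rand(40, 6, 30, 20),
    dims=("Times", "lev", "y", "x"),
    name="QVAPOR",
).chunk({"Times": 5})

# Inspect the current chunk layout, then rechunk to larger blocks along Times
print(da.chunks)
da_big = da.chunk({"Times": 20})

# To switch from the default threaded scheduler to the distributed one
# (requires the optional `distributed` package):
# from dask.distributed import Client
# client = Client()  # registers the distributed scheduler as the default

result = da_big.mean(dim=("Times", "lev")).compute()
print(result.shape)  # (30, 20)
```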

jhamman