I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. The computation itself is complete and takes about 30 minutes. I want to save the final result to a NetCDF4 file, but the write is very slow (~3 hours) and does not appear to run in parallel. It is unclear to me whether the "to_netcdf" function in Xarray is supposed to support parallel writes. My current approach is to create an empty NetCDF file with netCDF4 and then append the data from the Xarray object:

import netCDF4
import rasterio

f_mosaic = 't1.nc'

# dat_f is assumed to have dimensions (wl, y, x), matching the variable definition below
meta = {'width': dat_f.shape[2],   # number of columns (x)
        'height': dat_f.shape[1],  # number of rows (y)
        'crs': rasterio.crs.CRS(init='epsg:'+fi['CPER']['Reflectance']['Metadata']['Coordinate_System']['EPSG Code'].value.decode("utf-8")),
        'transform': aff_final,
        'count': dat_f.shape[0]}   # number of bands (wl)

with netCDF4.Dataset(f_mosaic, mode='w', format="NETCDF4") as t1:
    # Create the spatial and spectral dimensions
    y = t1.createDimension('y', meta['height'])
    x = t1.createDimension('x', meta['width'])
    wl_dim = t1.createDimension('wl', meta['count'])
    reflectance = t1.createVariable("reflectance","int16",("wl","y","x",),fill_value=null_val,zlib=True)
    reflectance.setncattr('grid_mapping', 'crs')
    crs = t1.createVariable('crs', 'c')
    crs.spatial_ref = meta['crs'].wkt
    crs.epsg_code = meta['crs'].to_string()
    crs.GeoTransform = " ".join(str(x) for x in meta['transform'].to_gdal())

dat_f.to_netcdf(path=f_mosaic,mode='a',format='NETCDF4',encoding={'reflectance':{'zlib':True}})

Overall, my question is: how can I write this data to a NetCDF4 file quickly? Does Dask/Xarray support parallel writes to NetCDF4? If so, what am I doing incorrectly?
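One quick diagnostic is to check whether the installed netCDF4 build was compiled with parallel I/O at all. The sketch below assumes the netCDF4 Python package exposes the build flags `__has_parallel4_support__` and `__has_pnetcdf_support__`, as recent releases do. The helper name `netcdf_parallel_support` is made up for illustration.

```python
import importlib


def netcdf_parallel_support():
    """Report whether the installed netCDF4 build advertises parallel I/O.

    Returns None if the netCDF4 package cannot be imported.
    """
    try:
        nc = importlib.import_module("netCDF4")
    except ImportError:
        return None
    # getattr with a default keeps this working on builds that lack the flags
    return {
        "parallel4": bool(getattr(nc, "__has_parallel4_support__", False)),
        "pnetcdf": bool(getattr(nc, "__has_pnetcdf_support__", False)),
    }


print(netcdf_parallel_support())
```

If both flags come back False, the library cannot open a file with `parallel=True`. Even when one of them is True, whether Xarray's "to_netcdf" makes use of it is a separate question.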

Thanks!

Rowan_Gaffney
  • Interesting question, but I assume it does not. Parallel netCDF is different from ordinary netCDF, and to enable it, the library on which netCDF4 Python is built (I am thinking of the netCDF C library) has to be compiled with parallel support. I look forward to others' comments and answers. – msi_gerva Sep 26 '18 at 13:40
  • I realise that CDF may be a requirement for you, but have you considered [zarr](https://zarr.readthedocs.io/en/stable/), which is built specifically with parallelism in mind and is supported by xarray? – mdurant Sep 26 '18 at 14:05
  • Thanks for the tips, @mdurant and msi_gerva. I was hoping to stick with a file format compatible with gdal and rasterio. However, I may try zarr if I am unable to improve the write speeds with netCDF. – Rowan_Gaffney Sep 26 '18 at 16:16
  • We have an experimental rewrite of xarray's file backends that should be significantly faster for writing netCDF files: https://github.com/pydata/xarray/pull/2261. Could you give it a try and report on your experience? Feel free to ask any questions as a comment on the GitHub pull request. – shoyer Sep 26 '18 at 16:48
