I have a process that grows a NetCDF file fn every 5 minutes using netCDF4.Dataset(fn, mode='a'). I also have a bokeh server visualization of that NetCDF file using an xarray.Dataset (which I want to keep, because it is so convenient).
The problem is that the NetCDF update process fails when trying to add new data to fn if it is open in my bokeh server process via
ds = xarray.open_dataset(fn)
If I use the option autoclose
ds = xarray.open_dataset(fn, autoclose=True)
updating fn with the other process while ds is "open" in the bokeh server app works, but the updates to the bokeh figure, which pull time slices from fn, get very laggy.
My question is: Is there another way to release the lock on the NetCDF file when using xarray.Dataset?
I would not care if the shape of the xarray.Dataset is only updated consistently after reloading the whole bokeh server app.
Thanks!
Here is a minimal working example:
Put this into a file and let it run:
import time
from datetime import datetime

import numpy as np
import netCDF4

fn = 'my_growing_file.nc'

# Create the file once, with an unlimited time dimension
with netCDF4.Dataset(fn, 'w') as nc_fh:
    # create dimensions
    nc_fh.createDimension('x', 90)
    nc_fh.createDimension('y', 90)
    nc_fh.createDimension('time', None)

    # create variables
    nc_fh.createVariable('x', 'f8', ('x',))
    nc_fh.createVariable('y', 'f8', ('y',))
    nc_fh.createVariable('time', 'f8', ('time',))
    nc_fh.createVariable('rainfall_amount',
                         'i2',
                         ('time', 'y', 'x'),
                         zlib=False,
                         complevel=0,
                         fill_value=-9999,
                         chunksizes=(1, 90, 90))
    nc_fh['rainfall_amount'].scale_factor = 0.1
    nc_fh['rainfall_amount'].add_offset = 0

    nc_fh.set_auto_maskandscale(True)

    # variable attributes
    nc_fh['time'].long_name = 'Time'
    nc_fh['time'].standard_name = 'time'
    nc_fh['time'].units = 'hours since 2000-01-01 00:50:00.0'
    nc_fh['time'].calendar = 'standard'

# Append one time step per iteration; the file is closed while sleeping
for i in range(1000):
    with netCDF4.Dataset(fn, 'a') as nc_fh:
        current_length = len(nc_fh['time'])

        print('Appending to NetCDF file {}'.format(fn))
        print(' length of time vector: {}'.format(current_length))

        if current_length > 0:
            last_time_stamp = netCDF4.num2date(
                nc_fh['time'][-1],
                units=nc_fh['time'].units,
                calendar=nc_fh['time'].calendar)
            print(' last time stamp in NetCDF: {}'.format(str(last_time_stamp)))
        else:
            last_time_stamp = '1900-01-01'
            print(' empty file, starting from scratch')

        nc_fh['time'][i] = netCDF4.date2num(
            datetime.utcnow(),
            units=nc_fh['time'].units,
            calendar=nc_fh['time'].calendar)
        nc_fh['rainfall_amount'][i, :, :] = np.random.rand(90, 90)

    print('Sleeping...\n')
    time.sleep(3)
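As a side note on the time axis in the script above: the values written by netCDF4.date2num are fractional hours counted from the reference time in the units string. The following is only a rough sketch of that arithmetic using plain datetime objects. The helper names are made up for illustration, and it assumes naive UTC datetimes and recent dates, for which the 'standard' calendar agrees with Python's datetime. It is not a replacement for netCDF4.date2num or netCDF4.num2date.

```python
from datetime import datetime, timedelta

# Reference time taken from the units string
# 'hours since 2000-01-01 00:50:00.0' used in the script above.
EPOCH = datetime(2000, 1, 1, 0, 50, 0)


def hours_since_epoch(moment):
    """Encode a naive datetime as fractional hours since EPOCH (hypothetical helper)."""
    return (moment - EPOCH) / timedelta(hours=1)


def from_hours_since_epoch(hours):
    """Decode fractional hours since EPOCH back to a naive datetime (hypothetical helper)."""
    return EPOCH + timedelta(hours=hours)


stamp = datetime(2018, 4, 12, 8, 52, 39)
encoded = hours_since_epoch(stamp)
decoded = from_hours_since_epoch(encoded)
print(encoded, decoded)
```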
Then open the growing file in, e.g., IPython via:
import xarray as xr
ds = xr.open_dataset('my_growing_file.nc')
This will cause the process that appends to the NetCDF file to fail with output like this:
Appending to NetCDF file my_growing_file.nc
length of time vector: 0
empty file, starting from scratch
Sleeping...
Appending to NetCDF file my_growing_file.nc
length of time vector: 1
last time stamp in NetCDF: 2018-04-12 08:52:39.145999
Sleeping...
Appending to NetCDF file my_growing_file.nc
length of time vector: 2
last time stamp in NetCDF: 2018-04-12 08:52:42.159254
Sleeping...
Appending to NetCDF file my_growing_file.nc
length of time vector: 3
last time stamp in NetCDF: 2018-04-12 08:52:45.169516
Sleeping...
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-17-9950ca2e53a6> in <module>()
37
38 for i in range(1000):
---> 39 with netCDF4.Dataset(fn, 'a') as nc_fh:
40 current_length = len(nc_fh['time'])
41
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()
IOError: [Errno -101] NetCDF: HDF error: 'my_growing_file.nc'
If using
ds = xr.open_dataset('my_growing_file.nc', autoclose=True)
there is no error, but access times via xarray of course get slower, which is exactly my problem since my dashboard visualization gets very laggy.
I understand that this may not be the intended use of xarray and, if required, I will fall back to the lower-level interface provided by netCDF4 (hoping that it supports concurrent file access, at least for reads), but I would like to keep xarray for its convenience.
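For illustration, here is a minimal sketch of one possible direction: keep the file closed most of the time and only re-read it when it has actually changed on disk, instead of reopening it on every access. The class name ReloadOnChange and the loader callable are hypothetical and not part of xarray or netCDF4. The demo uses a plain text file so that it runs without either library, and the change check (modification time plus size) is an assumption about what is good enough for a file that only ever grows.

```python
import os
import tempfile


class ReloadOnChange:
    """Cache a loader's result; call the loader again only when the file changes."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader      # hypothetical callable: path -> in-memory data
        self._signature = None    # (mtime_ns, size) seen at the last load
        self._data = None

    def get(self):
        stat = os.stat(self.path)
        signature = (stat.st_mtime_ns, stat.st_size)
        if signature != self._signature:
            # The loader is expected to open the file, copy what it needs
            # into memory, and close the file again before returning.
            self._data = self.loader(self.path)
            self._signature = signature
        return self._data


# Small demo with a text file standing in for the NetCDF file.
calls = []


def read_text(path):
    calls.append(path)            # record every real file access
    with open(path) as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w") as fh:
        fh.write("a")
    cache = ReloadOnChange(path, read_text)
    first = cache.get()           # first access: file is read
    second = cache.get()          # unchanged file: served from memory
    with open(path, "a") as fh:
        fh.write("b")             # the file grows, so its size changes
    third = cache.get()           # changed file: read again

print(first, second, third, len(calls))
```

In the dashboard, the loader could be a function that opens the file with xr.open_dataset(fn) inside a with block and calls .load() on the required time slice, so that the file handle is released between updates. Whether this avoids the HDF error entirely is an open assumption, since a read could still coincide with an append.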