I'm trying to find the best/fastest way to save a dask array to a geotiff.
I've been doing research on how to make rasterio/GDAL thread-safe but I'm not coming up with anything that is easily accessible from python.
If I use da.store(..., lock=False)
then the file can get corrupted.
I tried opening and closing the file for each write, but that still uses the same file descriptor (correct me if I'm wrong) and isn't a great solution anyway.
Does anyone have another way of doing this so that each dask worker (thread) can safely write to the geotiff file created by the rasterio library? The current working solution is to leave dask.store
with its default of lock-True
.
I'm guessing any other solution will involve thread locking anyway, but I thought it would be good to have this solution here anyway.
My working example code is below:
import dask.array as da
import numpy as np
import rasterio
from rasterio.windows import Window
class RIOFile(object):
"""Rasterio wrapper to allow da.store to do window saving."""
def __init__(self, *args, **kwargs):
"""Initialize the object."""
self.args = args
self.kwargs = kwargs
self.rfile = None
def __setitem__(self, key, item):
"""Put the data chunk in the image."""
if len(key) == 3:
indexes = list(range(
key[0].start + 1,
key[0].stop + 1,
key[0].step or 1
))
y = key[1]
x = key[2]
else:
indexes = 1
y = key[0]
x = key[1]
chy_off = y.start
chy = y.stop - y.start
chx_off = x.start
chx = x.stop - x.start
# band indexes
self.rfile.write(item, window=Window(chx_off, chy_off, chx, chy),
indexes=indexes)
def __enter__(self):
"""Enter method."""
self.rfile = rasterio.open(*self.args, **self.kwargs)
return self
def __exit__(self, exc_type, exc_value, traceback):
"""Exit method."""
self.rfile.close()
rows = cols = 21696
a = da.ones((4, rows, cols), dtype=np.float64, chunks=(1, 4096, 4096) )
a = a * np.array([255., 255., 255., 255.])[:, None, None]
a = a.astype(np.uint8)
with RIOFile('test.tif', 'w', driver='GTiff', width=cols, height=rows, count=4, dtype=np.uint8) as r_file:
da.store(a, r_file, lock=True)
Change lock=False
for a probably corrupt file.
I've also been able to get successful, although not guaranteed, output by increasing GDAL's internal cache size.