Context: I have a netcdf file that I want to reproject. It is a costly operation, and I am learning how to use dask
and zarr
to do it efficiently without crashing my RAM.
Code presentation: ds
is a 3D xarray
dataset (dimensions: time
, y
, x
). This array is in projection EPSG:32607
and I want it in EPSG:3413
. To do that, I open the dataset by chunking it with dask
, and resample it reproject it with rioxarray
. I then save it as a zarr
(with either xarray
or dask
, idk which one is the best).
Problems: I want to do two things:
- Have the reprojection work on the chunks without crashing my memory.
- Find valid chunk size automatically so I can save the dataset to
.zarr
throughdask
(last line of code).
How can I do that ?
Code:
# Fetch the path for essential python scripts, grab only the variable "v"
#ds = xr.open_dataset('Cubes/temp/Cube_32607.nc', chunks=({'time': 100, 'y': 100, 'x': 100})).v
# Example reproducing dataset the size of ds
start_date = np.datetime64('1984-06-01')
end_date = np.datetime64('2020-01-01')
num_dates = 18206
len_x = 333
len_y = 334
# Generate time dimension
date_array = date_range = np.linspace(start_date.astype('datetime64[s]').view('int64'), end_date.astype('datetime64[s]').view('int64'), num_dates).astype('datetime64[s]')
# Generate spatial dimension
xs = np.linspace(486832.5, 566512.5, 333)
ys = np.linspace(6695887.5, 6615967.5, 334)
# Create example data
v = np.random.rand(len(date_array), len(ys), len(xs))
ds = xr.Dataset({'v': (('time', 'y', 'x'), v)},
coords={'x': xs, 'y': ys, 'time': date_array}, chunks=({'mid_date': 100, 'y': 100, 'x': 100}))
# Import pyproj to get the CRS
import pyproj
# Attribute a projection to the dataset (currently missing its projection)
ds = ds.rio.write_crs("EPSG:32607", inplace=True)
# Reproject the dataset
ds = ds.rio.reproject("EPSG:3413",resampling=Resampling.bilinear)
# Save it a zarr
ds.to_zarr('Cubes/temp/Reprojected_Cube/Reprojected_Cube.zarr', mode='w', compute=False)
# Trigger the parallel computation to write the data to the Zarr store (Is this line a better way to do it ?)
dask.compute(ds.to_zarr('Cubes/temp/Reprojected_Cube/Reprojected_Cube.zarr', mode='w'))