How to reproject xarray dataset memory efficiently with chunks and dask?

Question

Context: I have a netcdf file that I want to reproject. It is a costly operation, and I am learning how to use dask and zarr to do it efficiently without crashing my RAM.

Code presentation: ds is a 3D xarray dataset (dimensions: time, y, x). This array is in projection EPSG:32607 and I want it in EPSG:3413. To do that, I open the dataset by chunking it with dask, and resample it reproject it with rioxarray. I then save it as a zarr (with either xarray or dask, idk which one is the best).

Problems: I want to do two things:

Have the reprojection work on the chunks without crashing my memory.
Find valid chunk size automatically so I can save the dataset to .zarr through dask (last line of code).

How can I do that ?

Dataset description:

Code:

# Fetch the path for essential python scripts, grab only the variable "v"
#ds = xr.open_dataset('Cubes/temp/Cube_32607.nc', chunks=({'time': 100, 'y': 100, 'x': 100})).v

# Example reproducing dataset the size of ds

start_date = np.datetime64('1984-06-01')
end_date = np.datetime64('2020-01-01')
num_dates = 18206
len_x = 333
len_y = 334
# Generate time dimension
date_array = date_range = np.linspace(start_date.astype('datetime64[s]').view('int64'), end_date.astype('datetime64[s]').view('int64'), num_dates).astype('datetime64[s]')
# Generate spatial dimension
xs = np.linspace(486832.5, 566512.5, 333)
ys = np.linspace(6695887.5, 6615967.5, 334)

# Create example data
v = np.random.rand(len(date_array), len(ys), len(xs))
ds = xr.Dataset({'v': (('time', 'y', 'x'), v)},
                 coords={'x': xs, 'y': ys, 'time': date_array}, chunks=({'mid_date': 100, 'y': 100, 'x': 100}))



# Import pyproj to get the CRS
import pyproj

# Attribute a projection to the dataset (currently missing its projection)
ds = ds.rio.write_crs("EPSG:32607", inplace=True)

# Reproject the dataset
ds = ds.rio.reproject("EPSG:3413",resampling=Resampling.bilinear)

# Save it a zarr
ds.to_zarr('Cubes/temp/Reprojected_Cube/Reprojected_Cube.zarr', mode='w', compute=False)

# Trigger the parallel computation to write the data to the Zarr store (Is this line a better way to do it ?)
dask.compute(ds.to_zarr('Cubes/temp/Reprojected_Cube/Reprojected_Cube.zarr', mode='w'))

score 1 · Accepted Answer · answered May 05 '23 at 06:56

1

So as far as I am aware reprojection itself in rioxarray isn't dask compatible so you might run into issues.

However there are workarounds using rasterios WarpedVRT:

epsg_to = 4326
with rasterio.open(f) as src:
    print('Source CRS:' +str(src.crs))
    with WarpedVRT(src,resampling=1,src_crs=src.crs,crs=crs.CRS.from_epsg(epsg_to),warp_mem_limit=12000,warp_extras={'NUM_THREADS':2}) as vrt:
        print('Destination CRS:' +str(vrt.crs))
        ds = rioxarray.open_rasterio(vrt).chunk({'x':1500,'y':1500,'band':1}).to_dataset(name='HLS_Red')

ds = ds.persist()
wait(ds)
ds

Using this you would have to open and reproject the netCDF variables individually. You can open an individual netCDF variable using this syntax: 'netcdf:/path/to/file.nc:variable'

Also see the git issues here and here as well as the github gist where this code comes.

answered May 05 '23 at 06:56

JonasV

792
5
16

Thanks ! I'll wait to have it compute before accepting this as the final answer. It's 3h in so far, still going – Nihilum May 06 '23 at 00:42
Running for now 36h and no sign of finishing. `f = 'netcdf:Cubes/temp/Cube_32607.nc:v' epsg_to = 3413 with rasterio.open(f) as src: print('Source CRS:' +str(src.crs)) with WarpedVRT(src,resampling=1,src_crs=src.crs,crs=crs.CRS.from_epsg(epsg_to),warp_mem_limit=12000,warp_extras={'NUM_THREADS':2}) as vrt: print('Destination CRS:' +str(vrt.crs)) ds = rioxarray.open_rasterio(vrt).ch unk({'x':100,'y':100,'band':20}).to_dataset(name='Cubes/temp/Reprojected_Cube/test.nc') ds = ds.persist() wait(ds) ds ` – Nihilum May 08 '23 at 16:32
Any idea what is going wrong ? The whole thing is 18206x334x333 – Nihilum May 08 '23 at 16:34
1

So the parameter `warp_mem_limit=12000` is actually pretty low. It's only 12kb so you might want to try to take that out or tweak that. I haven't done this myself yet, only looked into it quite a bit so unfortunately I can't give much better advice than that. – JonasV May 09 '23 at 10:07
I will accept your answer since it works, I think for large datasets I will just have to wait for dask to be implemented in the reprojection. There's some activity on xarray's github about that, so fingers crossed ! – Nihilum May 29 '23 at 16:49

How to reproject xarray dataset memory efficiently with chunks and dask?

1 Answers1