I am loading the following data using `xr.open_mfdataset`. There is 16 GB of data spread across many files.
```python
import xarray as xr
from datetime import datetime
from pathlib import Path
from dask.diagnostics import ProgressBar


def add_time_dim(xda: xr.Dataset) -> xr.Dataset:
    # give each file a time dimension so the files can be concatenated
    # https://stackoverflow.com/a/65416801/9940782
    xda = xda.expand_dims(time=[datetime.now()])
    return xda


raw_folder = data_dir / "raw/modis_ndvi_1000"
files = list(raw_folder.glob("*.nc"))
data = xr.open_mfdataset(files, preprocess=add_time_dim)
data
```
```
<xarray.Dataset>
Dimensions:     (time: 647, lat: 5600, lon: 4480)
Coordinates:
  * time        (time) datetime64[ns] 2004-04-30 2005-10-10 ... 2018-10-31
  * lat         (lat) float64 -19.99 -19.98 -19.97 -19.96 ... 29.98 29.99 30.0
  * lon         (lon) float64 20.0 20.01 20.02 20.03 ... 59.96 59.97 59.98 59.99
Data variables:
    modis_ndvi  (time, lat, lon) float32 dask.array<chunksize=(1, 5600, 4480), meta=np.ndarray>
```
After selecting my region of interest I have roughly halved the size of the dataset (~8 GB).
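The subsetting step looks something like the sketch below; the slice bounds are illustrative (read off the coordinate ranges of the resulting dataset), not my exact values.

```python
# Illustrative spatial subset: slice the lat/lon coordinates to the
# region of interest (bounds here are approximate, not my real ones).
data = data.sel(lat=slice(-5.2, 6.0), lon=slice(33.5, 42.3))
```

which gives: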
```
<xarray.Dataset>
Dimensions:     (time: 647, lat: 1255, lon: 983)
Coordinates:
  * time        (time) datetime64[ns] 2004-04-30 2005-10-10 ... 2018-10-31
  * lat         (lat) float64 -5.196 -5.187 -5.179 -5.17 ... 5.982 5.991 6.0
  * lon         (lon) float64 33.51 33.52 33.53 33.54 ... 42.26 42.27 42.28
Data variables:
    modis_ndvi  (time, lat, lon) float32 dask.array<chunksize=(1, 1255, 983), meta=np.ndarray>
```
## Every time I try to save the data, the process is Killed

How can I write this large dataset to NetCDF?
```python
out_folder = data_dir / "interim/modis_ndvi_1000_preprocessed"
out_folder.mkdir(exist_ok=True)
out_file = out_folder / f"modis_ndvi_1000_{subset_str}.nc"

data.to_netcdf(out_file, compute=False)

with ProgressBar():
    print(f"Writing to {out_file}")
    data.compute()
```

```
Killed
```
What do I need to do? How do I cope with this large dataset and write it out to disk in parallel?
The process is killed because all available memory gets used up (which can be seen by watching `htop`).
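I suspect the problem is that I never compute the delayed object returned by `to_netcdf(compute=False)`, and `data.compute()` instead loads the whole array into memory. Is something like the following sketch the right pattern (computing the delayed write so dask streams chunks to disk)? I haven't confirmed that this works.

```python
# Sketch: hold on to the delayed write returned by to_netcdf(compute=False)
# and compute *that*, rather than calling data.compute(), so dask writes
# chunk by chunk instead of loading the full array into memory.
write_job = data.to_netcdf(out_file, compute=False)

with ProgressBar():
    print(f"Writing to {out_file}")
    write_job.compute()
```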