
I would like to load my 499 NetCDF files with xarray and concatenate them; however, it seems that I am stuck at saving the resulting file.

Here's my code snippet:

import pandas as pd
import numpy as np
import xarray as xr
import os

# collect all precipitation NetCDF files in the current working directory
files_xr = [f for f in os.listdir(os.getcwd()) if f.startswith("Precipitation") and f.endswith(".nc")]
# open and concatenate them into a single dataset
files_xr_mer = xr.open_mfdataset(files_xr, combine='by_coords')
# record the units as a global attribute
files_xr_mer.attrs['units'] = 'mm'
# write the merged dataset to one NetCDF file
new_filename_1 = './prec_file_testing.nc'
files_xr_mer.to_netcdf(path=new_filename_1)

Traceback (most recent call last)
MemoryError: Unable to allocate 2.80 GiB for an array with shape (29, 3601, 7199) and data type float32 

Thanks for any suggestions! I would definitely like to use Python, keeping NCO or CDO as a last option!

Lukáš Tůma

1 Answer


You could try passing a value for the `chunks` keyword in `open_mfdataset`. This should enable data streaming, so that not everything is loaded into memory at once. https://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html

E.g. `chunks={"time": 1}`, if `time` is one of your dimensions, will result in chunks being loaded one by one. There might be some interaction with the concatenation; you might have to take into account how the concatenation is happening to make it (more) efficient.
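
For example, a minimal sketch of what this could look like, assuming the same file pattern and output path as in your question and that `time` is one of the dimensions in your files:

import xarray as xr

# chunks={"time": 1} gives each variable a dask-backed array split along the time dimension,
# so the data is streamed chunk by chunk instead of being loaded into memory all at once
files_xr_mer = xr.open_mfdataset("Precipitation*.nc", combine='by_coords', chunks={"time": 1})

# writing a dask-backed dataset to NetCDF also proceeds chunk by chunk
files_xr_mer.to_netcdf('./prec_file_testing.nc')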

See also this documentation: https://xarray.pydata.org/en/stable/dask.html#chunking-and-performance

Huite Bootsma
  • Thank you! I will test it. Actually, I was trying to use Dask and delayed computing or `parallel=True` in `open_mfdataset`, and it didn't help me... So `chunks` seems promising – Lukáš Tůma Feb 17 '21 at 08:53
  • Okay... I found out that the agERA5 files for precipitation flux from the year 2018 up to now are probably broken, so I will use something less buggy... – Lukáš Tůma Feb 26 '21 at 12:41
  • Even with chunking we were running into memory problems. We were reading Zarr and writing NetCDF, and memory would grow until the kernel crashed. Because we can't write NetCDF in parallel, we weren't starting a cluster, but that wasn't taking advantage of the new memory management in dask distributed version `2021.07.1`. But when we created a client with one worker, `client = Client(n_workers=1)`, memory didn't grow and it worked beautifully! – Rich Signell Jul 29 '21 at 12:42
  • Yes, that is an excellent suggestion. I've personally also seen a number of cases where the default scheduler results in memory crashes, but `from dask.distributed import Client; client = Client(...)` solved my issues immediately! (See the sketch after these comments.) – Huite Bootsma Jul 29 '21 at 20:02
  • Could you share the complete line you wrote for "client(...)"? – fyec Feb 23 '22 at 14:07
  • I am getting this warning when I put `chunks={"time": 1}`: "PerformanceWarning: Slicing is producing a large chunk. To accept the large chunk and silence this warning, set the option with dask.config.set(**{'array.slicing.split_large_chunks': False}):" – fyec Feb 24 '22 at 06:32
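
Following up on the comments above, here is a minimal sketch of the single-worker client setup Rich Signell describes, combined with the chunked read from the answer. The file pattern and output path are assumptions carried over from the question, and the `dask.config.set` line is only needed if you want to silence the large-chunk warning quoted in the last comment:

import dask
import xarray as xr
from dask.distributed import Client

# a single local worker keeps memory from growing while writing NetCDF,
# which cannot be written in parallel anyway
client = Client(n_workers=1)

# optional: accept large chunks and silence the PerformanceWarning
dask.config.set(**{'array.slicing.split_large_chunks': False})

files_xr_mer = xr.open_mfdataset("Precipitation*.nc", combine='by_coords', chunks={"time": 1})
files_xr_mer.to_netcdf('./prec_file_testing.nc')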