
I use xarray and dask to open multiple netCDF4 files, around 200 GB in total, via

import xarray as xr

ds = xr.open_mfdataset('/path/files*.nc', parallel=True)

The dimensions of this dataset "ds" are (longitude, latitude, height, time). The files are automatically concatenated along time, which is okay. Now I would like to apply the "svd_compressed" function from the dask library. To do that, I would like to reshape the longitude, latitude, and height dimensions into a single dimension, so that I have a 2-d matrix on which I can apply the SVD.
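For illustration, stacking the three spatial dimensions into one yields exactly such a 2-d matrix (tiny random data stands in for the real 200 GB dataset; the dimension names follow the description above):

```python
import numpy as np
import xarray as xr

# Tiny random stand-in for the real dataset; dimension names
# match those described above.
ds = xr.Dataset(
    {"t": (("time", "longitude", "latitude", "height"),
           np.random.rand(4, 3, 2, 2))}
)

# Collapse the three spatial dimensions into a single "z" dimension,
# giving a 2-d (time, space) matrix.
stacked = ds["t"].stack(z=("longitude", "latitude", "height"))
print(stacked.shape)  # (4, 12)
```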

I tried using the

dask.array.reshape

function, but I get "'Dataset' object has no attribute 'shape'".

I can convert the xarray dataset to an array and use stack, which makes it 2-d, but if I then use

Dataset.to_dask_dataframe

to convert my xarray to a dask dataframe, my memory runs out.

Does somebody have an idea how I can tackle this problem? Should I chunk my data differently for the "to_dask_dataframe" function? Or can I somehow use the dask "svd_compressed" function on the loaded netCDF4 dataset without a reshape?

Thanks for the help.

Edit:

Here is a code example that is not working. I have downloaded data from ERA5 (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=overview), which I load from disk. After that, I take the temperature data and stack the longitude, latitude, and level values into one dimension to get a time-space 2-d array. Then I would like to apply an SVD to the data.

from dask.distributed import Client, progress
import xarray as xr
import dask
import dask.array as da

client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='9GB')

ds = xr.open_mfdataset('/home/user/Arbeit/ERA5/Data/era_5_m*.nc', parallel=True)
ds = ds['t']
ds = ds.stack(z=("longitude", "latitude", "level"))

u, s, v = da.linalg.svd_compressed(ds, k=5, compute=True)

I get the error "dot only operates on DataArrays." I assume it's because I need to convert it to a dask array, so I do

da = ds.to_dask_dataframe()

which gives me "'DataArray' object has no attribute 'to_dask_dataframe'". So I try

ds = ds.to_dataset(name="temperature")
da = ds.to_dask_dataframe()

which results in "Unable to allocate 89.4 GiB for an array with shape". I guess I need to rechunk it?

FelixK
  • Can you provide all your code? And are you using [`dask.array.linalg.svd`](https://docs.dask.org/en/stable/generated/dask.array.linalg.svd.html)? That needs a dask array, not a dataframe, so you should be able to do this without converting to a dataframe – Michael Delgado Dec 23 '22 at 15:36
  • Hi @MichaelDelgado, thanks for the response. So I want to extract specific variables and their respective data values from all those netCDF4 files loaded via xarray with `xr.open_mfdataset('/path/files*.nc', parallel=True)`. I then collect them in a 2-d dask array and apply `da.linalg.svd_compressed`, because my DataArray will be very large (at least around 30-40 GB). But converting from xarray to dask causes problems, because it runs out of memory every time. Am I missing a trick here? – FelixK Dec 25 '22 at 03:58
  • @MichaelDelgado I'm sorry. I would like to give you more code, but all I did so far is try out different combinations of `ds.stack()`, `ds.to_array()`, and the above-mentioned functions. I always run into the problem that my memory runs out. – FelixK Dec 25 '22 at 04:14
  • But please provide your full actual code. See the guide to [ask] and how to create a [mre]. For example, from your description I can’t see why you’re converting to a dask.dataframe, but you include that code. Provide one implementation and the corresponding error. – Michael Delgado Dec 25 '22 at 07:32
  • @MichaelDelgado okay, sorry about that. I added an example code in an edit section. Hope that clarifies. Thanks in advance. – FelixK Dec 26 '22 at 16:00
  • xr.DataArray, xr.Dasaset, dask.array, and dask.dataframe are all different things. Don’t use dask.dataframe for this. To access the dask.array behind a chunked xr.DataArray, just use `da.data`. But it’s still hard to track exactly where the error is coming from. When asking about errors, please always include the [full traceback](//realpython.com/python-traceback) – Michael Delgado Dec 26 '22 at 17:35
  • @MichaelDelgado Thank you very much! That solved the problem. Wow, I feel dumb, such a simple solution... – FelixK Dec 27 '22 at 15:45
  • Not dumb! It’s usually something simple - all these libraries take some getting used to. Glad that helped! – Michael Delgado Dec 27 '22 at 15:51
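For reference, the `.data` route suggested in the comments can be sketched like this (toy sizes, an illustrative chunking, and a small `k` stand in for the real ERA5 arrays):

```python
import numpy as np
import xarray as xr
import dask.array as da

# Small chunked stand-in for the temperature DataArray that would come
# from xr.open_mfdataset; shape and chunks are illustrative only.
t = xr.DataArray(
    np.random.rand(10, 4, 3, 2),
    dims=("time", "longitude", "latitude", "level"),
).chunk({"time": 5})

# Stack the spatial dimensions into one, giving a 2-d (time, z) array.
stacked = t.stack(z=("longitude", "latitude", "level"))

# .data exposes the underlying dask array, which is what
# da.linalg.svd_compressed expects -- no dataframe conversion needed.
u, s, v = da.linalg.svd_compressed(stacked.data, k=5)
print(u.shape, s.shape, v.shape)  # (10, 5) (5,) (5, 24)
```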

0 Answers