
The Xarray and Dask documentation explains how to load a large NetCDF dataset into Xarray with Dask backing. But what if I have a very large CSV file (> 1 GB)? Is it possible to load it into an Xarray Dataset through Dask, either by loading it into Xarray in a way that engages the Dask backend, or by loading it as a Dask DataFrame and converting that DataFrame into an Xarray Dataset?

I'd like to use Xarray with this dataset because it consists of experimental results, and I would like to index into those results by the settings of the independent variables (which I would use as dimensions).

Xarray's `from_dataframe()` (http://xarray.pydata.org/en/stable/generated/xarray.Dataset.from_dataframe.html?highlight=from_dataframe) mentions support only for Pandas DataFrames and says nothing about Dask.
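To make the goal concrete, here is a minimal sketch of the Pandas-only route on a tiny in-memory table. The column names (`temperature`, `pressure`, `yield_`) and values are invented for illustration and are not from the real dataset. It covers only the small-data case: it assumes everything fits in memory and does not involve Dask at all.

```python
import pandas as pd
import xarray as xr

# Hypothetical experimental results: two independent variables
# (temperature, pressure) and one measured value (yield_).
df = pd.DataFrame(
    {
        "temperature": [10, 10, 20, 20],
        "pressure": [1, 2, 1, 2],
        "yield_": [0.1, 0.2, 0.3, 0.4],
    }
)

# Index by the independent variables so that each becomes a dimension.
ds = xr.Dataset.from_dataframe(df.set_index(["temperature", "pressure"]))

# Select a result by the settings of the independent variables.
val = float(ds["yield_"].sel(temperature=20, pressure=1))
```

This is the kind of label-based selection I am after, but applied to a CSV that is too large to read into a single Pandas DataFrame.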

The Dask documentation (https://examples.dask.org/xarray.html) describes only loading from a saved Xarray Dataset.

Robert P. Goldman
  • perhaps [this post](https://stackoverflow.com/questions/65490931/xarray-loading-several-csv-files-into-a-dataset) is helpful? – Val Jan 07 '21 at 11:31
  • Thanks, @Val -- I did read that question and answer, but it does not seem applicable to me, because I have a single, enormous CSV, not a set of individual ones to assemble. Also, this relied on pandas to read the CSV, and then transforms the individual pandas dataframes into Xarray structures. But AFAICT there's no way to translate a *Dask* Dataframe to Xarray, only *Pandas* Dataframes. – Robert P. Goldman Jan 07 '21 at 14:28
  • Right, I misread. Maybe a bit more info would be good to know what you want to do. Why is using xarray a requirement? Only because of dask or do you need it downstream? – Val Jan 07 '21 at 14:35
  • @Val -- I will try to edit this into the question, but for now there are two reasons to use xarray -- one because multi-dimensional indexing would be helpful (and dask's accessing primitives like `loc` don't easily address chunks of data) and two because the dask API is so confusing and poorly documented (there are a ton of places where the docs are essentially "this is what Pandas does, we do something like this, but we aren't going to tell you what will be different"). – Robert P. Goldman Jan 07 '21 at 16:54
