I am using vaex and dask together for some analysis. In the first part of the analysis I do some processing with dask.dataframe, and I want to export the resulting dataframe into something vaex can read, i.e. a memory-mappable format such as HDF5 or Arrow.
Dask can export to HDF5 and Parquet files, and vaex can import HDF5 and Arrow. Both can export and import CSV files, but I want to avoid that.
So far I have found the following options (and problems):
- If I export to an HDF5 file, vaex cannot import it, because dask writes the file in a row-based layout while vaex expects a column-based layout (https://vaex.readthedocs.io/en/latest/faq.html). There is a sketch of this attempt below the list.
- I can export the data to Parquet files, but I don't know how to read them from vaex. I have seen an answer on SO that converts the files into an Arrow table, but that requires loading the whole table into memory, which I can't do because the table is too large to fit into memory. A sketch of this attempt is also below.
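
This is roughly what the HDF5 attempt looks like; the initial read and the file names are placeholders for my real pipeline:

```python
import dask.dataframe as dd
import vaex

# placeholder for the dask processing done in the first part of the analysis
ddf = dd.read_csv("raw-*.csv")

# dask writes a pandas/PyTables-style, row-oriented HDF5 file...
ddf.to_hdf("processed.hdf5", key="/data")

# ...which vaex refuses to open, as explained in the FAQ linked above
df = vaex.open("processed.hdf5")
```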
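
And the Parquet route, following the SO answer (if I recall correctly it goes through `vaex.from_arrow_table`; again, file names are placeholders):

```python
import dask.dataframe as dd
import pyarrow.parquet as pq
import vaex

# placeholder for the dask processing
ddf = dd.read_csv("raw-*.csv")

# the export itself is fine and stays out of core
ddf.to_parquet("processed_parquet/")

# the approach from the SO answer: read the Parquet files back as one Arrow table,
# but read_table materialises the whole dataset in RAM, which is what I need to avoid
table = pq.read_table("processed_parquet/")
df = vaex.from_arrow_table(table)
```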
I could of course export to CSV, load it into vaex in chunks, and then export it to a column-format HDF5 file (sketch below), but that seems to defeat the purpose of using two libraries built for larger-than-memory data.
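
The CSV round trip would be something along these lines; it works, but the dataset gets read and written twice:

```python
import dask.dataframe as dd
import vaex

# placeholder for the dask processing
ddf = dd.read_csv("raw-*.csv")

# write everything back out as text...
ddf.to_csv("processed.csv", single_file=True)

# ...then let vaex read it back in chunks and convert it to a column-format HDF5 file
df = vaex.from_csv("processed.csv", convert="processed.hdf5", chunk_size=5_000_000)
```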
Is there an option I am missing that would "bridge" the two libraries without either loading the full table into memory or having to read/write the dataset twice?