I am using vaex and dask together for some analysis. In the first part of the analysis I do some processing with dask.dataframe, and I want to export the resulting dataframe into something vaex can read, i.e. a memory-mappable format such as HDF5 or Arrow.
Dask can export to HDF5 and Parquet files, and vaex can import HDF5 and Arrow. Both can export and import CSV files, but I want to avoid that.
So far I have found the following options (and problems):
- If I export to an HDF5 file, vaex cannot import it, because dask writes the file in a row-based layout while vaex expects a column-based layout (https://vaex.readthedocs.io/en/latest/faq.html). There is a sketch of this attempt below the list.
- I can export the data to Parquet files, but I don't know how to read them from vaex. I have seen an answer on SO that converts the files into an Arrow table, but that requires loading the whole table into memory, which I can't do because the table is too large to fit into memory. A sketch of this attempt is also below.
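
This is roughly what the HDF5 attempt looks like; the initial read and the file names are placeholders for my real pipeline:

```python
import dask.dataframe as dd
import vaex

# placeholder for the dask processing done in the first part of the analysis
ddf = dd.read_csv("raw-*.csv")

# dask writes a pandas/PyTables-style, row-oriented HDF5 file...
ddf.to_hdf("processed.hdf5", key="/data")

# ...which vaex refuses to open, as explained in the FAQ linked above
df = vaex.open("processed.hdf5")
```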
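
And the Parquet route, following the SO answer (if I recall correctly it goes through `vaex.from_arrow_table`; again, file names are placeholders):

```python
import dask.dataframe as dd
import pyarrow.parquet as pq
import vaex

# placeholder for the dask processing
ddf = dd.read_csv("raw-*.csv")

# the export itself is fine and stays out of core
ddf.to_parquet("processed_parquet/")

# the approach from the SO answer: read the Parquet files back as one Arrow table,
# but read_table materialises the whole dataset in RAM, which is what I need to avoid
table = pq.read_table("processed_parquet/")
df = vaex.from_arrow_table(table)
```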
I could of course export to CSV, load it into vaex in chunks, and then export it to a column-format HDF5 file (sketch below), but that seems to defeat the purpose of using two libraries built for larger-than-memory data.
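
The CSV round trip would be something along these lines; it works, but the dataset gets read and written twice:

```python
import dask.dataframe as dd
import vaex

# placeholder for the dask processing
ddf = dd.read_csv("raw-*.csv")

# write everything back out as text...
ddf.to_csv("processed.csv", single_file=True)

# ...then let vaex read it back in chunks and convert it to a column-format HDF5 file
df = vaex.from_csv("processed.csv", convert="processed.hdf5", chunk_size=5_000_000)
```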
Is there an option I am missing that would "bridge" the two libraries without either loading the full table into memory or having to read/write the dataset twice?