Workflow for modifying an hdf5 file in vaex

Question

As a follow-up to my previous question [1]: is there a way to open an HDF5 dataset in vaex, perform operations, and then store the results back to the same dataset?

I tried the following:

import vaex as vx

vxframe = vx.open('somedata.hdf5')
vxframe = some_transformation(vxframe)
vxframe.export_hdf5('somedata.hdf5')

This results in the error OSError: Unable to create file (unable to truncate a file which is already open), so h5py can't write to the file while it is open. Is there another workflow to achieve this? I can write to another file as a workaround, but that seems quite inefficient, since (I imagine) it also has to copy all the data that has not changed.

[1] Convert large hdf5 dataset written via pandas/pytables to vaex

Maarten Breddels · Accepted Answer · 2020-01-07T18:03:58.543


Copying to a new file would be no less efficient than overwriting the original file (at least not for this example), since the same number of bytes has to be written either way. I also would not recommend overwriting in place: if you make a mistake, you will mess up your data.

Exporting data is actually quite efficient, but even better, you can also choose to just export the columns you want:

import vaex

df = vaex.open('somedata.hdf5')
df2 = some_transformation(df)
# export only the newly computed columns to a separate file
df2[['new_column1', 'new_columns2']].export('somedata_extra.hdf5')
...
# next time
df = vaex.open('somedata.hdf5')
df2 = vaex.open('somedata_extra.hdf5')
df = df.join(df2)  # join without a column name joins on a row basis

We used this approach a lot to create auxiliary datasets on disk that were precomputed. Joining them back (on a row basis) is instant; it does not take any time or memory.

edited Jan 07 '20 at 18:03

answered Dec 21 '19 at 18:00


Sorry I'm getting back so late, but I always get an `AttributeError: 'Hdf5MemoryMapped' object has no attribute 'merge'` when I do this. – sobek Jan 07 '20 at 14:53

That should have been `join`; `merge` does not exist, sorry for the confusion. – Maarten Breddels Jan 07 '20 at 18:03
You can also do `df = vaex.open('somedata*.hdf5')`, which will automatically open and combine the files. – chris.currin Aug 03 '20 at 10:52
