
As a sort of follow-up to my previous question [1], is there a way to open an HDF5 dataset in vaex, perform operations, and then store the results to the same dataset?

I tried the following:

import vaex as vx

vxframe = vx.open('somedata.hdf5')
vxframe = some_transformation(vxframe)
vxframe.export_hdf5('somedata.hdf5')

This results in the error OSError: Unable to create file (unable to truncate a file which is already open), so h5py can't write to the file while it is open. Is there another workflow to achieve this? I can write to another file as a workaround, but that seems quite inefficient as (I imagine) it has to copy all the data that has not changed as well.

[1] Convert large hdf5 dataset written via pandas/pytables to vaex

sobek

1 Answer


Copying to a new file would not be less efficient than overwriting the original (at least not in this example), since the same number of bytes has to be written either way. I also would not recommend overwriting in place: if you make a mistake, you will corrupt your data.

Exporting data is actually quite efficient, but even better, you can also choose to just export the columns you want:

import vaex

df = vaex.open('somedata.hdf5')
df2 = some_transformation(df)
# export only the newly computed columns to a separate file
df2[['new_column1', 'new_column2']].export('somedata_extra.hdf5')
...
# next time
df = vaex.open('somedata.hdf5')
df2 = vaex.open('somedata_extra.hdf5')
df = df.join(df2)  # a join without a column name joins row by row

We used this approach a lot to create auxiliary, precomputed datasets on disk. Joining them back (on a row basis) is instant; it does not take any time or memory.
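If the original file really does have to end up replaced, one way to limit the risk of corrupting it is to write the full export to a temporary file first and swap it into place only once the write has succeeded. The following is a minimal sketch of that general pattern using only the Python standard library; it is not something vaex provides. The helper name replace_via_temp and its write_fn callback are hypothetical, and I have not verified how the swap behaves while vaex still has the old file memory-mapped (on some platforms the file may need to be closed first), so treat that part as an assumption.

```python
import os
import tempfile


def replace_via_temp(path, write_fn):
    """Write new contents to a temp file next to `path`, then swap it in.

    `write_fn` is a hypothetical callback: it receives a file path and must
    write the complete new contents there (for example a DataFrame's
    export_hdf5 method). The original file is only replaced if it succeeds.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".hdf5", dir=directory)
    os.close(fd)
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, path)  # only reached if write_fn did not raise
    except BaseException:
        # Leave the original untouched and remove the partial temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Assumed usage with the names from the question (not executed here):
# df = vaex.open('somedata.hdf5')
# df2 = some_transformation(df)
# replace_via_temp('somedata.hdf5', df2.export_hdf5)
```

This still writes every byte once, so it is no faster than exporting to a second file; the only gain is that a failed export leaves somedata.hdf5 as it was.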

Maarten Breddels