I have an HDF5 file containing a very large EArray that I would like to truncate in order to save disk space and process it more quickly. I am using the truncate method on the node containing the EArray. PyTables reports that the array has been truncated, but the file still takes up the same amount of space on disk.

Directory listing before truncation:

$ ll
total 3694208
-rw-rw-r-- 1 chris        189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3782858816 Aug 27 13:00 original.hdf5

The script I am using to truncate (main.py):

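import tables

filename = 'original.hdf5'
# Open in append mode so the existing file can be modified in place
h5file = tables.open_file(filename, 'a')
print(h5file)  # object tree before truncation
node = h5file.get_node('/recordings/0/data')
node.truncate(30000)  # keep only the first 30000 rows
print(h5file)  # object tree after truncation
h5file.close()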

Output of the script. As expected, the EArray shrinks from 43,893,300 rows to 30,000:

original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree: 
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(43893300, 43)) ''
/recordings/0/application_data (Group) ''

original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree: 
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(30000, 43)) ''
/recordings/0/application_data (Group) ''

Yet the file takes up almost exactly the same amount of space on disk:

$ ll
total 3693196
-rw-rw-r-- 1 chris        189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3781824064 Aug 27 13:03 original.hdf5

What am I doing wrong? How can I reclaim this disk space?

It would be even more useful if I could modify the contents of the EArray directly instead of using the truncate method, with something like node = node[idx1:idx2, :], so that I could choose which range of rows to keep. But when I use this syntax, the variable node simply becomes a NumPy array and the HDF5 file is not modified.

cxrodgers

1 Answer

As discussed in this question, you can't really deallocate disk space from an existing HDF5 file. That is not part of how HDF5 is designed, and therefore PyTables does not offer it either. You can either load the data from the file and rewrite it as a new file (potentially with the same name), or use the command-line utility h5repack to do that for you.
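
The "rewrite it as a new file" route could look roughly like the sketch below, which also covers the case of keeping an arbitrary row range rather than only the first rows. The function name copy_rows_to_new_file and the block parameter are made up for illustration. The sketch assumes the enlargeable dimension of the EArray is the first axis, as in the question, and it copies only that one array: other nodes (such as /recordings/0/application_data) and node attributes are not carried over.

```python
import posixpath

import tables


def copy_rows_to_new_file(src_path, dst_path, node_path, start, stop, block=100000):
    """Copy rows [start, stop) of an EArray into a fresh HDF5 file.

    Hypothetical helper: assumes the EArray grows along axis 0.
    """
    with tables.open_file(src_path, "r") as src, \
            tables.open_file(dst_path, "w") as dst:
        node = src.get_node(node_path)
        stop = min(stop, node.nrows)  # never read past the end of the array
        group_path, name = posixpath.split(node_path)
        new = dst.create_earray(
            group_path, name,
            atom=node.atom,                 # same element type as the source
            shape=(0,) + node.shape[1:],    # empty along the enlargeable axis
            filters=node.filters,           # keep the same compression settings
            expectedrows=max(stop - start, 1),
            createparents=True,             # create /recordings/0 etc. as needed
        )
        # Copy in blocks so the whole array never has to fit in memory.
        for lo in range(start, stop, block):
            new.append(node[lo:min(lo + block, stop)])
```

For the file in the question, the call might be copy_rows_to_new_file('original.hdf5', 'smaller.hdf5', '/recordings/0/data', 0, 30000). Because the new file only ever contains the rows that were kept, it does not carry the unused space of the original.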

farenorth
PyTables also comes with the [`ptrepack`](http://www.pytables.org/usersguide/utilities.html#ptrepack) utility for this – ali_m Aug 27 '15 at 17:36

`h5repack -i original.hdf5 -o smaller.hdf5` works beautifully, thanks! – cxrodgers Aug 27 '15 at 18:12