I have an HDF5 file containing a very large EArray that I would like to truncate, both to save disk space and to process it more quickly. I am calling the truncate
method on the node containing the EArray. PyTables reports that the array has been truncated, but the file still takes up the same amount of space on disk.
Directory listing before truncation:
$ ll
total 3694208
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3782858816 Aug 27 13:00 original.hdf5
The script I am using to truncate (main.py):
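import tables
filename = 'original.hdf5'
# Open in append mode so the existing file can be modified in place
h5file = tables.open_file(filename, 'a')
print(h5file)  # object tree before truncation
node = h5file.get_node('/recordings/0/data')
node.truncate(30000)  # keep only the first 30000 rows
print(h5file)  # object tree after truncation
h5file.close()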
Output of the script. As expected, the EArray shrinks from 43893300 rows to 30000 rows:
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(43893300, 43)) ''
/recordings/0/application_data (Group) ''
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(30000, 43)) ''
/recordings/0/application_data (Group) ''
Yet the file takes up almost exactly the same amount of space on disk:
$ ll
total 3693196
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3781824064 Aug 27 13:03 original.hdf5
What am I doing wrong? How can I reclaim this disk space?
If there were a way to directly modify the contents of the EArray instead of using the truncate method, that would be even more useful to me. Something like node = node[idx1:idx2, :], so that I could select which range of rows I want to keep. But when I use this syntax, the variable node
simply becomes a NumPy array and the HDF5 file is not modified.
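
To make the slicing behaviour concrete, here is a minimal sketch on a small throwaway file. The file name example.hdf5, the node path /data, the 1000-row size, and the names subset, idx1 and idx2 are all made up for illustration and are not taken from my real file. It only shows that slicing an EArray reads the rows into memory and leaves the node on disk at its original shape:

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical small stand-in for original.hdf5 (name, path and sizes are made up)
path = os.path.join(tempfile.mkdtemp(), 'example.hdf5')

# Build an extendable array with 1000 rows of 43 columns
with tables.open_file(path, 'w') as h5file:
    earray = h5file.create_earray('/', 'data',
                                  atom=tables.Float64Atom(),
                                  shape=(0, 43))
    earray.append(np.zeros((1000, 43)))

with tables.open_file(path, 'a') as h5file:
    node = h5file.get_node('/data')
    idx1, idx2 = 100, 200
    # Slicing copies the selected rows into an in-memory NumPy array;
    # it does not change what is stored in the file
    subset = node[idx1:idx2, :]
    subset_shape = subset.shape
    shape_on_disk = tuple(node.shape)

print(type(subset))
print(subset_shape)
print(shape_on_disk)
```

Assigning the result back to the name node would only rebind that Python variable to the in-memory copy, which matches what I am seeing.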