
I have an HDF5 file that shows up as 23G on the filesystem. This seemed too big for the problem I am currently working on, so I decided to investigate.

The file contains 70 datasets of roughly 100,000 instances each (datatype is int8, compressed with gzip). I summed the size of every dataset in the file like this:

import h5py

f = h5py.File('my_file.hdf5', 'r')
names = []
f.visit(names.append)

size = 0
dataset_count = 0
for n in names:
    if isinstance(f[n], h5py.Dataset):
        # .size is the element count; with int8 data this equals bytes
        size += f[n].size
        dataset_count += 1
print("%i bytes in %i datasets out of %i items in hdf5 file."
      % (size, dataset_count, len(names)))

which outputs the following:

7342650 bytes in 70 datasets out of 176 items in hdf5 file.

I ignored group/dataset attributes, since they are capped at a certain size anyway, and in any case there were none in the file (I checked).
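For reference, a minimal sketch of one way to verify the absence of attributes with h5py (not necessarily the exact check I ran), reusing the open file f from the snippet above:

attr_counts = [len(f.attrs)]  # attributes on the root group
# visititems() calls the function once for each group/dataset below the root
f.visititems(lambda name, obj: attr_counts.append(len(obj.attrs)))
print("%i attributes found." % sum(attr_counts))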

Contrasting those 7,342,650 bytes with the reported file size of 23,622,594,194 bytes, I am at a loss. What is going on here? An HDF5 bug? A file system error?

If I run the same loop as above and copy the data to a new file (without gzip compression), I get a file that is larger than the raw data because of HDF5 overhead, but nowhere near 23G: 58,874,232 bytes.
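The copy was done roughly like this (a sketch rather than the exact code I ran; 'uncompressed.hdf5' is a placeholder name, and f and names are the objects from the snippet above):

with h5py.File('uncompressed.hdf5', 'w') as dst:
    for n in names:
        if isinstance(f[n], h5py.Dataset):
            # no compression filter is requested, so the data is stored raw;
            # h5py creates intermediate groups in the path automatically
            dst.create_dataset(n, data=f[n][...])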

The HDF5 library version is 1.8.7 and h5py is 2.5.0.

levesque
  • Can you write a short script that creates an equivalent file (say, filled with random values) with the same number of datasets, shapes, and compression? – John Readey Aug 10 '16 at 14:20
  • Having a minimal working example would make this much simpler, I agree. However, when I recreate a similar HDF5 file, the size is what I expect it to be. – levesque Aug 10 '16 at 14:30
  • It's hard to say what the issue could be then. One possibility, if the process creating the file does a lot of adding/removing of datasets, is that the file can become fragmented; you can try the HDF5 repack utility (h5repack) to see if that reduces the file size. – John Readey Aug 11 '16 at 19:51
  • That is the case: I open the file regularly and add datasets to it (I never remove any, however). Still, I can't believe any amount of fragmentation could cause a 58M file to grow to 23G. Strangely enough, after comparing machines I found I wasn't able to reproduce the error on my other computer. I was able to 'fix' the problem by forcing the most recent version of the protocol. Still not sure what caused it. – levesque Aug 12 '16 at 13:07
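For anyone hitting the same issue, here is a minimal sketch of the two workarounds mentioned in the comments above. I am assuming that "forcing the most recent version of the protocol" refers to h5py's libver='latest' option; h5repack is the repacking tool that ships with HDF5.

import h5py

# Workaround 1: open the file for writing with the newest file-format
# features enabled (assumed to be what "most recent version of the
# protocol" means in the comment above).
f = h5py.File('my_file.hdf5', 'a', libver='latest')
f.close()

# Workaround 2: repack the file to reclaim free/fragmented space with the
# h5repack tool shipped with HDF5, e.g. from the shell:
#     h5repack my_file.hdf5 repacked.hdf5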

0 Answers