I have an HDF5 file that shows up as 23G on the filesystem. This seemed too big for the problem I am currently working on, so I decided to investigate.
The file contains 70 datasets of roughly 100,000 instances each (datatype is int8, compressed with gzip). I looked at the size of each dataset in my file like this:
import h5py

f = h5py.File('my_file.hdf5', 'r')
names = []
f.visit(names.append)

size = 0
dataset_count = 0
for n in names:
    if isinstance(f[n], h5py.Dataset):
        # element count times bytes per element (itemsize is 1 for int8)
        size += f[n].size * f[n].dtype.itemsize
        dataset_count += 1

print("%i bytes in %i datasets out of %i items in hdf5 file."
      % (size, dataset_count, len(names)))
which outputs the following:
7342650 bytes in 70 datasets out of 176 items in hdf5 file.
I don't count group/dataset attributes, since they are capped at a small size anyway (64 KB each with the default attribute storage), and in any case this file has none (I checked).
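The check was essentially just walking every object and counting its attributes; a minimal sketch of that kind of check (not necessarily the exact code I ran):

import h5py

attr_count = 0

def count_attrs(name, obj):
    # obj is every group and dataset visited; .attrs works on all of them
    global attr_count
    attr_count += len(obj.attrs)

with h5py.File('my_file.hdf5', 'r') as f:
    attr_count += len(f.attrs)   # root group attributes
    f.visititems(count_attrs)    # every other group and dataset

print("%i attributes found in the file." % attr_count)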
Contrasting those 7,342,650 bytes with the listed filesize of 23,622,594,194 bytes, I am at a loss. What is going on here? HDF5 bug? File system error?
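One diagnostic that might narrow this down (a sketch using the low-level h5py bindings, which I have not run yet, so the exact numbers it reports are an assumption on my part): compare the storage actually allocated to each dataset on disk with the file size the library itself reports.

import h5py

f = h5py.File('my_file.hdf5', 'r')

allocated = 0
def tally(name, obj):
    global allocated
    if isinstance(obj, h5py.Dataset):
        # bytes actually allocated on disk for this dataset (after compression)
        allocated += obj.id.get_storage_size()

f.visititems(tally)

print("%i bytes allocated to datasets; file size reported by HDF5: %i bytes."
      % (allocated, f.id.get_filesize()))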
If I run the same loop as in the first code block above and copy the data into a new file (without gzip compression), I get a larger file due to HDF5 overhead, but nothing close to 23 GB: 58,874,232 bytes.
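For reference, a minimal sketch of that copy step (the output filename is made up; the point is just that no compression or chunking options are passed to create_dataset):

import h5py

src = h5py.File('my_file.hdf5', 'r')
dst = h5py.File('my_file_uncompressed.hdf5', 'w')   # hypothetical output name

names = []
src.visit(names.append)

for n in names:
    if isinstance(src[n], h5py.Dataset):
        # read the full dataset and write it out with no compression filter;
        # intermediate groups are created automatically from the path
        dst.create_dataset(n, data=src[n][...])

src.close()
dst.close()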
The HDF5 library version is 1.8.7 and h5py is 2.5.0.