
I have an H5 file, file.h5, which stores an infrared image. The file is 282 KB:

$ ls -l -sh file.h5
688 -rw-r--r--  1 user  staff   282K Feb  2 00:25 file.h5

First, I load the file in Python using the h5py library.

>>> import h5py
>>> hf = h5py.File('file.h5', 'r')
>>> data = hf['infrared'][:]

Then I store the same data (read from the 282 KB H5 file) in a new H5 file.

>>> hf2 = h5py.File('file2.h5', 'w')
>>> hf2.create_dataset('infrared', data=data)

Since no processing has been applied to the data and no new fields have been added, I would expect exactly the same size. To my surprise, however, I end up with a new H5 file of 2 MB!

$ ls -l -sh file2.h5
4104 -rw-r--r--  1 user  staff   2.0M Feb  2 00:39 file2.h5

EDIT: Taking a closer look

Below, I list the parameters suggested in the comments for each of the two datasets (old and new files).

Dataset in old file (hf)

>>> hf['infrared']
<HDF5 dataset "infrared": shape (512, 512), type "<f8">
>>> hf['infrared'].size
262144
>>> hf['infrared'].shape
(512, 512)
>>> hf['infrared'].dtype
dtype('float64')
>>> hf['infrared'].chunks
(256, 256)
>>> hf['infrared'].compression
'gzip'
>>> hf['infrared'].shuffle
False

So the original dataset is stored in 256 × 256 chunks and gzip-compressed.

Dataset in new file (hf2)

>>> hf2['infrared']
<HDF5 dataset "infrared": shape (512, 512), type "<f8">
>>> hf2['infrared'].size
262144
>>> hf2['infrared'].shape
(512, 512)
>>> hf2['infrared'].dtype
dtype('float64')
>>> hf2['infrared'].chunks
>>> hf2['infrared'].compression
>>> hf2['infrared'].shuffle
False

The blank outputs mean chunks and compression are both None (the interpreter does not echo None): the new dataset is contiguous and uncompressed. That matches the file size, since 512 × 512 float64 values take 512 × 512 × 8 = 2,097,152 bytes ≈ 2.0 MB.
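
For anyone who wants to confirm where the bytes go, h5py's low-level API exposes the space actually allocated on disk for a dataset via get_storage_size():

>>> hf['infrared'].id.get_storage_size()   # on-disk bytes of the compressed chunks
>>> hf2['infrared'].id.get_storage_size()  # 512 * 512 * 8 = 2097152 bytes, stored raw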
  • Random thought: if you open the 2 MB file and save it again, how big is the resulting file? – Kevin Feb 01 '18 at 16:33
  • The same size is preserved then (2 MB). I did not generate `file.h5`... Is there any way to store data as an H5 file in a compressed manner? Other than that, I don't understand why `file.h5` is so small. – lucasrodesg Feb 01 '18 at 16:34
  • Can you post the following parameters for both datasets? `dset.size, dset.shape, dset.dtype, dset.chunks, dset.compression, dset.shuffle` – there might be an obvious solution. – jpp Feb 01 '18 at 16:38
  • I edited the post. Looks like the original H5 had some compression? – lucasrodesg Feb 01 '18 at 16:48
  • Yes, the original has compression, the new one does not. Try `hf2.create_dataset('infrared', data=data, compression='gzip')` when creating your fresh dataset. – jpp Feb 01 '18 at 16:57
  • Looks like this does the trick, thanks! I don't know why, but there is still a difference in sizes; the new file is now 310 KB. I used the same chunk size. I believe there might be another dataset parameter I am missing. Takeaway: check the Dataset attributes! – lucasrodesg Feb 01 '18 at 17:01
  • gzip has a "compression level", given via the generic option `compression_opts`; see http://docs.h5py.org/en/latest/high/dataset.html#filter-pipeline – Pierre de Buyl Feb 01 '18 at 20:14
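
Putting the comments together: a minimal sketch of a copy that reuses the original dataset's storage settings (file and dataset names as above; the original file's gzip level is not visible in the question, so compression_opts=9 below is an assumption, which may also explain the remaining 282 KB vs. 310 KB gap):

import h5py

# Read the data and the source dataset's storage settings.
with h5py.File('file.h5', 'r') as hf:
    src = hf['infrared']
    data = src[:]
    chunks = src.chunks            # (256, 256) per the question
    compression = src.compression  # 'gzip' per the question

# create_dataset does not inherit these settings from the source,
# so they must be passed explicitly.
with h5py.File('file2.h5', 'w') as hf2:
    hf2.create_dataset('infrared', data=data,
                       chunks=chunks,
                       compression=compression,
                       compression_opts=9)  # assumed level; the original's is unknown

Alternatively, hf.copy('infrared', hf2) copies the dataset together with its creation properties (chunking and filter pipeline), which sidesteps guessing the compression level.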
