compressed files bigger in h5py

Question

I'm using h5py to save numpy arrays in HDF5 format from Python. Recently I tried to apply compression, and the files I get are bigger than before.

I went from code like this (every file has several datasets)

self._h5_current_frame.create_dataset(
    'estimated position', shape=estimated_pos.shape,
    dtype=float, data=estimated_pos)

to code like this

self._h5_current_frame.create_dataset(
    'estimated position', shape=estimated_pos.shape, dtype=float,
    data=estimated_pos, compression="gzip", compression_opts=9)

In one particular example, the compressed file is 172K and the uncompressed file is 72K (and h5diff reports that both files are equal). I tried a more basic example and it works as expected, but not in my program.

How is that possible? I don't think the gzip algorithm ever produces a much bigger output than its input, so it's probably related to h5py and how I use it :-/ Any ideas?

Cheers!!

EDIT:

Judging from the h5stat output, the compressed version stores a lot more metadata (see the last few lines of each listing).

compressed file

Filename: res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3798/503
        Datasets(exclude compact data): 15904/9254
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 116824
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 33602
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 2
    Dataset layout counts[CHUNKED]: 54
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 2
        GZIP filter: 54
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 136526 bytes
  Raw data: 33602 bytes
  Unaccounted space: 5111 bytes
Total space: 175239 bytes

uncompressed file

Filename: res_totolaca_jue_2015-10-08_17:03:04_19267.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3663/452
        Datasets(exclude compact data): 15904/10200
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 0
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 50600
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 56
    Dataset layout counts[CHUNKED]: 0
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 56
        GZIP filter: 0
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 19567 bytes
  Raw data: 50600 bytes
  Unaccounted space: 5057 bytes
Total space: 75224 bytes

Did you copy the compressed versions of the arrays to a new .hdf5 file, or did you try to overwrite the ones in the existing file? HDF5 has no mechanism for freeing unused space, so if you made a compressed copy of each array within the same file and then deleted the original, your file size would likely increase to the size of the originals plus the compressed copies of the arrays. In that case you could use [`h5repack`](https://www.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Repack) to make a new copy of the file and reclaim the unused space. — ali_m, Oct 07 '15 at 18:49
I generated both files independently, so I don't think it's related to that. Actually, the problem seems to be the compressed file is mostly metadata :-/ (please check the EDIT above) — manu, Oct 08 '15 at 15:22

ali_m · Accepted Answer · 2015-10-09T10:26:01.943

First, here's a reproducible example:

import h5py
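# NOTE: scipy.misc.lena was removed from later SciPy releases; any compressible
# 512x512 int64 array can stand in, though the byte counts below will then differ.
from scipy.misc import lena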

img = lena()    # some compressible image data

f1 = h5py.File('nocomp.h5', 'w')
f1.create_dataset('img', data=img)
f1.close()

f2 = h5py.File('complevel_9.h5', 'w')
f2.create_dataset('img', data=img, compression='gzip', compression_opts=9)
f2.close()

f3 = h5py.File('complevel_0.h5', 'w')
f3.create_dataset('img', data=img, compression='gzip', compression_opts=0)
f3.close()

Now let's look at the file sizes:

~$ h5stat -S nocomp.h5
Filename: nocomp.h5
Summary of file space information:
  File metadata: 1304 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 840 bytes
Total space: 2099296 bytes

~$ h5stat -S complevel_9.h5
Filename: complevel_9.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 302850 bytes
  Unaccounted space: 1816 bytes
Total space: 316434 bytes

~$ h5stat -S complevel_0.h5
Filename: complevel_0.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2098560 bytes
  Unaccounted space: 1816 bytes
Total space: 2112144 bytes

In my example, compression with gzip -9 makes sense - although it requires an extra ~10kB of metadata, this is more than outweighed by a ~1794kB decrease in the size of the image data (about a 7:1 compression ratio). The net result is a ~6.6 fold reduction in total file size.

However, in your example the compression only reduces the size of your raw data by ~17kB (a compression ratio of about 1.5:1), which is massively outweighed by a ~117kB increase in the size of the metadata. The increase in metadata is probably so much larger than in my example because your file contains 56 datasets (54 of them compressed) rather than just one.
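
The arithmetic behind that comparison can be checked directly from the two h5stat summaries in the question. A minimal sketch; the variable names are only illustrative:

```python
# Byte counts copied from the "Summary of file space information" sections
# of the two h5stat listings in the question.
compressed = {"metadata": 136526, "raw": 33602, "unaccounted": 5111}
uncompressed = {"metadata": 19567, "raw": 50600, "unaccounted": 5057}

raw_saved = uncompressed["raw"] - compressed["raw"]
metadata_added = compressed["metadata"] - uncompressed["metadata"]
net_change = sum(compressed.values()) - sum(uncompressed.values())

print(f"raw data saved:    {raw_saved} bytes")        # 16998
print(f"metadata added:    {metadata_added} bytes")   # 116959
print(f"net size change:   {net_change:+d} bytes")    # +100015
print(f"compression ratio: {uncompressed['raw'] / compressed['raw']:.2f}:1")
```

The metadata added is roughly seven times the raw data saved, which is why the compressed file ends up larger.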

Even if gzip magically reduced the size of your raw data to zero, you would still end up with a file that was ~1.8 times larger than the uncompressed version. The size of the metadata is more or less guaranteed to scale sublinearly with the size of your arrays, so if your datasets were much larger you would start to see some benefit from compressing them. As it stands, your arrays are so small that you're unlikely to gain anything from compression.
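
Separately from the HDF5 metadata, DEFLATE itself cannot shrink very small inputs, and 36 of the 56 datasets in the question hold a single 8-byte value. Here is a small sketch using Python's standard zlib module with an arbitrary sample value. It only illustrates the algorithm's fixed overhead and does not reproduce HDF5's on-disk format:

```python
import struct
import zlib

# One float64 value, like the single-element datasets in the question.
# The value itself is arbitrary.
one_value = struct.pack("<d", 3.14159)

packed = zlib.compress(one_value, 9)

print(len(one_value))  # 8
print(len(packed))     # more than 8: stream header and checksum outweigh any gain

# The round trip is still lossless.
restored = zlib.decompress(packed)
```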

Update:

The reason why the compressed version needs so much more metadata has little to do with the compression per se. Rather, in order to use compression filters the dataset needs to be split into fixed-size chunks, and presumably a lot of the extra metadata is used to store the B-tree that is needed to index the chunks. Your h5stat output supports this: it attributes 116824 bytes to the chunked-dataset index in the compressed file, versus 0 in the uncompressed one.

f4 = h5py.File('nocomp_autochunked.h5', 'w')
# let h5py pick a chunk size automatically
f4.create_dataset('img', data=img, chunks=True)
print(f4['img'].chunks)
# (32, 64)
f4.close()

f5 = h5py.File('nocomp_onechunk.h5', 'w')
# make the chunk shape the same as the shape of the array, so that there 
# is only one chunk
f5.create_dataset('img', data=img, chunks=img.shape)
print(f5['img'].chunks)
# (512, 512)
f5.close()

f6 = h5py.File('complevel_9_onechunk.h5', 'w')
f6.create_dataset('img', data=img, chunks=img.shape, compression='gzip',
                  compression_opts=9)
f6.close()

And the resulting file sizes:

~$ h5stat -S nocomp_autochunked.h5
Filename: nocomp_autochunked.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 1816 bytes
Total space: 2110736 bytes

~$ h5stat -S nocomp_onechunk.h5
Filename: nocomp_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 96 bytes
Total space: 2101168 bytes

~$ h5stat -S complevel_9_onechunk.h5
Filename: complevel_9_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 305051 bytes
  Unaccounted space: 96 bytes
Total space: 309067 bytes

It's clear that chunking rather than compression is what incurs the extra metadata: nocomp_autochunked.h5 contains exactly the same amount of metadata as complevel_0.h5 and complevel_9.h5 above, and adding compression to the single-chunk version (complevel_9_onechunk.h5 versus nocomp_onechunk.h5) made no difference to the total amount of metadata.

Increasing the chunk size such that the array is stored as a single chunk reduced the amount of metadata by a factor of about 3 in this example. How much difference this would make in your case will probably depend on how h5py automatically selects a chunk size for your input dataset. Interestingly this also resulted in a very slight reduction in the compression ratio, which is not what I would have predicted.
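
To get a feel for what the chunk index has to track, it helps to count the chunks. A rough sketch, assuming the 512x512 array and the (32, 64) auto-selected chunk shape printed earlier; `count_chunks` is a hypothetical helper, not part of h5py:

```python
import math

def count_chunks(shape, chunks):
    """Number of chunks needed to tile an array of `shape` with `chunks`-shaped blocks."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(count_chunks((512, 512), (32, 64)))    # 128
print(count_chunks((512, 512), (512, 512)))  # 1
```

Going from 128 chunks to one only cut the metadata by a factor of about 3, and the single-chunk file still carries more metadata than the contiguous one (3920 versus 1304 bytes), so part of the cost appears to be fixed per chunked dataset.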

Bear in mind that there are also disadvantages to having larger chunks. Whenever you want to access a single element within a chunk, the whole chunk needs to be decompressed and read into memory. For a large dataset this can be disastrous for performance, but in your case the arrays are so small that it's probably not worth worrying about.
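
The cost of a single-element read can be illustrated without HDF5 at all. The sketch below uses the standard zlib module on a flat byte buffer, so it only mimics chunked storage: real HDF5 chunks are N-dimensional tiles, and `compress_in_blocks` and `read_element` are hypothetical helpers made up for this example.

```python
import array
import struct
import zlib

ITEM_BYTES = 8                      # one int64 element
N_ELEMENTS = 512 * 512              # same element count as the example image
buf = array.array("q", range(N_ELEMENTS)).tobytes()

def compress_in_blocks(data, block_bytes):
    """Compress `data` as independent blocks, loosely like HDF5 chunks."""
    return [zlib.compress(data[i:i + block_bytes], 9)
            for i in range(0, len(data), block_bytes)]

def read_element(blocks, block_bytes, index):
    """Return (value, bytes_decompressed) for one element."""
    offset = index * ITEM_BYTES
    # The whole block containing the element has to be decompressed.
    block = zlib.decompress(blocks[offset // block_bytes])
    (value,) = struct.unpack_from("=q", block, offset % block_bytes)
    return value, len(block)

small = 32 * 64 * ITEM_BYTES        # 16384 bytes, like a (32, 64) chunk
whole = N_ELEMENTS * ITEM_BYTES     # 2097152 bytes, a single chunk

small_result = read_element(compress_in_blocks(buf, small), small, 100000)
whole_result = read_element(compress_in_blocks(buf, whole), whole, 100000)
print(small_result)  # (100000, 16384)
print(whole_result)  # (100000, 2097152)
```

Both reads return the same value, but the single-block layout has to inflate 128 times as many bytes to get it.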

Another thing you should consider is whether you can store your datasets within a single array rather than lots of small arrays. For example, if you have K 2D arrays of the same dtype that each have dimensions MxN then you could store them more efficiently in a KxMxN 3D array rather than lots of small datasets. I don't know enough about your data to know whether this is feasible.
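
A minimal sketch of that idea, assuming h5py and numpy are installed. The sizes, the random data, and the dataset name `frames` are all made up for illustration:

```python
import h5py
import numpy as np

K, M, N = 20, 4, 3  # made-up sizes for illustration
small_arrays = [np.random.rand(M, N) for _ in range(K)]

# Combine the K separate MxN arrays into one KxMxN array.
stacked = np.stack(small_arrays)

# driver="core" with backing_store=False keeps the file in memory only;
# drop both arguments to write a real file to disk.
with h5py.File("stacked_example.h5", "w", driver="core",
               backing_store=False) as f:
    # One dataset stored as a single chunk, so only one chunk needs indexing.
    dset = f.create_dataset("frames", data=stacked, chunks=stacked.shape,
                            compression="gzip", compression_opts=9)
    print(dset.shape, dset.chunks)  # (20, 4, 3) (20, 4, 3)
    first = dset[0]  # read back the first of the original arrays
```

Whether a single chunk is the right choice depends on how the data is read back, for the reasons given above about large chunks.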

You are probably right, though it seems to me that the amount of overhead (metadata) for adding compression is insanely high. I just ran an example in which bigger files are generated (by adding more datasets of the same size), and in that case the metadata in the compressed file is about 3M whereas the raw data is about 1.6M. I was naively assuming that adding compression would amount to applying gzip just before saving each piece of data (with metadata along the lines of "algorithm=gzip"). Apparently it's not so simple — manu, Oct 09 '15 at 07:38
See my update - the underlying issue is chunking rather than compression per se — ali_m, Oct 09 '15 at 10:16
