I'm using h5py to save NumPy arrays in HDF5 format from Python. Recently I tried to apply compression, and the files I get are bigger than the uncompressed ones.
I went from calls like this (every file has several datasets):
self._h5_current_frame.create_dataset(
    'estimated position', shape=estimated_pos.shape,
    dtype=float, data=estimated_pos)
to calls like this:
self._h5_current_frame.create_dataset(
    'estimated position', shape=estimated_pos.shape, dtype=float,
    data=estimated_pos, compression="gzip", compression_opts=9)
In one particular example, the compressed file is 172K and the uncompressed file is 72K (and h5diff reports that both files are equal). I tried a more basic example and it works as expected, but not in my program.
How is that possible? I don't think the gzip algorithm ever gives a bigger compressed file, so it's probably related to h5py and how I use it. Any ideas?
Cheers!
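A stripped-down comparison can be sketched as follows. This is an illustrative assumption, not the code from my program: the helper names `write_datasets` and `compare`, the dataset names, and the random test data are all hypothetical. It writes the same arrays twice, once without and once with `compression="gzip", compression_opts=9`, and returns both file sizes.

```python
import os
import tempfile

import h5py
import numpy as np


def write_datasets(path, n_datasets, length, compress):
    """Write n_datasets float arrays of the given length; return the file size in bytes."""
    rng = np.random.default_rng(0)  # fixed seed so both files hold identical data
    # Hypothetical switch: pass the gzip options only for the compressed variant.
    extra = {"compression": "gzip", "compression_opts": 9} if compress else {}
    with h5py.File(path, "w") as f:
        for i in range(n_datasets):
            data = rng.random(length)
            f.create_dataset(f"dataset_{i}", data=data, **extra)
    return os.path.getsize(path)


def compare(n_datasets, length):
    """Return (plain_size, gzip_size) for the same data written both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = write_datasets(os.path.join(tmp, "plain.hdf5"), n_datasets, length, False)
        gzipped = write_datasets(os.path.join(tmp, "gzip.hdf5"), n_datasets, length, True)
    return plain, gzipped


if __name__ == "__main__":
    # Many one-element datasets, similar in spirit to the small datasets in my files.
    print(compare(n_datasets=50, length=1))
```

Calling `compare` with different `n_datasets` and `length` values should show whether the effect depends on having many very small datasets rather than a few large ones.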
EDIT:
Judging by the output of h5stat, it seems the compressed version stores a lot more metadata (see the last few lines of each output).
Compressed file:
Filename: res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5
File information
# of unique groups: 21
# of unique datasets: 56
# of unique named datatypes: 0
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 5
File space information for file metadata (in bytes):
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 3798/503
Datasets(exclude compact data): 15904/9254
Datatypes: 0/0
Groups:
B-tree/List: 0
Heap: 0
Attributes:
B-tree/List: 0
Heap: 0
Chunked datasets:
Index: 116824
Datasets:
Heap: 0
Shared Messages:
Header: 0
B-tree/List: 0
Heap: 0
Small groups (with 0 to 9 links):
# of groups with 1 link(s): 1
# of groups with 2 link(s): 5
# of groups with 3 link(s): 5
# of groups with 5 link(s): 10
Total # of small groups: 21
Group bins:
# of groups with 1 - 9 links: 21
Total # of groups: 21
Dataset dimension information:
Max. rank of datasets: 3
Dataset ranks:
# of dataset with rank 1: 51
# of dataset with rank 2: 3
# of dataset with rank 3: 2
1-D Dataset information:
Max. dimension size of 1-D datasets: 624
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 1: 36
# of datasets with dimension sizes 2: 2
# of datasets with dimension sizes 3: 2
Total # of small datasets: 40
1-D Dataset dimension bins:
# of datasets with dimension size 1 - 9: 40
# of datasets with dimension size 10 - 99: 2
# of datasets with dimension size 100 - 999: 9
Total # of datasets: 51
Dataset storage information:
Total raw data size: 33602
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 2
Dataset layout counts[CHUNKED]: 54
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 2
GZIP filter: 54
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
Dataset datatype information:
# of unique datatypes used by datasets: 4
Dataset datatype #0:
Count (total/named) = (20/0)
Size (desc./elmt) = (14/8)
Dataset datatype #1:
Count (total/named) = (17/0)
Size (desc./elmt) = (22/8)
Dataset datatype #2:
Count (total/named) = (10/0)
Size (desc./elmt) = (22/8)
Dataset datatype #3:
Count (total/named) = (9/0)
Size (desc./elmt) = (14/8)
Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
Total # of objects with small # of attributes: 0
Attribute bins:
Total # of objects with attributes: 0
Max. # of attributes to objects: 0
Summary of file space information:
File metadata: 136526 bytes
Raw data: 33602 bytes
Unaccounted space: 5111 bytes
Total space: 175239 bytes
Uncompressed file:
Filename: res_totolaca_jue_2015-10-08_17:03:04_19267.hdf5
File information
# of unique groups: 21
# of unique datasets: 56
# of unique named datatypes: 0
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 5
File space information for file metadata (in bytes):
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 3663/452
Datasets(exclude compact data): 15904/10200
Datatypes: 0/0
Groups:
B-tree/List: 0
Heap: 0
Attributes:
B-tree/List: 0
Heap: 0
Chunked datasets:
Index: 0
Datasets:
Heap: 0
Shared Messages:
Header: 0
B-tree/List: 0
Heap: 0
Small groups (with 0 to 9 links):
# of groups with 1 link(s): 1
# of groups with 2 link(s): 5
# of groups with 3 link(s): 5
# of groups with 5 link(s): 10
Total # of small groups: 21
Group bins:
# of groups with 1 - 9 links: 21
Total # of groups: 21
Dataset dimension information:
Max. rank of datasets: 3
Dataset ranks:
# of dataset with rank 1: 51
# of dataset with rank 2: 3
# of dataset with rank 3: 2
1-D Dataset information:
Max. dimension size of 1-D datasets: 624
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 1: 36
# of datasets with dimension sizes 2: 2
# of datasets with dimension sizes 3: 2
Total # of small datasets: 40
1-D Dataset dimension bins:
# of datasets with dimension size 1 - 9: 40
# of datasets with dimension size 10 - 99: 2
# of datasets with dimension size 100 - 999: 9
Total # of datasets: 51
Dataset storage information:
Total raw data size: 50600
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 56
Dataset layout counts[CHUNKED]: 0
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 56
GZIP filter: 0
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
Dataset datatype information:
# of unique datatypes used by datasets: 4
Dataset datatype #0:
Count (total/named) = (20/0)
Size (desc./elmt) = (14/8)
Dataset datatype #1:
Count (total/named) = (17/0)
Size (desc./elmt) = (22/8)
Dataset datatype #2:
Count (total/named) = (10/0)
Size (desc./elmt) = (22/8)
Dataset datatype #3:
Count (total/named) = (9/0)
Size (desc./elmt) = (14/8)
Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
Total # of objects with small # of attributes: 0
Attribute bins:
Total # of objects with attributes: 0
Max. # of attributes to objects: 0
Summary of file space information:
File metadata: 19567 bytes
Raw data: 50600 bytes
Unaccounted space: 5057 bytes
Total space: 75224 bytes
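To put numbers on the comparison, here is a small sketch of the arithmetic using only values copied from the two h5stat summaries above. The dictionary and variable names are just for illustration, and dividing the index size by the 54 chunked datasets is my own assumption about how to read the "Chunked datasets: Index" line.

```python
# Values copied from the two h5stat summaries (all in bytes).
compressed = {
    "metadata": 136526,
    "raw": 33602,
    "unaccounted": 5111,
    "total": 175239,
    "chunk_index": 116824,
}
uncompressed = {
    "metadata": 19567,
    "raw": 50600,
    "unaccounted": 5057,
    "total": 75224,
    "chunk_index": 0,
}

# Bytes of raw data saved by gzip.
raw_saved = uncompressed["raw"] - compressed["raw"]

# Extra file metadata in the compressed file.
metadata_added = compressed["metadata"] - uncompressed["metadata"]

# Net change in total file size.
net_growth = compressed["total"] - uncompressed["total"]

# Average chunk index size per chunked dataset (54 datasets use the CHUNKED layout).
index_per_chunked_dataset = compressed["chunk_index"] / 54

print(f"raw data saved:            {raw_saved} bytes")
print(f"metadata added:            {metadata_added} bytes")
print(f"net growth:                {net_growth} bytes")
print(f"index per chunked dataset: {index_per_chunked_dataset:.1f} bytes")
```

So gzip saves about 17 KB of raw data, but the file gains about 117 KB of metadata, almost all of it in the chunked dataset index (roughly 2 KB per chunked dataset on average).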