I ran into problems decompressing zstd-compressed data. I have HDF5 files that were compressed in the following way:

import sys

import h5py as h5
import hdf5plugin
import numpy as np

filefrom = sys.argv[1]  # path to the source HDF5 file
h5path = sys.argv[2]    # path to the dataset inside the file

f = h5.File(filefrom, 'r')
data = f[h5path]
shape_data = data.shape[1:]
num = data.shape[0]
initShape = (1,) + shape_data
maxShape = (num,) + shape_data

f_zstd = h5.File(filefrom.split('.')[0] + '_zstd.h5', 'w')
# Chunked, resizable dataset: one chunk per entry along the first axis
d_zstd = f_zstd.create_dataset(h5path, initShape, maxshape=maxShape, dtype=np.int32, chunks=initShape, **hdf5plugin.Zstd())
for i in range(num):
    d_zstd.resize((i + 1,) + shape_data)
    d_zstd[i,] = data[i,]
f_zstd.close()
f.close()
    

The compression ran without errors, but when I try to inspect the data with h5ls or h5dump, they report that the data can't be printed. Reading the compressed data in Python 3 (3.6) with h5py fails as well. I also tried h5repack (h5repack -i compressed_file.h5 -o out_file.h5 --filter=var:NONE) and the following piece of code:

import zstandard
import pathlib
import os

def decompress_zstandard_to_folder(input_file):
    input_file = pathlib.Path(input_file)
    destination_dir = os.path.dirname(input_file)
    with open(input_file, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        output_path = pathlib.Path(destination_dir) / input_file.stem
        with open(output_path, 'wb') as destination:
            decomp.copy_stream(compressed, destination)

Neither worked. h5repack produced no warnings or errors, while the code above failed with zstd.ZstdError: zstd decompressor error: Unknown frame descriptor. As far as I understand, this means the compressed data doesn't have the appropriate headers.

I use Python 3.6.7 and HDF5 1.10.5. I'm a bit confused and have no idea how to overcome this issue.

I'd be happy for any ideas or advice!

kitsune_breeze

1 Answer


I wrote a simple test to validate zstd compression behavior with a simple dataset (NumPy array of int32). I can open the HDF5 file with h5py and read the dataset. (Note: I could not open with HDFView and h5repack only reports shape and type attributes, not the data.)

I suspect an undetected error in another part of your code. Have you tested your code logic without zstd compression? If not, I suggest you start there.

Code to Write example file:

import h5py as h5
import hdf5plugin
import numpy as np

data = np.arange(1_000, dtype=np.int32).reshape(100,10)  # explicit int32 so the dtype is the same on every platform

with h5.File('test_zstd.h5','w') as f_zstd:
    d_zstd = f_zstd.create_dataset('zstd_data', data=data, **hdf5plugin.Zstd())

Code to Read example file:

import h5py as h5
import hdf5plugin  ## Note: plugin required to read

with h5.File('test_zstd.h5','r') as f_zstd:
    d_zstd = f_zstd['zstd_data']
    print(d_zstd.shape, d_zstd.dtype)
    print(d_zstd[0,:])
    print(d_zstd[-1,:])

Output from above:

(100, 10) int32
[0 1 2 3 4 5 6 7 8 9]
[990 991 992 993 994 995 996 997 998 999]

More on HDF5 and compression:
To use HDF5 utilities (like h5repack) to read a compressed file, the HDF5 installation needs the appropriate compression filter. Some filters are standard, but many (including Zstandard) require you to install a third-party filter. Links to available plugins are here: HDF5 Registered Filter Plugins
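Once a third-party filter is installed, the HDF5 library locates it through the `HDF5_PLUGIN_PATH` environment variable. As a rough sketch of how that could be wired up from Python: the helper name `env_with_hdf5_plugins` and the directory `/opt/hdf5/plugins` are placeholders, not real defaults, and the plugin has to be built for the same HDF5 library that the command-line tools use.

```python
import os


def env_with_hdf5_plugins(plugin_dir, base_env=None):
    """Return a copy of the environment with HDF5_PLUGIN_PATH set.

    Hypothetical helper: plugin_dir is the directory holding the
    filter's shared library (for example the Zstandard plugin).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HDF5_PLUGIN_PATH"] = str(plugin_dir)
    return env


# Placeholder directory; replace it with the real plugin location
env = env_with_hdf5_plugins("/opt/hdf5/plugins", base_env={"PATH": "/usr/bin"})
print(env["HDF5_PLUGIN_PATH"])

# The environment can then be passed to a tool, for example:
# subprocess.run(["h5dump", "-d", "/zstd_data", "test_zstd.h5"], env=env)
```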

You can verify the compression filter with h5dump by adding the -pH flag, like this:

E:\SO_68526704>h5dump -pH test_zstd.h5
HDF5 "test_zstd.h5" {
GROUP "/" {
   DATASET "zstd_data" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 100, 10 ) / ( 100, 10 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 100, 10 )
         SIZE 1905 (2.100:1 COMPRESSION)
      }
      FILTERS {
         USER_DEFINED_FILTER {
            FILTER_ID 32015
            COMMENT Zstandard compression: http://www.zstd.net
         }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_ALLOC
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
   }
}
}
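
As for the `Unknown frame descriptor` error from the standalone `zstandard` script in the question: the HDF5 filter compresses each chunk inside the HDF5 container, so the `.h5` file as a whole is not a Zstandard stream. A file written as above starts with the HDF5 signature rather than the Zstandard frame magic number. A small check using only the standard library (the function name `sniff_format` is made up for this sketch):

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"    # first 8 bytes of an HDF5 file
ZSTD_FRAME_MAGIC = b"\x28\xb5\x2f\xfd"   # 0xFD2FB528 stored little-endian


def sniff_format(header):
    """Classify the leading bytes of a file as 'hdf5', 'zstd' or 'unknown'."""
    if header.startswith(HDF5_SIGNATURE):
        return "hdf5"
    if header.startswith(ZSTD_FRAME_MAGIC):
        return "zstd"
    return "unknown"


# Synthetic headers for illustration; for a real file, pass
# open(path, "rb").read(8) instead
print(sniff_format(HDF5_SIGNATURE + b"\x00" * 8))  # hdf5
print(sniff_format(ZSTD_FRAME_MAGIC + b"\x00" * 8))  # zstd
print(sniff_format(b"plain text"))  # unknown
```

A stream decompressor pointed at the whole `.h5` file therefore fails on the very first bytes, regardless of how the chunks inside were compressed.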
kcw78
  • Oh, thanks. I tried your code, looked at the data with h5ls, and got `H5tools-DIAG: Error detected in HDF5:tools (1.10.5) thread 0: #000: h5tools_dump.c line 1632 in h5tools_dump_simple_dset(): H5Dread failed major: Failure in tools library minor: error in function`. Can you specify the HDF5 version you work with? I tested my code with other compression filters and it worked, but it fails with zstd, so I'm a bit confused. – kitsune_breeze Jul 26 '21 at 16:00
  • I also tried HDF5 1.12.0 with h5ls -d to look inside the data and again got `Unable to print data.` – kitsune_breeze Jul 26 '21 at 16:01
  • On the Python side, I am using conda distribution versions : `Python 3.8.3`, `h5py 3.3.0` (built with `hdf5 1.10.6`) and `hdf5plugin 3.1.1` on Windows. Try: `h5dump -pH test_zstd.h5`? Output from `h5ls test_zstd.h5`: is `zstd_data Dataset {100, 10}`. I can also open with HDFView. I don't have the zstd third party plugin installed, so can't view the data with `h5dump` or `HDFView`. `h5dump --version` returns `h5dump: Version 1.10.6` and the same for `h5ls`. – kcw78 Jul 26 '21 at 18:50
  • OK, I'll try this, thanks! It's not the first time you've helped me :) Thanks a lot! – kitsune_breeze Jul 27 '21 at 09:01
  • But I'm still confused about why I can't open `test_zstd.h5` later, for example the next day, using this: `with h5.File('test_zstd.h5','r') as f_zstd: d_zstd = f_zstd['zstd_data'] print(d_zstd.shape, d_zstd.dtype) print(d_zstd[0,:]) print(d_zstd[-1,:])` – kitsune_breeze Jul 27 '21 at 12:36
  • Did you have `import hdf5plugin` in your code to read the file? You need it to read the compressed data. My test works when I include the package, and it fails when I don't. – kcw78 Jul 27 '21 at 16:06