I am trying to use h5py to write datasets in HDF5 format. The create_dataset() method has options for choosing the compression type and filters. So far I have not found any resource that explains whether shuffle=True and fletcher32=True can be used together with compression='lzf' or 'gzip'.

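import h5py

# h5py has no open() function; files are opened with h5py.File
f = h5py.File("my_hdf_file.h5", "w")
dset = f.create_dataset("zipped_dataset", shape=(778, 181, 128, 128),
                        chunks=True,          # let h5py choose the chunk shape
                        compression="gzip",
                        compression_opts=9,   # gzip level, 0-9
                        shuffle=True)         # byte-shuffle filter
f.close()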

I know that the code above is okay, and there are books and web sources that show similar examples. But I could not find any discussion of using shuffle + fletcher32 + gzip/lzf together.

I would like to understand the benefit of using both shuffle and fletcher32 simultaneously (if that is at all possible or advisable). If anyone could explain why this should or should not be done, it would be very helpful.
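To make the combination concrete, here is a minimal sketch of what I mean. It assumes that create_dataset() accepts shuffle, fletcher32 and compression in the same call, and that the settings can be read back from the dataset's compression, shuffle and fletcher32 properties. The helper name write_and_inspect, the file name and the small test array are made up for illustration; the in-memory "core" driver is used only so that nothing is written to disk.

```python
import numpy as np
import h5py


def write_and_inspect(compression="gzip"):
    """Create a dataset with shuffle + fletcher32 + compression and
    report which filters h5py says are active (hypothetical helper)."""
    # Small illustrative array; the real dataset would be much larger.
    data = np.arange(4 * 8 * 16 * 16, dtype="int32").reshape(4, 8, 16, 16)

    # driver="core" with backing_store=False keeps the file in memory only.
    with h5py.File("filter_demo.h5", "w", driver="core",
                   backing_store=False) as f:
        dset = f.create_dataset("zipped_dataset", data=data,
                                chunks=True,
                                compression=compression,
                                shuffle=True,      # byte-shuffle filter
                                fletcher32=True)   # per-chunk checksum
        # Read the filter settings back from the dataset object.
        settings = (dset.compression, dset.shuffle, dset.fletcher32)
        # Both filters are lossless, so the values should be unchanged.
        roundtrip_ok = bool(np.array_equal(dset[...], data))
    return settings, roundtrip_ok


if __name__ == "__main__":
    for codec in ("gzip", "lzf"):
        print(codec, write_and_inspect(codec))
```

As far as I understand from the h5py documentation, shuffle only rearranges bytes to help the compressor, while fletcher32 adds a checksum to each chunk so that corruption is detected on read. If that is right, the two serve different purposes, and the open question is whether combining them has any downside.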

Resources:

  1. http://docs.h5py.org/en/latest/high/dataset.html#dataset-compression
  2. http://docs.h5py.org/en/latest/high/group.html#Group.create_dataset
  3. Python and HDF5: Book by Andrew Collette - Filters and Compression
  4. This answer to this stackoverflow question

List of all available filters: https://portal.hdfgroup.org/display/support/Filters

CypherX
  • CypherX, you have certainly done your research! I inherit HDF5 files created by another application, so optimization hasn't been an issue (yet), and I mostly use PyTables. PyTables has an interesting discussion here: [Optimization tips](https://www.pytables.org/usersguide/optimization.html). The **HDF Group** has 2 blogs that might help: [Performance Tuning](https://www.hdfgroup.org/2017/05/hdf5-data-compression-demystified-2-performance-tuning/). It links to several additional references. Good luck. – kcw78 May 25 '19 at 16:38
  • @kcw78: Thank you. Those links helped somewhat, though I will need to take a deeper dive. The good thing is that, until now, I thought PyTables could only be used for tabular data. Thanks to your post, I checked and learned that [**PyTables could be used for working with multidimensional arrays**](https://stackoverflow.com/questions/8843062/python-how-to-store-a-numpy-multidimensional-array-in-pytables) as well. – CypherX May 27 '19 at 00:44
  • Yes, PyTables supports multidimensional arrays. The info in that thread is "a little dated". Use an EArray (Extendable Array) if you need to add data to the array after initial creation. New rows can be added to the end of an enlargeable array by using the `EArray.append()` method. – kcw78 May 27 '19 at 16:02

0 Answers