
I'm passing thousands of .csv files containing time and amplitude into a .hdf5 file. As an example, I used a small number of .csv files corresponding to a total of ~11MB.

After writing all the .csv files to HDF5, the resulting file is ~36MB (without using compression="gzip").

By using compression="gzip", the file size is around 38MB.

I understand that HDF5 compresses only the datasets, that is, the numpy arrays in my case (~500 rows of floats each).
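
To see where the extra space goes, the storage actually used by every dataset can be compared with the size of the file on disk; the difference is metadata, chunk indexing and unused chunk space. A minimal sketch (the helper name and the use of h5py's low-level get_storage_size() are illustrative, not part of my script):

import os
import h5py

def report_storage(hdf5_filename):
    # Print the bytes actually allocated for each dataset and compare the
    # sum with the total file size on disk.
    total = 0
    with h5py.File(hdf5_filename, 'r') as f:
        def visitor(name, obj):
            nonlocal total
            if isinstance(obj, h5py.Dataset):
                size = obj.id.get_storage_size()  # allocated storage for this dataset
                total += size
                print(f"{name}: {size} bytes")
        f.visititems(visitor)
    print(f"sum of dataset storage: {total} bytes")
    print(f"file size on disk:      {os.path.getsize(hdf5_filename)} bytes")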

For comparison, I was previously saving all the .csv data to a .json file, compressing it and then reading it back. I chose hdf5 due to memory issues, since the json file is loaded entirely into memory with a footprint 2x to Xx times larger than the file size.

This is how I add a new dataset to a .hdf5 file.

import h5py

def hdf5_dump_dataset(hdf5_filename, hdf5_data, dsetname):
    # Open (or create) the file in append mode and write one gzip-compressed,
    # chunked, resizable 1-D dataset.
    with h5py.File(hdf5_filename, 'a') as f:
        f.create_dataset(dsetname, data=hdf5_data, compression="gzip",
                         chunks=True, maxshape=(None,))
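
Since the datasets are created with maxshape=(None,), an existing dataset can also be grown in place instead of creating a new one per call. A sketch of such an append (the function name is illustrative):

def hdf5_append_to_dataset(hdf5_filename, hdf5_data, dsetname):
    # Grow an existing resizable dataset along its unlimited axis and
    # write the new values at the end.
    with h5py.File(hdf5_filename, 'a') as f:
        dset = f[dsetname]
        old_size = dset.shape[0]
        dset.resize((old_size + len(hdf5_data),))
        dset[old_size:] = hdf5_data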

This is how I read a dataset from a .hdf5 file.

def hdf5_load_dataset(hdf5_filename, dsetname):
    # Open the file read-only and load the whole dataset into memory
    # as a numpy array.
    with h5py.File(hdf5_filename, 'r') as f:
        dset = f[dsetname]
        return dset[...]
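
For example, reading one waveform back by its path inside the file (the file name is illustrative; the paths follow the layout shown further down):

amplitude = hdf5_load_dataset('waveforms.h5', 'folder_1/file_1/amplitude')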

The folder structure with the .csv files:

root/
    folder_1/
        file_1.csv
        file_X.csv  
    folder_X/
        file_1.csv
        file_X.csv  

Inside each .csv file:

time, amplitude
1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
1.003e-08, 2.521e-05

Script to save the .csv contents in an HDF5 file:

# csv_dict is a dict() with all folders and csv files as keys
# ex. csv_dict['folder_1']['file_1']  (without the .csv extension)

for folder in csv_dict:
    for file in csv_dict[folder]:
        path_waveform = f"{folder}/{file}.csv"
        time, amplitude = self.read_csv_return_list_of_time_amplitude(path_waveform)

        hdf5_dump_dataset(path_hdf5_waveforms, amplitude, '/'.join([folder, file, 'amplitude']))

        hdf5_dump_dataset(path_hdf5_waveforms, time, '/'.join([folder, file, 'time']))
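
The method self.read_csv_return_list_of_time_amplitude is not shown; a minimal sketch of what it could look like with numpy's genfromtxt, assuming the two-column format above:

import numpy as np

def read_csv_return_list_of_time_amplitude(path_csv):
    # Read one "time, amplitude" CSV (skipping the header line) and
    # return the two columns as 1-D float arrays.
    data = np.genfromtxt(path_csv, delimiter=',', skip_header=1)
    return data[:, 0], data[:, 1]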

For each .csv file in each folder I have one dataset for the time and one for the amplitude. The structure of the HDF5 file is like this:

folder_1/file_1/time
folder_1/file_1/amplitude

where

time = np.array([1.000e-08, 1.001e-08, 1.003e-08, ...])  # 500 items
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05, ...])  # 500 items
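
To confirm that layout, the names stored in the file can be listed (a quick check, assuming the file written by the script above):

with h5py.File(path_hdf5_waveforms, 'r') as f:
    f.visit(print)  # prints every group/dataset path, e.g. folder_1/file_1/time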

My question is: Is there a way to compress the whole hdf5 file?

  • That is strange. The gzip compression flag should be sufficient. I can't replicate your example without your data. I created 1.7MB of data (10 folders x 10 files, 1000 rows) and imported it in a similar way. My H5 file is slightly smaller (1.55MB). I read the CSV data with `genfromtxt()`, then accessed the data by field name (arr['time'] and arr['amplitude']) to write to H5 (used your functions). Note, you can store both time and amplitude for each CSV as 1 dataset. Seems simpler to me. Also, you can use `glob()` to programmatically access the folder/file names in lieu of creating dictionaries. – kcw78 Jan 30 '20 at 16:57
  • Hi @kcw78. I read [this](https://stackoverflow.com/questions/32994766/compressed-files-bigger-in-h5py) regarding the compression. One of my folders with data is ~800MB. The hdf5 file containing that data is ~1.1GB, and the compressed one is almost the same size. I updated the question with a data example; it's a simple numpy array of numbers. I stored time and amplitude separately since I sometimes don't use the `time`. – Raphael Jan 30 '20 at 17:46
  • Thanks for sharing the link. Compression efficiency involves several items: algorithm, chunking, number and size of datasets (and metadata overhead). If you combine time & amplitude as one dataset, it will reduce metadata. Then you can retrieve using the field name `h5file['group/dataset']['time']` or `h5file['group/dataset']['amplitude']` – kcw78 Jan 30 '20 at 20:33
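
For reference, a sketch of kcw78's suggestion to store both columns of one CSV as a single compound dataset and read a field back by name (the file and dataset names are illustrative; time and amplitude are the arrays shown above):

import numpy as np
import h5py

# Build one structured array per CSV, with named fields.
waveform = np.empty(len(time), dtype=[('time', 'f8'), ('amplitude', 'f8')])
waveform['time'] = time
waveform['amplitude'] = amplitude

with h5py.File('waveforms.h5', 'a') as f:
    f.create_dataset('folder_1/file_1/waveform', data=waveform, compression="gzip")

# Later, read just the column that is needed:
with h5py.File('waveforms.h5', 'r') as f:
    t = f['folder_1/file_1/waveform']['time']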
