
Using h5py to create an HDF5 file with many datasets, I encounter a massive speed drop after about 2.88 million datasets. What is the reason for this?

I assume that the limit of the tree structure for the datasets is reached, so the tree has to be reordered, which is very time-consuming.

Here is a short example:

import h5py
import time

hdf5_file = h5py.File("C://TEMP//test.hdf5", "w")

barrier = 1
start = time.perf_counter()  # time.clock() was removed in Python 3.8
for i in range(int(1e8)):
    hdf5_file.create_dataset(str(i), [])  # one empty scalar dataset per key
    td = time.perf_counter() - start
    if td > barrier:
        print("{}: {}".format(int(td), i))
        barrier = int(td) + 1

    if td > 600:  # cancel after 600 s
        break

[Plot: time measurement for key creation]

Edit:

By grouping the datasets, this limitation can be avoided:

import h5py
import time

max_n_keys = int(1e7)
max_n_group = int(1e5)

hdf5_file = h5py.File("C://TEMP//test.hdf5", "w")
group_key = str(max_n_group)
hdf5_file.create_group(group_key)

barrier = 1
start = time.perf_counter()
for i in range(max_n_keys):

    # start a fresh group every 1e5 datasets
    if i > max_n_group:
        max_n_group += int(1e5)
        group_key = str(max_n_group)
        hdf5_file.create_group(group_key)

    hdf5_file[group_key].create_dataset(str(i), data=[])
    td = time.perf_counter() - start
    if td > barrier:
        print("{}: {}".format(int(td), i))
        barrier = int(td) + 1

[Plot: time measurement for key creation with grouping]
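
The group-switching logic above can also be written as a pure function of the index, which makes the bucket size explicit. Here is a minimal sketch of the same idea (the helper name is mine, not from the original code):

import h5py

GROUP_SIZE = int(1e5)  # datasets per group, matching the code above

def group_key_for(i):
    # all indices in the same bucket of GROUP_SIZE share one group
    return str((i // GROUP_SIZE) * GROUP_SIZE)

with h5py.File("C://TEMP//test_grouped.hdf5", "w") as hdf5_file:
    for i in range(int(1e6)):
        group = hdf5_file.require_group(group_key_for(i))  # created on first use
        group.create_dataset(str(i), data=[])

Keeping each group to a bounded number of links should keep the per-group link index small, so insertions never reach the regime where the single flat namespace became slow.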

setzberg
Comment: Since you did plot a curve of the processing time, maybe you can add it to the question. Also, what's the use case for having several million datasets in a single file? Are you sure you don't want a single dataset with millions of rows? – Djizeus Feb 11 '16 at 10:22
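
As an aside, the single-dataset layout suggested in the comment could look like the sketch below; this assumes the payload can be expressed as fixed-length rows (the names, shape, and dtype are illustrative, not taken from the question):

import h5py
import numpy as np

with h5py.File("C://TEMP//test_single.hdf5", "w") as f:
    # one chunked, resizable dataset instead of millions of tiny ones
    dset = f.create_dataset("data", shape=(0, 3), maxshape=(None, 3),
                            dtype="f8", chunks=True)
    block = np.random.rand(10000, 3)  # placeholder payload
    dset.resize(dset.shape[0] + block.shape[0], axis=0)
    dset[-block.shape[0]:, :] = block

Appending in blocks rather than one row at a time keeps the number of resize calls low.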

1 Answer


Following the HDF5 Group's documentation on metadata caching, I was able to push the limit at which performance drops drastically. Basically, I called H5Fset_mdc_config() (in C/C++; I don't know how to access the equivalent HDF5 function from Python) and changed the max_size value of the config parameter to 128*1024*124.

Doing so, I was able to create 4 times more datasets.
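
For Python users: h5py's low-level bindings appear to expose the same knob through the file identifier. A minimal, untested sketch, assuming your h5py version provides get_mdc_config()/set_mdc_config() on the FileID object:

import h5py

hdf5_file = h5py.File("C://TEMP//test.hdf5", "w")

# fetch the current metadata cache configuration from the low-level file id
config = hdf5_file.id.get_mdc_config()
config.max_size = 128 * 1024 * 124  # the value used in this answer
hdf5_file.id.set_mdc_config(config)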

Hope it helps.

Joël Conraud