
I want to apply a simple function to the datasets contained in an HDF5 file. I am using code similar to this:

import h5py
data_sums = []

with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        print name
        # data_sums.append(data.sum(1))
        data[()]  # My goal is similar to the line above but this line is enough
                  # to replicate the problem

It runs very fast at the beginning, and after a certain number of datasets (reproducible to some extent) it slows down dramatically. If I comment out the last line, it finishes almost instantly. It does not matter whether the data are stored (here, appended to a list) or not: something like data[:100] has a similar effect. The number of datasets that can be processed before the drop in performance depends on the size of the portion that is accessed at each iteration. Iterating over smaller chunks does not solve the issue.
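A minimal way to quantify the drop, assuming the same file layout and the input_file variable from the snippet above, is to time each access (just a sketch, not my original script):

import time
import h5py

with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        t0 = time.time()
        data[()]  # same full read as above
        print name, "%.3f s" % (time.time() - t0)

The per-dataset time makes the point where the slowdown starts easy to spot.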

I suppose I am filling up some memory space and that the process slows down when it is full, but I do not understand why.
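One way to check that hypothesis on Linux is to print the process's peak resident set size inside the loop; a sketch using the standard resource module (on Linux, ru_maxrss is reported in kilobytes):

import resource
import h5py

def peak_rss_kb():
    # peak resident set size of the current process, in KB on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        data[()]
        print name, peak_rss_kb()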

How can I circumvent this performance issue?

I run Python 2.6.5 on Ubuntu 10.04.

Edit: The following code does not slow down if the second line of the loop is uncommented. It does slow down without it:

import h5py
import numpy as np

f = h5py.File(path_to_file, "r")  # path_to_file: path to the HDF5 file
list_name = f["data"].keys()
f.close()

for name in list_name:
    f = h5py.File(path_to_file, "r")
    # name = list_name[0]  # with this line the issue vanishes.
    data = f["data"][name]
    tag = get_tag(name)  # get_tag: my own helper
    data[:, 1].sum()
    print "."
    f.close()

Edit: I found out that accessing the first dimension of multidimensional datasets seems to run without issues. The problem occurs when higher dimensions are involved.
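Building on that observation, a possible workaround (just a sketch, assuming 2-D datasets; path_to_file and block_size are placeholders) is to slice only along the first dimension and do the column access on the in-memory block:

import h5py

block_size = 10000  # arbitrary; tune to the dataset

with h5py.File(path_to_file, "r") as f:
    for name in f["data"].keys():
        dset = f["data"][name]
        total = 0.0
        for start in xrange(0, dset.shape[0], block_size):
            block = dset[start:start + block_size]  # first-dimension slice only
            total += block[:, 1].sum()              # column access done in memory, in NumPy
        print name, total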

M. Toya

1 Answer


Which platform are you on?

On 64-bit Windows with Python 2.6.6, I have seen some weird issues when crossing a 2 GB barrier (I think) if the memory has been allocated in small chunks.

You can see it with a script like this:

ix = []
for i in xrange(20000000):
    if i % 100000 == 0:
        print i
    ix.append('*' * 1000)  # 20 million ~1 KB strings: roughly 20 GB of string data plus object overhead

You can see that it runs pretty fast for a while and then suddenly slows down.
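To see exactly where it kicks in, the same test can be timed per batch of appends (a sketch):

import time

ix = []
t0 = time.time()
for i in xrange(20000000):
    if i % 100000 == 0:
        print i, "%.2f s" % (time.time() - t0)  # time spent on the previous 100,000 appends
        t0 = time.time()
    ix.append('*' * 1000)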

But if you run it in larger blocks:

ix = []
for i in xrange(20000):
    if i % 1000 == 0:  # print progress every 1,000 iterations
        print i
    ix.append('*' * 1000000)  # 20,000 ~1 MB strings: again roughly 20 GB of string data

it doesn't seem to have the problem (though it will eventually run out of memory, depending on how much RAM you have; 8 GB here).

Weirder yet, if you eat the memory using large blocks, then clear it (ix = [] again, so back to almost no memory in use), and then re-run the small-block test, it isn't slow anymore.
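In other words, the reproduction sequence is roughly the following (a sketch; the count of 4000 large blocks, about 4 GB, is an arbitrary choice so the first step fits in memory):

ix = []
for i in xrange(4000):        # eat a few GB in large blocks (count is arbitrary)
    ix.append('*' * 1000000)

ix = []                       # drop the references; memory use falls back to almost nothing

for i in xrange(20000000):
    ix.append('*' * 1000)     # the small-block test is reportedly no longer slow at this point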

I think there was some dependence on the pyreadline version; 2.0-dev1 helped a lot with these sorts of issues, but I don't remember much. When I tried it just now, I don't really see this issue anymore: both slow down significantly around 4.8 GB, which, with everything else I have running, is about where it hits the limit of physical memory and starts swapping.

Corley Brigman
  • I run Python 2.6.5 on Ubuntu 10.04. Do you suggest I update my Python distribution? – M. Toya Sep 23 '13 at 15:05
  • I have a similar performance issue even if I don't store the data in a list. Accessing them is enough, and memory usage stays at a low percentage all along. – M. Toya Sep 23 '13 at 15:26
  • Well, 2.6.6 has some security fixes that 2.6.5 doesn't, but it shouldn't make any difference otherwise. Is it possible that you just have varying lengths of data? I.e. it would be useful to run it, print out len(data), and see whether the lengths change dramatically... – Corley Brigman Oct 01 '13 at 13:18