Efficient calculation on complete columns (pytables, hdf5, numpy)

Question

I have a simple HDF5 file (created by PyTables) with ten columns and 100000 rows. To every value I have to apply a simple linear equation, with different parameters per column, and write the results to CSV.

My naive approach was to loop over the table:

for row in table.iterrows():
    print "%f,%f,..." % (row['a'] * 1.0 + 2.0, row['b'] * 3.0 + 4.0, ...)

But I wonder whether it would be more efficient to select whole columns, do the calculation on each column at once, and then iterate over the resulting arrays:

a = numpy.add(numpy.multiply(table.cols.a, 1.0), 2.0)
b = numpy.add(numpy.multiply(table.cols.b, 3.0), 4.0)

But this is even slower, it seems.

What is the best way to do this?

score 1 · Accepted Answer · answered Sep 01 '14 at 18:18

Your performance is likely to be limited by writing the CSV, but other than that, this problem is exactly what numexpr was made for.
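As a rough sketch of what that looks like with the standalone numexpr package: the arrays below are made-up stand-ins for the table columns, and it is an assumption on my part that each column is first read into memory as a NumPy array (for example with table.cols.a[:]) rather than passed in as a PyTables column object.

```python
import numpy as np
import numexpr as ne

# Stand-in data (assumption): in the question these would be read from the
# table, e.g. a = table.cols.a[:], which loads the whole column into memory.
a = np.arange(100000, dtype=np.float64)
b = np.linspace(0.0, 1.0, 100000)

# numexpr compiles the expression string once and evaluates it over the
# array in blocks, without building a temporary for each intermediate step.
a_out = ne.evaluate("a * 1.0 + 2.0", local_dict={"a": a})
b_out = ne.evaluate("b * 3.0 + 4.0", local_dict={"b": b})
```

For a linear equation this simple, the gain over plain NumPy on in-memory arrays may be small; the main point is that the work happens per column rather than per row.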

Instead of iterating over the result and writing to CSV directly, you could use the Expr.set_output method to write your result back to HDF5, and then look for a more efficient way of converting that result column to CSV in a single optimized call. Alternatively, find a way to do away with the CSV in the first place: it does not make much sense to use it if performance is indeed a major concern.
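A minimal sketch of that idea, under stated assumptions: the file names, the two-column table, and the output array names are all hypothetical, and numpy.savetxt is just one possible way to do the final CSV conversion in a single call (I have not benchmarked it).

```python
import os
import tempfile

import numpy as np
import tables as tb

workdir = tempfile.mkdtemp()
h5_path = os.path.join(workdir, "demo.h5")    # hypothetical file name
csv_path = os.path.join(workdir, "demo.csv")  # hypothetical file name

# Hypothetical stand-in for the asker's table: two float64 columns.
data = np.zeros(1000, dtype=[("a", "f8"), ("b", "f8")])
data["a"] = np.arange(1000)
data["b"] = np.arange(1000) * 0.5

# One linear expression per column, with the parameters from the question.
expressions = {"a": "x * 1.0 + 2.0", "b": "x * 3.0 + 4.0"}

with tb.open_file(h5_path, mode="w") as h5:
    table = h5.create_table("/", "table", description=data.dtype)
    table.append(data)
    table.flush()

    results = []
    for name, expr_str in expressions.items():
        # On-disk array that receives the computed values for this column.
        out = h5.create_carray("/", name + "_out",
                               atom=tb.Float64Atom(), shape=(table.nrows,))
        expr = tb.Expr(expr_str, uservars={"x": getattr(table.cols, name)})
        expr.set_output(out)   # write the result to HDF5, not to memory
        expr.eval()
        results.append(out[:])  # read the finished column back in one go

# Convert all result columns to CSV in a single call.
result = np.column_stack(results)
np.savetxt(csv_path, result, delimiter=",", fmt="%f")
```

If the CSV is not strictly required, the loop can stop after expr.eval() and the results simply stay in the HDF5 file.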

Many thanks! As a PyTables newbie I had never heard of this feature before. In a quick check, this is about one third faster (~400ms) than using iterrows (~600ms)! Cool! (The numpy approach is so slow that I interrupted it after a few minutes.) — , Sep 02 '14 at 15:22
