What is the most memory efficient way to combine read_sorted and Expr in pytables?

Question

I am looking for the most memory-efficient way to read a PyTables table (columns: x, y, z) in sorted order (the z column has a CSI) while evaluating an expression like

x+a*y+b*z

where a and b are constants. So far my only solution has been to copy the entire table with the sortby='z' option and then evaluate the expression piecewise on the copy.

Note: I want to keep the result of x+a*y+b*z in memory so that I can apply some reduction operations that are not available directly in PyTables, and then save it into a new PyTables table.

score 2 · Accepted Answer · answered Feb 16 '14 at 10:40

There are two basic options, depending on whether you need to iterate in sorted order.

If you need to iterate over the table in sorted order, then reading the rows will be much more expensive than computing the expression. In that case, read the rows efficiently with Table.read_sorted() and compute the expression in a list comprehension or similar:

# Use a new name for the result so the constant `a` is not overwritten.
res = [row['x'] + a*row['y'] + b*row['z']
       for row in tab.read_sorted('z', checkCSI=True)]

If you don't need to iterate in sorted order (and it doesn't look like you do), set up and evaluate the expression with the Expr class, read the sorting indices from the column's CSI, and apply them to the expression results. This would look something like:

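import tables as tb

x = tab.cols.x
y = tab.cols.y
z = tab.cols.z
# Expr resolves x, y, z, a and b from the calling namespace.
expr = tb.Expr('x+a*y+b*z')
unsorted_res = expr.eval()
# Row permutation stored in the index of z (assumed to be a CSI).
idx = z.index.read_indices()
sorted_res = unsorted_res[idx]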
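
To see why the second option gives the same values as iterating in sorted order, here is a minimal sketch in plain Python. It does not use PyTables at all: the lists are made-up data standing in for the table columns, and `idx` is built with `sorted()` as a stand-in for the indices read from the CSI.

```python
# Hypothetical data: plain lists stand in for the columns x, y, z.
x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.5, 2.5, 3.5]
z = [30.0, 10.0, 40.0, 20.0]
a, b = 2.0, 0.1

# Step 1: evaluate the expression in storage (unsorted) order.
unsorted_res = [xi + a * yi + b * zi for xi, yi, zi in zip(x, y, z)]

# Step 2: the permutation that sorts z. In the answer above this role is
# played by the indices read from the CSI; here it is computed directly.
idx = sorted(range(len(z)), key=z.__getitem__)

# Step 3: reorder the results with that permutation.
sorted_res = [unsorted_res[i] for i in idx]

# Reference: sort the rows by z first, then evaluate the expression.
rows_by_z = sorted(zip(x, y, z), key=lambda row: row[2])
reference = [xi + a * yi + b * zi for xi, yi, zi in rows_by_z]
```

Both routes produce the same list, so the expression can be evaluated once over the unsorted columns and only the result needs to be reordered. With NumPy arrays, step 3 becomes a single fancy-indexing operation.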

Anthony Scopatz


Thank you for this answer, I had not thought about using the read_indices method. – Ben K. Feb 17 '14 at 09:53
Reading in the sorted values does seem to be the most expensive part of my operation. Do you have any advice on how to optimize this part, for example with small chunk sizes or different compression ratios? – Ben K. Feb 17 '14 at 10:31
There are lots of tricks, but performance always comes down to experimenting with your dataset. See http://pytables.github.io/usersguide/optimization.html for more info. – Anthony Scopatz Feb 18 '14 at 05:57
I guess it is not possible to perform tests on a smaller data set? Finally, I want to thank you again for your answers and the great work on the PyTables package. In a matter of a month or so, it has helped me build a usable app from scratch for analysing several GB of data. – Ben K. Feb 18 '14 at 08:57
Awesome! Thanks for using pytables :) – Anthony Scopatz Feb 18 '14 at 15:07
