1

I'm looking for the PyTables counterpart of a SQL query in the form of...

SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)

...where the condition values are stored in a Python list/tuple or numpy array of arbitrary length. The search values are determined during runtime of the application by the user.

I've tried an approach using PyTables Conditions related with in-kernel search:

options = {
            'val' : 'someval1'
            'val' : 'someval2'
           }
for row in table.where('col1 == val', options):
    print row['col2']

but it's only picking the first value.

My main goal is to build the options dictionary or list dynamically and pass it to Pytables.

THM
  • 745
  • 2
  • 8
  • 14

2 Answers2

1

This is because your options dictionary only has one element in it. The right way to do this is something like:

for row in table.where('col1 == val || col1 == val2'):
    print row['col2']
Anthony Scopatz
  • 3,265
  • 2
  • 15
  • 14
  • If this is the only possibility to select "many" values in columns with the "where" clause, then conditions seem not to be the right tool for this task. I'm experimenting currently with the numpy np.where clause directly on the table object and will post the resiults as soon as i'm done. – THM Apr 17 '14 at 22:04
  • This is limitation numexpr. I suggest that you bring it up with them. This is not the only tool for the job, of course. You could read everything into memory and then use whatever selection semantics you want. But that is less efficient because you are reading much more data from disk. – Anthony Scopatz Apr 17 '14 at 23:03
0

I think Anthony had a typo in the syntax for query. This should work:

In [1]: import numpy as np
In [2]: import tables as tb
In [3]: ra = np.fromiter(((i, i*2, i*3) for i in xrange(1000000)), dtype='i4,f4,f8')
In [4]: f = tb.open_file("mytable.h5", "w")
In [5]: t = f.create_table(f.root, "mytable", ra)
In [6]: cond = "|".join(["(f1 == %d)"%i for i in (2, 4, 10)])
In [7]: %timeit [r['f0'] for r in t.where(cond)]
100 loops, best of 3: 9.14 ms per loop

This speed is very close to my best attempt in numpy:

In [8]: %timeit np.where((ra['f1'] == 2)|(ra['f1'] == 4)|(ra['f1'] == 10))[0]
100 loops, best of 3: 8.32 ms per loop

And also, if you want to make use of the 'in' operator:

In [9]: %timeit [r['f0'] for r in t if r['f1'] in (2, 4, 10)]
10 loops, best of 3: 160 ms per loop

but in this case speed sucks.

Francesc
  • 366
  • 2
  • 3