What is the best way to query a pytable column with many values?

Question

I have a PyTables table with 11 columns and 13,470,621 rows. The first column contains a unique identifier for each row (each identifier appears exactly once in the table).

This is how I select rows from the table at the moment:

my_annotations_table = h5r.root.annotations

# Loop through table and get rows that match gene identifiers (column labeled gene_id).
for record in my_annotations_table.where("(gene_id == b'gene_id_36624' ) | (gene_id == b'gene_id_14701' ) | (gene_id == b'gene_id_14702')"):
    # Do something with the data.
    pass

This works fine with small datasets, but I will routinely need to run queries that match many thousands of unique identifiers against the table's gene_id column. For these larger queries, the query string quickly gets very long and I get an exception:

  File "/path/to/my/software/python/python-3.9.0/lib/python3.9/site-packages/tables/table.py", line 1189, in _required_expr_vars
    cexpr = compile(expression, '<string>', 'eval')
RecursionError: maximum recursion depth exceeded during compilation

I've looked at this question (What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?), which is somewhat similar to mine, but it was not satisfactory.

I come from an R background where we often do these kinds of queries (e.g. my_data_frame[my_data_frame$gene_id %in% c("gene_id_1234", "gene_id_1235"),]) and was wondering if there is a comparable solution that I could use with PyTables.

Thanks very much,

score 1 · Answer 1 · answered Nov 17 '22 at 00:55

Another approach to consider is combining two methods: Table.get_where_list() and Table.read_coordinates().

Table.get_where_list(): gets the coordinates of the rows fulfilling the given condition.
Table.read_coordinates(): gets a set of rows given their coordinates (in a list), and returns them as a (record) array.

The code would look something like this:

my_annotations_table = h5r.root.annotations
gene_name_list = ['gene_id_36624', 'gene_id_14701', 'gene_id_14702']
# Loop through gene names and collect the coordinates of rows whose gene_id matches
gene_row_list = []
for gene_name in gene_name_list:
    # Pass the value explicitly via condvars; the question compares gene_id
    # against bytes literals, so the str is encoded to bytes here.
    gene_rows = my_annotations_table.get_where_list(
        "gene_id == name", condvars={"name": gene_name.encode()})
    gene_row_list.extend(gene_rows)

# Retrieve all of the data in one call
gene_data_arr = my_annotations_table.read_coordinates(gene_row_list)

kcw78

Thanks, these two functions are useful. I tried to implement them and it worked fine. I also found that querying in chunks of 31 gene_ids (instead of one by one) improved the speed. I'm also wondering whether building a simple key-value dictionary beforehand, mapping gene_ids to table row numbers, and then looking up row numbers in that dictionary would speed things up, since it would bypass the get_where_list() operation. My hypothesis is that querying this dictionary would be faster than querying the table. – julio514 Nov 18 '22 at 19:15
Good question. I suspect a dictionary of gene_ids: row numbers will be much faster than repeatedly querying the table. How are you going to create the dictionary? And where will you store it? – kcw78 Nov 18 '22 at 19:36
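
A rough sketch of the dictionary idea discussed in these comments, not something either commenter posted. It assumes the whole gene_id column fits in memory and that each identifier occurs exactly once, as stated in the question. The helper names build_row_index, rows_for and read_genes are made up for illustration; Table.col() and Table.read_coordinates() are the only PyTables calls involved.

```python
def build_row_index(gene_ids):
    """Map each gene_id to its row number (its position in the column)."""
    return {gene_id: row for row, gene_id in enumerate(gene_ids)}


def rows_for(index, wanted):
    """Return the sorted row numbers of the wanted ids, skipping unknown ids."""
    return sorted(index[g] for g in wanted if g in index)


def read_genes(table, index, wanted):
    """Hypothetical helper: fetch all matching rows with a single read."""
    return table.read_coordinates(rows_for(index, wanted))


# Usage with a real table (assumption: the column is read once and reused):
#   gene_ids = my_annotations_table.col("gene_id").tolist()
#   index = build_row_index(gene_ids)
#   rows = read_genes(my_annotations_table, index, [b"gene_id_36624"])
# Keys are bytes for string columns, or ints if the ids were stored as integers.
```

Building the dictionary costs one full pass over the column, so this only pays off when the same index is reused for many lookups. Where to persist it between runs is left open, as in the comment above.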

julio514 · Answer 2 · 2022-11-16T20:45:53.607

Okay, I managed to make some satisfactory improvements on this.

1st: optimize the table (with the help of the documentation - https://www.pytables.org/usersguide/optimization.html)

Create the table. Make sure to specify the expectedrows=<int> argument, as it has the potential to increase query speed.

table = h5w.create_table("/", 'annotations',
    DataDescr, "Annotation table unindexed",
    expectedrows=self._number_of_genes,
    filters=tb.Filters(complevel=9, complib='blosc')
)  # tb comes from: import tables as tb

I also modified the input data so that the gene_id_12345 fields are simple integers (gene_id_12345 becomes 12345). Once the table was populated with its 13,470,621 entries (i.e. rows), I created a completely sorted index on the gene_id column (Column.create_csindex()) and made a copy of the table sorted by that column.

table.cols.gene_id.create_csindex()
table.copy(overwrite=True, sortby='gene_id', newname="Annotation table", checkCSI=True)
# Just make sure that the index is usable. Prints an empty frozenset if no index would be used.
print(table.will_query_use_indexing('(gene_id == 57403)'))

2nd: The table is optimized, but I still can't query thousands of gene_ids at a time. So I simply split them into chunks of 31 gene_ids (yes, 31 was the absolute maximum; 32 was apparently too many).

I did not perform benchmarks, but querying ~8000 gene_ids now takes approximately 10 seconds which is acceptable for my needs.
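
The original answer does not show the chunking code, so here is a minimal sketch of how it could look, assuming integer gene_ids as described above. The function names chunks, build_condition and query_in_chunks are hypothetical, and the default of 31 is simply the limit reported in this answer, not a documented PyTables constant.

```python
def chunks(values, size=31):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def build_condition(gene_ids, column="gene_id"):
    """Build an OR-ed condition string for integer identifiers."""
    return " | ".join(f"({column} == {int(g)})" for g in gene_ids)


def query_in_chunks(table, gene_ids, size=31):
    """Collect matching row coordinates chunk by chunk, then read them once."""
    coords = []
    for chunk in chunks(list(gene_ids), size):
        # One short condition per chunk keeps the expression small enough to compile.
        coords.extend(table.get_where_list(build_condition(chunk)))
    return table.read_coordinates(sorted(coords))


# Usage with a real table (assumption: `table` is the indexed table from above):
#   rows = query_in_chunks(table, [57403, 57404, 57405])
```

Each chunk produces a separate get_where_list() call, so roughly 8000 identifiers translate into about 260 small queries instead of one oversized expression.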
