I have a Python program that uses PyTables and queries a table in this simple manner:
def get_element(table, somevar):
    rows = table.where("colname == somevar")
    row = next(rows, None)
    if row:
        return elem_from_row(row)
To reduce the query time, I decided to try sorting the table with table.copy(sortby='colname'). This did improve the query time (the time spent in where), but it increased the time spent in the next() built-in function by several orders of magnitude! What could be the reason?
This slowdown occurs only when there is another column in the table, and the slowdown increases with the element size of that other column.
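A minimal sketch of how the two costs (building the query iterator versus fetching its first row) can be measured separately. The helper name time_first_result and its make_iterator argument are hypothetical and not part of PyTables; the sketch only assumes that the callable returns an iterator, as table.where(...) does. A plain generator stands in for the query so the sketch runs without a table:

```python
import time


def time_first_result(make_iterator):
    """Time iterator creation and the first next() call separately.

    make_iterator is a hypothetical zero-argument callable, for example
    lambda: table.where("colname == somevar").
    Returns (first_item_or_None, seconds_creating, seconds_in_next).
    """
    t0 = time.time()
    rows = make_iterator()      # e.g. the table.where(...) call
    t1 = time.time()
    first = next(rows, None)    # fetch the first matching row, if any
    t2 = time.time()
    return first, t1 - t0, t2 - t1


if __name__ == "__main__":
    # Stand-in for a query: a generator that yields the single match.
    first, t_create, t_next = time_first_result(
        lambda: (i for i in range(500) if i == 42))
    print("first=%r create=%.6fs next=%.6fs" % (first, t_create, t_next))
```

With a real table, passing lambda: table.where('id == i') in place of the generator would give one pair of timings per lookup, which can then be summed over all queries.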
To help me understand the problem and make sure it was not related to something else in my program, I wrote this minimal working example that reproduces it:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tables
import time
import sys


def create_set(sort, withdata):
    #Table description with or without data
    tabledesc = {
        'id': tables.UIntCol()
    }
    if withdata:
        tabledesc['data'] = tables.Float32Col(2000)

    #Create table with CSI'ed id
    fp = tables.open_file('tmp.h5', mode='w')
    table = fp.create_table('/', 'myset', tabledesc)
    table.cols.id.create_csindex()

    #Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()

    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')

    fp.flush()
    return fp


def get_element(table, i):
    #By construction, i always exists in the table
    rows = table.where('id == i')
    row = next(rows, None)
    if row:
        return {'id': row['id']}
    return None


sort = sys.argv[1] == 'sort'
withdata = sys.argv[2] == 'withdata'
fp = create_set(sort, withdata)

start_time = time.time()
table = fp.root.myset
for i in xrange(500):
    get_element(table, i)
print("Queried the set in %.3fs" % (time.time() - start_time))
fp.close()
And here is some console output showing the figures:
$ ./timedset.py nosort nodata
Queried the set in 0.718s
$ ./timedset.py sort nodata
Queried the set in 0.003s
$ ./timedset.py nosort withdata
Queried the set in 0.597s
$ ./timedset.py sort withdata
Queried the set in 5.846s
Some notes:
- The rows are actually sorted in all cases, so it seems to be linked to the table being aware of the sort rather than just the data being sorted.
- If I read the file from disk instead of creating it, the results are the same.
- The issue occurs only when the data column is present, even though I never write to it or read from it. I noticed that the time difference increases "in stages" as the size of the column (the number of floats) increases. The slowdown must be linked to internal data movements or I/O.
- If I don't use the next function, but instead use a for row in rows loop and trust that there is only one result, the slowdown still occurs.
Accessing an element from a table by some sort of id (sorted or not) sounds like a basic feature, so I must be missing the typical way of doing it with PyTables. What is it? And why such a terrible slowdown? Is it a bug that I should report?