Why is querying a table so much slower after sorting it?

Question

I have a Python program that uses Pytables and queries a table in this simple manner:

def get_element(table, somevar):
    rows = table.where("colname == somevar")
    row = next(rows, None)
    if row:
        return elem_from_row(row)

To reduce the query time, I decided to try to sort the table with table.copy(sortby='colname'). This indeed improved the query time (spent in where), but it increased the time spent in the next() built-in function by several orders of magnitude! What could be the reason?

This slowdown occurs only when there is another column in the table, and the slowdown increases with the element size of that other column.

To help me understand the problem and make sure this was not related to something else in my program, I made this minimum working example reproducing the problem:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import tables
import time
import sys


def create_set(sort, withdata):
    #Table description with or without data
    tabledesc = {
        'id': tables.UIntCol()
    }
    if withdata:
        tabledesc['data'] = tables.Float32Col(2000)

    #Create table with CSI'ed id
    fp = tables.open_file('tmp.h5', mode='w')
    table = fp.create_table('/', 'myset', tabledesc)
    table.cols.id.create_csindex()

    #Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()

    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')
    fp.flush()
    return fp


def get_element(table, i):
    #By construction, i always exists in the table
    rows = table.where('id == i')
    row = next(rows, None)
    if row:
        return {'id': row['id']}
    return None


sort = sys.argv[1] == 'sort'
withdata = sys.argv[2] == 'withdata'
fp = create_set(sort, withdata)

start_time = time.time()
table = fp.root.myset
for i in xrange(500):
    get_element(table, i)
print("Queried the set in %.3fs" % (time.time() - start_time))
fp.close()

And here is some console output showing the figures:

$ ./timedset.py nosort nodata
Queried the set in 0.718s

$ ./timedset.py sort nodata
Queried the set in 0.003s

$ ./timedset.py nosort withdata
Queried the set in 0.597s

$ ./timedset.py sort withdata
Queried the set in 5.846s

Some notes:

The rows are actually sorted in all cases, so it seems to be linked to the table being aware of the sort rather than just the data being sorted.
If instead of creating the file, I read it from disk, same results.
The issue occurs only when the data column is present, even though I never write to it nor read it. I noticed that the time difference increases "in stages" when the size of the column (the number of floats) increases. The slowdown must be linked with internal data movements or I/O:
If I don't use the next function, but instead use a for row in rows and trust that there is only one result, the slowdown still occurs.

Accessing an element from a table by some sort of id (sorted or not) sounds like a basic feature, I must be missing the typical way of doing it with pytables. What is it? And why such a terrible slowdown? Is it a bug that I should report?

Djizeus · Accepted Answer · 2015-10-07T08:32:58.310

I finally understood what's going on.

Long story short

The root cause is a bug and it was on my side: I was not flushing the data before making the copy in case of sort. As a result, the copy was based on data that was not complete, and so was the new sorted table. This is what caused the slowdown, and flushing when appropriate led to a less surprising result:

...
#Fill the table with sorted ids
row = table.row
for i in xrange(500):
    row['id'] = i
    row.append()
fp.flush()  # <--

#Force a sort if asked for
if sort:
    newtable = table.copy(newname='sortedset', sortby='id')
    table.remove()
    newtable.rename('myset')
    fp.flush()  # <--

return fp
...

But why?

I realized my mistake when I decided to inspect and compare the structure and data of the tables "not sorted" vs "sorted". I noticed that in the sorted case, the table had less rows. The number varied seemingly randomly from 0 to about 450 depending on the size of the data column. Moreover, in the sorted table, the id of all the rows was set to 0. I guess that when creating a table, pytables initializes the columns and may or may not pre-create some of the rows with some initial value. This "may or may not" probably depends on the size of the row and the computed chunksize.

As a result, when querying the sorted table, all queries but the one with id == 0 had no result. I initially thought that raising and catching the StopIteration error was what caused the slowdown, but that would not explain why the slowdown depends on the size of the data column.

After reading some of the code from pytables (notably table.py and tableextension.pyx), I think what happens is the following: when a column is indexed, pytables will first try to use this index to fasten the search. If some matching rows are found, only these rows will be read. But if the index indicates that no row matches the query, for some reason pytables fallbacks to a "in kernel" search, which iterates over and reads all the rows. This requires reading the full rows from disk in multiple I/Os, and this is why the size of the data column mattered. Also under a certain size of that column, pytables did not "pre-create" some rows on disk, resulting in a sorted table with no row at all. This is why on the graph the search is very fast when the column size is under 525: iterating over 0 row doesn't take much time.

~~I am not clear on why the iterator fallbacks on an "in kernel" search. If the searched id is clearly out of the index bounds, I don't see any reason to search it anyway...~~ Edit: After a closer look at the code, it turns out this is because of a bug. It is present in the version I am using (3.1.1), but has been fixed in 3.2.0.

The irony

What really makes me cry is that I forgot to flush before copying only in the example of the question. In my actual program, this bug is not present! What I also did not know but found out while investigating the question is that by default pytables do not propagate indexes. This has to be required explicitly with propindexes=True. This is why the search was slower after sorting in my application...

So moral of the story:

Indexing is good: use it
But don't forget to propagate them when sorting a table
Make sure your data is on disk before reading it...

Why is querying a table so much slower after sorting it?

1 Answers1

Long story short

But why?

The irony