Scanning Bigtable by prefix with Python SDKs

Question

I'm trying to search by multiple prefixes in Google Cloud Bigtable with the Python SDK. I'm using read_rows, and I can't see a good way to search by prefix explicitly.

My first option is RowSet + RowRange. I'm testing three queries, and the times that I'm getting are ~0.19s, ~0.13s and ~0.46s.

The second option is using RowFilterChain + RowKeyRegexFilter. Performance is terrible for two of the queries: ~0.124s, ~72s, ~69s. It looks like it's doing a full scan. This is the code section:

            regex = f'^{prefix}.*'.encode()
            filters.append(RowKeyRegexFilter(regex))

My third option is using the alternative Happybase-based SDK, which has prefix filtering. With that, I'm getting ~0.4s, ~0.1s, ~0.17s. The first query involves multiple prefixes, and the SDK doesn't seem to support multiple filters in the same request, so I'm performing as many requests as there are prefixes and then concatenating the iterators. The other two seem to leverage the prefix filter.

UPDATE: I deleted the first times because there was a mistake with the environment. After doing it properly, times are not bad for the range query, but there seems to be room for improvement, as the Happybase tests are still faster when they leverage prefix search.

I would appreciate help with using multiple prefix searches in Happybase, or actual prefix search in the main Python SDK.

norbjd · Accepted Answer · 2019-05-28T10:49:45.860

The read_rows method has two parameters, start_key and end_key, that you can use to filter rows efficiently by row key (see docs). Behind the scenes, this method performs a Scan, which is probably why this is the most efficient way to filter rows based on their row keys.

For example, let's suppose you have the following row keys in your table:

a
aa
b
bb
bbb

and you want to retrieve all rows with a row key prefixed by a, you can run:

rows_with_prefix_a = my_table.read_rows(start_key="a", end_key="b")

This will only scan rows between a and b (b excluded), so this will return all rows with row key prefix a (a and aa in the previous example).
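To make the half-open range semantics concrete, here is a minimal in-memory sketch that does not touch Bigtable at all. The helper `scan_range` is hypothetical and only mimics the "start included, end excluded" behaviour described above on a sorted list of keys.

```python
from bisect import bisect_left


def scan_range(sorted_keys, start_key, end_key):
    """Return every key k with start_key <= k < end_key.

    Hypothetical helper: it imitates a half-open range scan over
    lexicographically sorted row keys, nothing more.
    """
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_left(sorted_keys, end_key)
    return sorted_keys[lo:hi]


# The example row keys from above, as bytes and already sorted.
row_keys = [b"a", b"aa", b"b", b"bb", b"bbb"]

rows_with_prefix_a = scan_range(row_keys, b"a", b"b")
print(rows_with_prefix_a)  # [b'a', b'aa']
```

Because the keys are sorted, both bounds are found by binary search and only the rows inside the range are touched, which is the intuition for why a range scan avoids reading the whole table.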

edited May 28 '19 at 10:49

answered May 28 '19 at 10:44


Yep, that works, but it's not exactly the same as I might not know the start and end keys. Anyway, it's my best alternative right now, thanks! – juanignaciosl May 28 '19 at 11:08

Any prefix can be converted to start and end keys. If your prefix is `"abcd"`, then `start_key` should be `"abcd"` (obviously!), and `end_key` should be `"abce"`, since `"e"` is the character right after `"d"`, and `end_key` is not included in the scan. All rows with prefix `"abcd"` are between `"abcd"` and `"abce"` (think about a simple dictionary). This rule works for alphanumeric row keys, but remember that in Bigtable/HBase, row keys are byte arrays, so a more generic rule is that the last byte of `end_key` should be equal to the last byte of your prefix + 1. – norbjd May 28 '19 at 11:24
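The "last byte + 1" rule from the comment can be sketched as a small helper. This is only an illustration, not part of either SDK: `prefix_end_key` and `prefix_ranges` are hypothetical names, and the handling of a trailing `0xff` byte (which cannot be incremented) is an assumption added here, since the comment does not cover that case.

```python
def prefix_end_key(prefix: bytes) -> bytes:
    """Return the smallest key that sorts after every key starting with prefix.

    Trailing 0xff bytes cannot be incremented, so they are dropped first and
    the last remaining byte is incremented instead. An empty result means
    "no upper bound" (the prefix was empty or consisted only of 0xff bytes).
    """
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        return b""
    return stripped[:-1] + bytes([stripped[-1] + 1])


def prefix_ranges(prefixes):
    """Turn several prefixes into (start_key, end_key) pairs, end excluded."""
    return [(p, prefix_end_key(p)) for p in prefixes]


print(prefix_end_key(b"abcd"))         # b'abce'
print(prefix_ranges([b"a", b"bb"]))    # [(b'a', b'b'), (b'bb', b'bc')]
```

For the multiple-prefix case in the question, each pair could then be added as one range to a single RowSet (for example with `add_row_range_from_keys`, if your version of the Python client provides it) and passed to read_rows in one request, rather than issuing one request per prefix.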
