It is commonly suggested to use a range scan via startrow and stoprow as opposed to a Rowkey Prefix Filter (for example, here). The reasoning is that a Rowkey Prefix Filter results in a full table scan of the rowkey, whereas a range scan via startrow and stoprow does not. Why doesn't it? Most people say "because the rowkey is stored in lexicographical order," which of course doesn't explain why the Rowkey Prefix Filter cannot leverage this.
At any rate, how exactly does a range scan via startrow and stoprow not result in a full table scan of the rowkey?
Take this small example in Python, which shows why I don't understand how the lexicographical ordering of the rowkeys helps avoid a full table scan:
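    rowkeys = ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3']

    def range_scan(startrow, stoprow):
        is_found = False
        # Walks the keys from the very first one, even those before startrow.
        for rowkey in rowkeys:
            if startrow <= rowkey < stoprow:
                is_found = True
                yield rowkey
            elif is_found:
                # Past the end of the range, so stop early. A bare return ends
                # the generator; raising StopIteration inside a generator is a
                # RuntimeError since Python 3.7 (PEP 479).
                return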
Clearly, the HBase algorithm differs from this. How does it?
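For comparison, here is the only way I can see sorted order helping: binary-search for the start position instead of walking up to it. This is a minimal sketch, assuming all the rowkeys sit in a single sorted in-memory list; `seek_range_scan` is a name I made up for illustration, and I am not claiming this is what HBase actually does internally.

```python
from bisect import bisect_left

rowkeys = ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3']

def seek_range_scan(sorted_keys, startrow, stoprow):
    # Binary search for the first key >= startrow, so the keys before
    # the range are never visited (O(log n) instead of a linear walk).
    start = bisect_left(sorted_keys, startrow)
    for i in range(start, len(sorted_keys)):
        if sorted_keys[i] >= stoprow:
            # Sorted order guarantees nothing later can match either.
            return
        yield sorted_keys[i]

print(list(seek_range_scan(rowkeys, 'b1', 'c1')))  # ['b1', 'b2', 'b3']
```

If something like this is the idea, then a prefix such as 'b' could apparently be served the same way with startrow 'b' and stoprow 'c', which is why I don't see what stops the Rowkey Prefix Filter from doing it.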
TLDR: How exactly does HBase avoid a full table scan when doing a range scan with startrow and stoprow?