
I have a hierarchical row-key design, where each segment is the ID of a field (we use 4-byte segments, but I will stick to two-digit segments for readability).

For example

00
0000 = child of 00
000000 = child of 0000
0001 = child of 00
000100 = child of 0001

I would like to write an HBase shell query that returns the children of a node.

Right now I have the following

scan 'tableName', STARTROW=>'00',
 FILTER=>"PrefixFilter('00') AND RowFilter(=,'regexstring:^00.{1}$')"

which gives the list of children of 00, namely 0000 0001

There is more than one question here:
1. If I remove the $ sign, the performance improves dramatically (from 2 seconds to 0.2 seconds on a local VM), but I also get additional results (000000 and 000100, which I don't need). Is there a reason for this dramatic performance decrease? (It should only be an additional filter on an already narrowed-down list.)
2. Is there a way to filter by the length of the row key? (Then I could ditch the regex and use only startrow/endrow.) This has to be done in the HBase shell. For example, something like FILTER=>"RowKeyLengthFilter(4)"
3. I cannot use word (\w) or digit (\d) classes in the regex string. Is this a limitation of the HBase shell? I also tried [[:alnum:]] and [[:digit:]] (thanks to Yunnosch for the suggestion).

version = 1.1.0.1, r4de7d45cb593f98ae5d020080cbc7116d3e9d9a0, Sun May 17 12:52:10 PDT 2015

norb
  • Give expected results. What are the children? In "1.", what is the question? In "2.", what is the "rowkey"? In "3.", what is the question? It looks like a statement. Do you know about "\w" == "[[:alnum:]_]" and "\d" == "[[:digit:]]" in many cases? – Yunnosch Mar 22 '17 at 18:01
  • Yep, thanks for the points... will edit the post – norb Mar 23 '17 at 07:49

1 Answer

In General:

  • your regex string only matches keys that are 3 characters long -> 000 or 001
    -- e.g. 'regexstring:^00.{2}$' would match 4 characters/digits -> 0000
  • is there a reason why you don't use braces, like

    scan 'tbl', {ROWPREFIXFILTER => 'row2', FILTER => "QualifierFilter(>=, 'binary:abc')"}

  • why don't you use ROWPREFIXFILTER (instead of STARTROW and PrefixFilter)?
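The length point in the first bullet can be checked outside HBase. This is a minimal sketch in plain Ruby (the hbase shell is JRuby), using the sample keys from the question. It assumes that, for patterns this simple, Ruby's regex engine behaves like the Java engine used on the server side; the variable names are only for illustration.

```ruby
# Sample row keys from the question (two-digit segments).
keys = %w[00 0000 000000 0001 000100]

# Pattern from the question: '00' plus exactly ONE more character (3 chars total).
one_char = /^00.{1}$/
# Suggested pattern: '00' plus exactly TWO more characters (4 chars total).
two_chars = /^00.{2}$/
# Same as two_chars but without the trailing anchor: longer keys match too.
unanchored = /^00.{2}/

matches_one        = keys.select { |k| k =~ one_char }
matches_two        = keys.select { |k| k =~ two_chars }
matches_unanchored = keys.select { |k| k =~ unanchored }

puts "one_char:   #{matches_one.inspect}"
puts "two_chars:  #{matches_two.inspect}"
puts "unanchored: #{matches_unanchored.inspect}"
```

With these keys, only the two-character pattern with the trailing $ selects exactly the direct children 0000 and 0001; the one-character pattern selects nothing, and the unanchored pattern also lets the grandchildren through.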

regarding 3. :

you have to escape the backslashes in the regex string (as you would, e.g., in Java):

RowFilter(=,'regexstring:^\\d{4}$')
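A likely reason the doubled backslash is needed: the FILTER value is a double-quoted Ruby string, and Ruby processes backslash escapes before the filter parser ever sees the text. The sketch below only demonstrates the Ruby string behaviour; that this is what happens inside the shell is an assumption, and the variable names are illustrative.

```ruby
# Double-quoted Ruby string with a single backslash: "\d" is not a recognized
# escape sequence, so Ruby drops the backslash and a plain "d" remains.
single = "RowFilter(=,'regexstring:^\d{4}$')"

# Doubling the backslash keeps one literal backslash in the resulting string,
# so the regex engine receives the digit class \d.
doubled = "RowFilter(=,'regexstring:^\\d{4}$')"

puts single
puts doubled
```

The first line printed contains `^d{4}$` (four literal "d" characters), the second contains `^\d{4}$`.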

regarding 1. :

I can only imagine that, without the trailing $, query optimization lets HBase return a range (which could be fast to find via hashing), whereas if you require an exact length, HBase has to check every entry in the relevant range again (with all the resources reserved and added to fulfil the task).

InLaw