11

I know that the fuzzy row filter takes two parameters: the first is a row key and the second is the fuzzy info (a mask marking which key positions are fixed). What I understood from the corresponding Java class, FuzzyRowFilter, is that the filter evaluates the current row, tries to compute the next higher row key that will match the fuzzy pattern, and jumps over the non-matching keys.

I am unable to understand the following things:

How does the scan jump over certain row keys? Does it use a Get to fetch and compare the current row key? If it does jump, how does the scan know where the next matching row key is without doing a full scan?

Igor Katkov

2 Answers

14

You understood everything correctly.

For those who came here from a web search, here are two links that explain how row skipping can be leveraged in general and how it is done in FuzzyRowFilter in particular:

  1. HBase FuzzyRowFilter: Alternative to Secondary Indexes
  2. Filters in HBase (or intra row scanning part II)

If a filter determines that the current key cannot match and that it needs to skip ahead:

  1. Filter returns SEEK_NEXT_USING_HINT
  2. Region Server calls getNextCellHint which returns a suggested Cell
  3. Region Server performs exactly the same routine for finding a key as it did for the first key in the scan: it examines the available HFiles, checking whether the key in question is there
    1. Region Server reads the "trailer" section of each file to get the offsets of the metadata blocks
    2. Region Server reads the Meta and FileInfo metadata block types to avoid reading the binary data from the HFile if there’s no chance that the key is present (Bloom Filter), if the file is too old (Max SequenceId) or if the file is too new (Timerange) to contain what we’re looking for. See more about HFile format here
    3. Should the key be inside the HFile, Region Server uses the DataBlock index segments to compute the offset of the data block that holds the key in question
    4. If the data block with the key already happens to be in the Region Server block cache, the next step is skipped
    5. The data block is read from the HFile
    6. Region Server finally scans the keys one by one until it hits the target one
  4. The found key, and potentially the whole row (depending on the filter), is passed to the filter code
  5. The whole cycle repeats
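
The hint-and-seek cycle above can be sketched in plain Java without any HBase classes. This is only an illustration under loud assumptions: row keys are fixed-length strings of decimal digits, `?` marks a fuzzy position, a `TreeSet` stands in for the sorted store, and `TreeSet.ceiling` stands in for the Region Server seek. The names `SeekHintSketch`, `matches`, `nextHint` and `scan` are hypothetical and are not part of the HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class SeekHintSketch {

    // True if every fixed (non-'?') position of the pattern equals the row.
    static boolean matches(String row, String pattern) {
        if (row.length() != pattern.length()) return false;
        for (int i = 0; i < row.length(); i++) {
            char p = pattern.charAt(i);
            if (p != '?' && p != row.charAt(i)) return false;
        }
        return true;
    }

    // Smallest key >= row that matches the pattern, or null if there is none.
    // Assumes digit-only keys of the same length as the pattern.
    static String nextHint(String row, String pattern) {
        int n = pattern.length();
        int i = 0; // first fixed position where row and pattern differ
        while (i < n && (pattern.charAt(i) == '?' || pattern.charAt(i) == row.charAt(i))) i++;
        if (i == n) return row; // row already matches
        char[] hint = row.toCharArray();
        int from;
        if (row.charAt(i) < pattern.charAt(i)) {
            from = i; // raising position i to the pattern value is enough
        } else {
            // Must bump the rightmost earlier fuzzy digit that can still grow.
            int j = i - 1;
            while (j >= 0 && !(pattern.charAt(j) == '?' && row.charAt(j) < '9')) j--;
            if (j < 0) return null; // no larger matching key exists
            hint[j]++;
            from = j + 1;
        }
        // Everything after the changed position takes its smallest legal value.
        for (int k = from; k < n; k++) {
            hint[k] = pattern.charAt(k) == '?' ? '0' : pattern.charAt(k);
        }
        return new String(hint);
    }

    // Scan loop: on a mismatch, ask for a hint and seek instead of stepping.
    static List<String> scan(NavigableSet<String> store, String pattern, List<String> visited) {
        List<String> hits = new ArrayList<>();
        String cur = store.isEmpty() ? null : store.first();
        while (cur != null) {
            visited.add(cur);
            if (matches(cur, pattern)) {
                hits.add(cur);
                cur = store.higher(cur); // plain "next row"
            } else {
                String hint = nextHint(cur, pattern); // plays the role of SEEK_NEXT_USING_HINT
                cur = hint == null ? null : store.ceiling(hint); // the seek
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> store = new TreeSet<>(Arrays.asList(
            "0034", "0035", "0050", "0100", "0134", "0999", "1234", "1300", "1500", "2034"));
        List<String> visited = new ArrayList<>();
        System.out.println(scan(store, "??34", visited)); // [0034, 0134, 1234, 2034]
        System.out.println("rows touched: " + visited);
    }
}
```

Keys such as 0050 and 0100 are never handed to the matching code, because the seek lands past them. In a real Region Server the seek goes through the HFile block index as described in the steps above rather than through an in-memory set.
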
Igor Katkov
0

The first thing to know about HBase row keys is that they are kept in lexicographically sorted order; the hbase:meta table records which region serves each key range. So when a fuzzy row filter is applied, the scan can directly skip all the values that do not match the row key.

Now all it has to do is select the row keys and then scan through the uncertain parts of the key.

E.g., if your row key range is 123456689 - 123456889, then your fuzzy row filter will be 123456???. What happens here is that the fuzzy row filter skips to the first row that starts with 123456, so the range covered by the filter is 123456000 - 123456999.
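
The example can be checked with a small sketch of the mask comparison. It is plain Java, not the HBase implementation, and it assumes the 0/1 mask convention used for FuzzyRowFilter's fuzzy info (0 = the byte must match, 1 = any byte is accepted), equal-length keys, and digit-only keys when computing the bounds. `FuzzyMaskSketch`, `maskFor`, `matches` and `bound` are hypothetical helper names.

```java
import java.nio.charset.StandardCharsets;

final class FuzzyMaskSketch {

    // Build a mask from a pattern: 0 for a fixed position, 1 for a '?' position.
    static byte[] maskFor(String pattern) {
        byte[] mask = new byte[pattern.length()];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (byte) (pattern.charAt(i) == '?' ? 1 : 0);
        }
        return mask;
    }

    // A row matches when all fixed positions equal the key; fuzzy positions are ignored.
    // Simplification: rows of a different length are rejected.
    static boolean matches(byte[] row, byte[] key, byte[] mask) {
        if (row.length != key.length) return false;
        for (int i = 0; i < key.length; i++) {
            if (mask[i] == 0 && row[i] != key[i]) return false;
        }
        return true;
    }

    // Lowest or highest digit-only key covered by the pattern, depending on fill.
    static String bound(String pattern, char fill) {
        return pattern.replace('?', fill);
    }

    public static void main(String[] args) {
        String pattern = "123456???";
        byte[] key = pattern.getBytes(StandardCharsets.US_ASCII);
        byte[] mask = maskFor(pattern);
        System.out.println(matches("123456689".getBytes(StandardCharsets.US_ASCII), key, mask)); // true
        System.out.println(matches("123457000".getBytes(StandardCharsets.US_ASCII), key, mask)); // false
        System.out.println(bound(pattern, '0') + " - " + bound(pattern, '9')); // 123456000 - 123456999
    }
}
```

Note that the pattern covers more than the requested range 123456689 - 123456889, so rows such as 123456000 also pass the mask and would have to be excluded by scan start and stop rows or by a second check.
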

Jijo