Why up to index_granularity * 2 extra rows will be read for ClickHouse primary key?

Question

According to ClickHouse documentation, when reading a single range of the primary key, up to index_granularity * 2 extra rows in each data block can be read.

Why index_granularity * 2 extra rows? I can't figure this out.

Take the (CounterID, Date) primary key as an example. In this case, the sorting and index can be illustrated as follows:
  Whole data:     [---------------------------------------------]
  CounterID:      [aaaaaaaaaaaaaaaaaabbbbcdeeeeeeeeeeeeefgggggggghhhhhhhhhiiiiiiiiikllllllll]
  Date:           [1111111222222233331233211111222222333211111112122222223111112223311122333]
  Marks:           |      |      |      |      |      |      |      |      |      |      |
                  a,1    a,2    a,3    b,3    e,2    e,3    g,1    h,2    i,1    i,3    l,3
  Marks numbers:   0      1      2      3      4      5      6      7      8      9      10
If the data query specifies:

CounterID in ('a', 'h'), the server reads the data in the ranges of marks [0, 3) and [6, 8).

CounterID IN ('a', 'h') AND Date = 3, the server reads the data in the ranges of marks [1, 3) and [7, 8).

Date = 3, the server reads the data in the range of marks [1, 10].

The examples above show that it is always more effective to use an index than a full scan.

A sparse index allows extra data to be read. When reading a single range of the primary key, up to index_granularity * 2 extra rows in each data block can be read.

Primary Keys and Indexes in Queries

vladimir · Answer 1 · 2021-01-05T08:06:12.387

Let's consider the edge case on this example:

  CounterID:      [aaaaaaaaaaaaaaaaaaaaabcdeeeeeeeeeeeeefggggg]
  Date:           [1111111222222233331233211111222222333211111]
  Marks:           |      |      |      |      |      |      |
                  a,1    a,2    a,3    b,3    e,2    e,3    g,1
  Marks numbers:   0      1      2      3      4      5      6

Query with condition CounterID = 'b' will read data in the range [2, 4) where just one row is target but (index_granularity * 2) are extra ones.

In this example index_granularity (it is the number of data rows between the marks of an index) is 6.

Why up to index_granularity * 2 extra rows will be read for ClickHouse primary key?

1 Answers1