Why does Spanner perform a full table scan when using an underscore in a LIKE, while using % leverages the index?

Question

In a query, if I use LIKE '<value>%' on the primary key it performs well, using the index:

Operator | Rows returned | Executions | Latency
-- | -- | -- | --
Serialize Result | 32 | 1 | 1.80 ms
Sort | 32 | 1 | 1.78 ms
Hash Aggregate | 32 | 1 | 1.73 ms
Distributed union | 32 | 1 | 1.61 ms
Hash Aggregate | 32 | 1 | 1.56 ms
Distributed union | 128 | 1 | 1.34 ms
Compute | - | - | -
FilterScan | 128 | 1 | 1.33 ms
Table Scan: <tablename> | 128 | 1 | 1.30 ms

However, using LIKE '<value>_' performs a full table scan:

Operator | Rows returned | Executions | Latency
-- | -- | -- | --
Serialize Result | 32 | 1 | 76.27 s
Sort | 32 | 1 | 76.27 s
Hash Aggregate | 32 | 1 | 76.27 s
Distributed union | 32 | 1 | 76.27 s
Hash Aggregate | 32 | 2 | ~72.18 s
Distributed union | 128 | 2 | ~72.18 s
Compute | - | - | -
FilterScan | 128 | 2 | ~72.18 s
Table Scan: <tablename> (full scan: true) | 13802624 | 2 | ~69.97 s

The query looks like this:

SELECT
    'aggregated-quadkey AS quadkey' AS quadkey, day,
    SUM(a_value_1), SUM(a_value_2), AVG(a_value_3), SUM(a_value_4), SUM(a_value_5), AVG(a_value_6), AVG(a_value_6), AVG(a_value_7), SUM(a_value_8), SUM(a_value_9), AVG(a_value_10), SUM(a_value_11), SUM(a_value_12), AVG(a_value_13), AVG(a_value_14), AVG(a_value_15), SUM(a_value_16), SUM(a_value_17), AVG(a_value_18), SUM(a_value_19), SUM(a_value_20), AVG(a_value_21), AVG(a_value_22), AVG(a_value_23)
FROM <tablename>
WHERE quadkey LIKE '03201012212212322_'
GROUP BY quadkey, day ORDER BY day

score 5 · Answer 1 · answered Jul 17 '19 at 07:34

For a prefix matching LIKE pattern (column LIKE 'xxx%'), the query optimiser internally converts the condition into STARTS_WITH(column, 'xxx'), which then uses the index.

So the likely reason is that the query optimizer is not smart enough to convert an exact-length prefix-matching LIKE pattern

column LIKE 'xxx_'

into a combined condition:

(STARTS_WITH(column, 'xxx') AND CHAR_LENGTH(column)=4)

Similarly, a pattern such as

`column LIKE 'abc%def'`

is not optimised into the combined condition:

`(STARTS_WITH(column,'abc') AND ENDS_WITH(column,'def'))`.

You can work around this in your SQL generation by emitting the combined conditions above instead of the LIKE pattern.

(This is assuming that the LIKE pattern is a string value in the query, not a parameter - LIKE using a parameter cannot be optimised because the pattern is not known at query compile time.)
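
The workaround in SQL generation could be sketched roughly as follows. This is a minimal Python illustration, not anything Spanner provides: the function name `rewrite_like` is hypothetical, the column name is assumed to be a trusted identifier, and only patterns whose literal parts contain no quotes, backslashes, or wildcards are rewritten. For the prefix-and-suffix case it also adds a length check, because a value shorter than prefix plus suffix (for example `aba` against `ab%ba`) would otherwise pass both STARTS_WITH and ENDS_WITH without matching the LIKE pattern.

```python
import re
from typing import Optional

# Literal text that needs no escaping inside a single-quoted SQL string
# and contains no LIKE wildcards.
_LITERAL = r"[^%_\\']+"
_FIXED_LENGTH = re.compile(rf"({_LITERAL})(_+)")           # e.g. 'xxx_'
_PREFIX_SUFFIX = re.compile(rf"({_LITERAL})%({_LITERAL})")  # e.g. 'abc%def'


def rewrite_like(column: str, pattern: str) -> Optional[str]:
    """Return an equivalent predicate built on STARTS_WITH, or None.

    None means "keep the original LIKE": either the pattern is a plain
    prefix match ('xxx%'), which the answer says is already optimised,
    or it has a shape this sketch does not handle.
    `column` is assumed to be a trusted identifier, not user input.
    """
    fixed = _FIXED_LENGTH.fullmatch(pattern)
    if fixed:
        prefix, wildcards = fixed.groups()
        # Each '_' matches exactly one character, so the total length is fixed.
        length = len(prefix) + len(wildcards)
        return (
            f"(STARTS_WITH({column}, '{prefix}') "
            f"AND CHAR_LENGTH({column}) = {length})"
        )

    both = _PREFIX_SUFFIX.fullmatch(pattern)
    if both:
        prefix, suffix = both.groups()
        # The length guard rejects values where prefix and suffix overlap.
        min_length = len(prefix) + len(suffix)
        return (
            f"(STARTS_WITH({column}, '{prefix}') "
            f"AND ENDS_WITH({column}, '{suffix}') "
            f"AND CHAR_LENGTH({column}) >= {min_length})"
        )

    return None
```

Calling `rewrite_like("quadkey", "03201012212212322_")` yields a STARTS_WITH plus CHAR_LENGTH predicate that can replace the LIKE in the WHERE clause of the query from the question. Whether the resulting plan avoids the full scan should still be confirmed in the query execution plan.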

Yep, that's my assumption. I was just wondering whether it's just an optimizer limitation or whether there is some intrinsic issue that I can't see (`_` is more restrictive than `%`, so it shouldn't perform worse at all). PS: yes, the pattern is a value, not a parameter. – juanignaciosl, Jul 17 '19 at 08:12

score 4 · Accepted Answer · answered Jul 17 '19 at 07:55

Thank you for reporting this! I have added this rewrite to the backlog. In the meantime, you can use STARTS_WITH and CHAR_LENGTH to work around the issue, as RedPandaCurios suggested.

yongchul


You can find some info on [answer] to upgrade your answer. It could be interesting to know how you came up with your answer. – Stef Geysels Jul 17 '19 at 08:10
yongchul works on Google Cloud Spanner (from his profile) – RedPandaCurios Jul 18 '19 at 09:05
