13

Could someone explain such a big performance difference between these SQLs ?

SELECT count(*) as cnt FROM table WHERE name ~ '\*{3}'; -- Total runtime 12.000 - 18.000 ms
SELECT count(*) as cnt FROM table WHERE name ~ '\*\*\*'; -- Total runtime 12.000 - 18.000 ms
SELECT count(*) as cnt FROM table WHERE name LIKE '%***%'; -- Total runtime 5.000 - 7.000 ms

As you can see, the difference is more than double between LIKE operator and simple regular expression (I thought LIKE operator internally would be converted into the regular expression and there shouldn't be any difference)

There are almost 13000 rows there and the column "name" is of "text" type. There are no indexes related to the "name" column defined in the table.

EDIT:

EXPLAIN ANALYZE OF EACH OF THEM:

EXPLAIN ANALYZE SELECT count(*) as cnt FROM datos WHERE nombre ~ '\*{3}';

Aggregate  (cost=894.32..894.33 rows=1 width=0) (actual time=18.279..18.280 rows=1 loops=1)
  ->  Seq Scan on datos (cost=0.00..894.31 rows=1 width=0) (actual time=0.620..18.266 rows=25 loops=1)
        Filter: (nombre ~ '\*{3}'::text)
Total runtime: 18.327 ms

EXPLAIN ANALYZE SELECT count(*) as cnt FROM datos WHERE nombre ~ '\*\*\*';
Aggregate  (cost=894.32..894.33 rows=1 width=0) (actual time=17.404..17.405 rows=1 loops=1)
  ->  Seq Scan on datos  (cost=0.00..894.31 rows=1 width=0) (actual time=0.608..17.396 rows=25 loops=1)
        Filter: (nombre ~ '\*\*\*'::text)
Total runtime: 17.451 ms

EXPLAIN ANALYZE SELECT count(*) as cnt  FROM datos WHERE nombre LIKE '%***%';
Aggregate  (cost=894.32..894.33 rows=1 width=0) (actual time=4.258..4.258 rows=1 loops=1)
  ->  Seq Scan on datos  (cost=0.00..894.31 rows=1 width=0) (actual time=0.138..4.249 rows=25 loops=1)
        Filter: (nombre ~~ '%***%'::text)
Total runtime: 4.295 ms
Aldwoni
  • 1,168
  • 10
  • 24
dmikam
  • 992
  • 1
  • 18
  • 27
  • 2
    Show `explain analyze` for reach please. – Craig Ringer Apr 06 '15 at 08:21
  • @CraigRinger I added explain analyze of each query into the question's text – dmikam Apr 06 '15 at 08:55
  • 4
    To run a regex comparison is more expensive than to apply a dummy `LIKE` format. – zerkms Apr 06 '15 at 08:55
  • @zerkms the question is why? the answer "is more expensive than" is very generic and could be subjective. I would like to know the reason to use this knowledge in the future to optimize queries. – dmikam Apr 06 '15 at 08:58
  • @zerkms Ok, do you agree that this pattern `\*{3}\s*$` (matching only strings ending with ***) should be faster than `'%***%'` as it only needs to scan the end of the string? Well it still doubles the time of LIKE query. Even `%*%*%*%` LIKE pattern is faster than each of those regex queries (on my opinion it is much more complex to analyse than 'line ending' regex pattern) – dmikam Apr 06 '15 at 09:18
  • 1
    @dmikam I do not agree - regex is harder to parse and harder to apply. If you don't agree with me - try to implement both `LIKE` and PCRE-compatible engines. Then compare which one took more effort and works slower. "matching only strings ending with ***" --- nope, there are also whitespaces there. – zerkms Apr 06 '15 at 09:20
  • PostgreSQL version please - `SELECT version();` – Craig Ringer Apr 06 '15 at 11:49
  • `SELECT version();` PostgreSQL 9.1.15 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit – dmikam Apr 06 '15 at 11:50
  • 1
    The three queries have a small footprint, and are completely served from memory/buffers. That's why the CPU cost dominates the total cost. Once the data has to be pulled from disk, the cost will be dominated by seektime and I/O, and the queries will roughly perform the same. (at least: that is what I expect) – wildplasser Apr 06 '15 at 13:33
  • @wildplasser Made another test with a table of 400.000 rows. Results: ~ '\*{3}' -- 250.000ms – dmikam Apr 06 '15 at 13:56
  • That was the first result? The other two tests still running? BTW: Are you aware of *cache-warming* effects? – wildplasser Apr 06 '15 at 14:08
  • Well, I'll publish my small investigation as an answer... but if someone has better explanation - you are welcome. – dmikam Apr 06 '15 at 14:41

1 Answers1

18

The text LIKE text operator (~~) is implemented by specific C code in like_match.c. It's ad-hoc code that is completely independent from regular expressions. Looking at the comments, it's obviously specially optimized to implement only % and _ as wildcards, and short-circuiting to an exit whenever possible, whereas a regular expression engine is more complex by several orders of magnitude.

Note that in your test case , just like the regexp is suboptimal compared to LIKE, LIKE is probably suboptimal compared to strpos(name, '***') > 0

strpos is implemented with the Boyer–Moore–Horspool algorithm which is optimized for large substrings with few partial matches in the searched text.

Internally these functions are reasonably optimized but when there are several methods to the same goal, choosing the likely best is still the job of the caller. PostgreSQL will not analyze for us the pattern to match and switch a regexp into a LIKE or a LIKE into a strpos based on that analysis.

Juergen
  • 12,378
  • 7
  • 39
  • 55
Daniel Vérité
  • 58,074
  • 15
  • 129
  • 156
  • Note that several regex engines test the string before starting the engine walk with the Boyer-Moore algorithm too when the pattern contains or starts with a literal string (as a pre-optimization to fail faster). I don't know if it is the case with Postgres. – Casimir et Hippolyte Apr 06 '15 at 18:47
  • I looked through like_match.c and it is definitely much more simple then regex(even if it is still recursive). So my initial mistake was thinking that LIKE internally implemented using regex or something very similar. Naturally strpos algorithm is even "cheaper" but in my rough comparison it confirmed that the CPU cost of chosen string analysis algorithm could potentially create that difference in execution time I had. – dmikam Apr 07 '15 at 08:03