
Postgres 9.4 has been generating a pretty poor query plan for a full-text query with LIMIT 10 at the end:

SELECT * FROM Tbl 
WHERE to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword')  
LIMIT 10

this generates:

"Limit  (cost=0.00..306.48 rows=10 width=702) (actual time=5470.323..7215.486 rows=3 loops=1)"
"  ->  Seq Scan on tbl  (cost=0.00..24610.69 rows=803 width=702) (actual time=5470.321..7215.483 rows=3 loops=1)"
"        Filter: (to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword'::text))"
"        Rows Removed by Filter: 609661"
"Planning time: 0.436 ms"
"Execution time: 7215.573 ms"

using an index defined by:

CREATE INDEX fulltext_idx
ON Tbl
USING gin
(to_tsvector('english'::regconfig, ginIndexedColumn));

and it takes 5 or 6 seconds to execute. Even LIMIT 12 is slow.

However, the same query with LIMIT 13 (the lowest limit that hits the index)

SELECT * FROM Tbl 
WHERE to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword')  
LIMIT 13

hits the index just fine and takes a few thousandths of a second. See output below:

"Limit  (cost=350.23..392.05 rows=13 width=702) (actual time=2.058..2.062 rows=3 loops=1)"
"  ->  Bitmap Heap Scan on tbl  (cost=350.23..2933.68 rows=803 width=702) (actual time=2.057..2.060 rows=3 loops=1)"
"        Recheck Cond: (to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword'::text))"
"        Heap Blocks: exact=2"
"        ->  Bitmap Index Scan on fulltext_idx  (cost=0.00..350.03 rows=803 width=0) (actual time=2.047..2.047 rows=3 loops=1)"
"              Index Cond: (to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword'::text))"
"Planning time: 0.324 ms"
"Execution time: 2.145 ms"

The reason the query plan is poor is that the word is rare: there are only 2 or 3 records in the whole 610K-record table that satisfy the query, so the sequential scan the query optimizer picks has to scan the whole table before the limit is ever filled. The sequential scan would obviously be quite fast if the word were common, because the limit would be filled in no time.
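
In fact, the crossover at 13 falls straight out of the cost numbers above (a back-of-the-envelope check): the sequential scan is costed at 24610.69 for an estimated 803 matching rows, so filling a limit of n rows is charged about n/803 * 24610.69 ≈ 30.65*n (for n = 10 that is exactly the 306.48 in the LIMIT 10 plan). The bitmap plan charges 350.23 up front plus (2933.68 - 350.23)/803 ≈ 3.22 per row (350.23 + 13*3.22 ≈ 392.05, the LIMIT 13 cost). The two lines cross at 350.23 / (30.65 - 3.22) ≈ 12.8, which is exactly why LIMIT 12 still gets the sequential scan and LIMIT 13 flips to the index.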

Obviously, this little bug is no big deal. I'll simply use LIMIT 13 instead of 10. What's three more items? But it took me so long to realize that the LIMIT clause might affect whether the index is hit. I'm worried that there might be other little surprises in store with other SQL functions that prevent the index from being hit. What I'm looking for is assistance in tweaking Postgres to hit the GIN index all the time, instead of only sometimes, for this particular table.

I'm quite willing to forgo possibly cheaper queries if I could be satisfied that the index is always being hit. It's incredibly fast. I don't care to save any more microseconds.

Isaac Bolinger
  • Please add the output of `explain analyze`. For just 800 rows, using an index seems hardly necessary. –  Oct 16 '15 at 13:41
  • There should be 600k rows; I don't know where it gets 803 from. – Isaac Bolinger Oct 16 '15 at 15:46
  • I added 'analyze' to the explain output for you. The 803 is what's left after the ~610k are filtered out. – Isaac Bolinger Oct 16 '15 at 15:54
  • Sounds as if your statistics aren't up-to-date. Does it change when you run `analyze Tbl`? –  Oct 16 '15 at 16:22
  • No, there are definitely ~610k rows; I just ran `analyze tbl`. I loaded them in there last night to help test my site. There's definitely a bad query plan being generated. It's definitely as slow as 6 seconds. This problem sort of eluded me until I accidentally queried for a rare record. Perhaps it eluded the Postgres guys as well for the same reason. It's just a minor bug, that's all. – Isaac Bolinger Oct 16 '15 at 16:26

2 Answers


Well, it's obviously an incorrect selectivity estimate. The planner thinks the to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword') predicate will return 803 rows, but actually there are only 3.
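
To see the misestimate in isolation, run the query without the LIMIT and compare estimated vs. actual rows on the bitmap scan (same table and column as in the question):

EXPLAIN ANALYZE
SELECT * FROM Tbl
WHERE to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword');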

To tweak PostgreSQL to use the index, you can:

  • Rewrite the query, for example using a CTE, to postpone application of the LIMIT:

    WITH t AS (
        SELECT * FROM Tbl
        WHERE to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword')
    )
    SELECT * FROM t
    LIMIT 10;

    Of course, this makes the LIMIT do no work at all. (But with a GIN index the LIMIT is not as effective as it might be anyway, because GIN cannot fetch results tuple-by-tuple; it returns all the matching TIDs at once as a bitmap. See also gin_fuzzy_search_limit.)

  • Set enable_seqscan=off or increase seq_page_cost to discourage the planner from using sequential scans (doc); a transaction-scoped version is sketched after this list.

    It can, however, be undesirable if your query should use seqscans of other tables.

  • Use the pg_hint_plan extension; a hinted version of the query is also sketched below.
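
A minimal sketch of the enable_seqscan option, scoped to one transaction with SET LOCAL so other queries (and other sessions) keep their normal plans; table and query as in the question:

BEGIN;
SET LOCAL enable_seqscan = off;  -- reverts automatically at COMMIT/ROLLBACK

SELECT * FROM Tbl
WHERE to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword')
LIMIT 10;

COMMIT;

And, assuming pg_hint_plan is installed, a hint comment can force the bitmap scan for just this statement (hint names per the extension's documentation; fulltext_idx is the index defined in the question):

/*+ BitmapScan(Tbl fulltext_idx) */
SELECT * FROM Tbl
WHERE to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword')
LIMIT 10;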

Egor Rogov

Increasing the cost estimate of the to_tsvector function as described here will probably solve the problem. This cost will automatically be increased in the next release (9.5), so adopting that change early should be considered a rather safe tweak to make.
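
A minimal sketch of that tweak (the two-argument signature matches the expression used in the index definition; 100 is an assumed value, roughly what 9.5 adopts, and can be tuned):

ALTER FUNCTION to_tsvector(regconfig, text) COST 100;  -- the 9.4 default is far cheaper

With the function no longer costed as almost free, the sequential scan's estimated cost rises and the bitmap index scan should win even at LIMIT 10.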

jjanes