PostgreSQL 9.4 has been generating a pretty poor query plan for a full-text query with LIMIT 10 at the end:
SELECT * FROM Tbl
WHERE to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword')
LIMIT 10
this generates:
"Limit (cost=0.00..306.48 rows=10 width=702) (actual time=5470.323..7215.486 rows=3 loops=1)"
" -> Seq Scan on tbl (cost=0.00..24610.69 rows=803 width=702) (actual time=5470.321..7215.483 rows=3 loops=1)"
" Filter: (to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword'::text))"
" Rows Removed by Filter: 609661"
"Planning time: 0.436 ms"
"Execution time: 7215.573 ms"
using an index defined by:
CREATE INDEX fulltext_idx
ON Tbl
USING gin
(to_tsvector('english'::regconfig, ginIndexedColumn));
and it takes 5 or 6 seconds to execute. Even LIMIT 12 is slow.
However, the same query with LIMIT 13 (the lowest limit that hits the index)
SELECT * FROM Tbl
WHERE to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword')
LIMIT 13
hits the index just fine and takes a few thousandths of a second. See output below:
"Limit (cost=350.23..392.05 rows=13 width=702) (actual time=2.058..2.062 rows=3 loops=1)"
" -> Bitmap Heap Scan on tbl (cost=350.23..2933.68 rows=803 width=702) (actual time=2.057..2.060 rows=3 loops=1)"
" Recheck Cond: (to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword'::text))"
" Heap Blocks: exact=2"
" -> Bitmap Index Scan on fulltext_idx (cost=0.00..350.03 rows=803 width=0) (actual time=2.047..2.047 rows=3 loops=1)"
" Index Cond: (to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword'::text))"
"Planning time: 0.324 ms"
"Execution time: 2.145 ms"
The reason the query plan is poor is that the word is rare: there are only 2 or 3 records in the whole 610K-record table that satisfy the query, so the sequential scan the optimizer picks has to read the entire table before the limit is ever filled. The sequential scan would obviously be quite fast for a common word, because the limit would be filled in no time.
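The cost numbers in the two plans bear this out. As I understand it (my reading of the planner's behavior, not a documented formula), the planner scales each plan's cost by the fraction of its estimated rows that the LIMIT needs, so with the 803-row estimate (actual: 3) the seq scan looks cheapest up through LIMIT 12:

```sql
-- Reconstructing the planner's LIMIT arithmetic from the EXPLAIN output:
--   seq scan:    startup 0.00,   total 24610.69, est. 803 rows
--   bitmap scan: startup 350.23, total 2933.68,  est. 803 rows
SELECT n,
       round((0.00   + (24610.69 - 0.00)  * n / 803)::numeric, 2) AS seq_cost,
       round((350.23 + (2933.68 - 350.23) * n / 803)::numeric, 2) AS bitmap_cost
FROM generate_series(10, 13) AS n;
-- n = 10 gives seq_cost 306.48, exactly the Limit cost in the first plan;
-- n = 13 is the first value where seq_cost (398.43) exceeds bitmap_cost
-- (392.05), matching the observed cutover between LIMIT 12 and LIMIT 13.
```

The bitmap plan's large startup cost (350.23, mostly the index scan) is what makes the seq scan look cheaper for small limits; the seq scan starts returning rows "immediately" on paper.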
Obviously, this little bug is no big deal. I'll simply use LIMIT 13 instead of 10. What's three more items? But it took me a long time to realize that the LIMIT clause could affect whether the query hits the index, and I'm worried there may be other little surprises in store with other SQL features that prevent the index from being used. What I'm looking for is help tweaking Postgres to hit the GIN index all the time, instead of only sometimes, for this particular table.
I'm quite willing to forgo possibly cheaper query plans if I can be confident the index is always used. It's incredibly fast; I don't care to save any more microseconds.
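The closest thing I've found so far is disabling sequential scans for just this query (a sketch of the kind of tweak I mean, not something I've settled on; my assumption is that SET LOCAL reverts at COMMIT/ROLLBACK so nothing else is affected):

```sql
-- Heavy-handed per-transaction workaround: forbid seq scans so the
-- planner has to fall back to the GIN index for this one query.
BEGIN;
SET LOCAL enable_seqscan = off;
SELECT * FROM Tbl
WHERE to_tsvector('english'::regconfig, ginIndexedColumn) @@ to_tsquery('rareword')
LIMIT 10;
COMMIT;
```

This feels like a sledgehammer, though, so I'd welcome something more targeted (per-table, ideally).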