I am experimenting with a full-text search system over my PostgreSQL database, where I am using tsvectors
with ts_rank()
to pull out relevant items to a user search query. In general this works really fantastic as a simple solution (i.e. no major overhead infrastructure). However, I am finding that the ts_headline()
component (which gives users context for the relevance of the search results) is slowing down my queries significantly, by an order of about 10x. I wanted to inquire what is the best way to use ts_headline()
without incurring computational expense.
To give an example, here is a very fast tsvector search that does not use ts_headline()
. For context, my table has two relevant fields, search_text
which has the natural-language text which is being searched against, and search_text_tsv
which is a tsvector
that is directly queried against (and also used to rank the item). When I use ts_headline()
, it references the main search_text
field in order to produce a user-readable headline. Further, the column search_text_tsv
is indexed using GIN, which provides very fast lookups for @@ websearch_to_tsquery('my query here')
.
Again, here is query #1:
SELECT
item_id,
title,
author,
search_text,
ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my query text here')), 1) as rank
FROM search_index
WHERE search_text_tsv @@ websearch_to_tsquery(unaccent('my query text here'))
ORDER BY rank DESC
LIMIT 20 OFFSET 20
This gives me 20 top results very fast, running on my laptop about 50ms.
Now, query #2 uses ts_headline()
to produce a user-readable headline. I found that this was very slow when it ran against all possible search results, so I used a sub-query to produce the top 20 results and then calculated the ts_headline()
only for those top results (as opposed to, say, 1000 possible results).
SELECT *,
ts_headline(search_text,websearch_to_tsquery(unaccent('my query text here')),'StartSel=<b>,StopSel=</b>,MaxFragments=2,' || 'FragmentDelimiter=...,MaxWords=10,MinWords=1') AS headline
FROM (
SELECT
item_id,
title,
author,
search_text,
ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my query text here')), 1) as rank
FROM search_index
WHERE search_text_tsv @@ websearch_to_tsquery(unaccent('my query text here'))
ORDER BY rank DESC
LIMIT 20 OFFSET 20) as foo
Basically, what this does is limits the # of results (as in the first query), and then uses that as a sub-query, returning all of the columns in the subquery (i.e. *
) and also the ts_headline()
calculation. However, this is very slow, by an order of magnitude of about 10, coming in at around 800ms on my laptop.
Is there anything I can do to speed up ts_headline()
? It seems pretty clear that this is what is slowing down the second query.
For reference, here are the query plans being produced by Postgresql (from EXPLAIN ANALYZE
):
Query plan 1: (straight full-text search)
Limit (cost=56.79..56.79 rows=1 width=270) (actual time=66.118..66.125 rows=20 loops=1)
-> Sort (cost=56.78..56.79 rows=1 width=270) (actual time=66.113..66.120 rows=40 loops=1)
Sort Key: (ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my search query here'::text)), 1)) DESC
Sort Method: top-N heapsort Memory: 34kB
-> Bitmap Heap Scan on search_index (cost=52.25..56.77 rows=1 width=270) (actual time=1.070..65.641 rows=462 loops=1)
Recheck Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
Heap Blocks: exact=424
-> Bitmap Index Scan on idx_fts_search (cost=0.00..52.25 rows=1 width=0) (actual time=0.966..0.966 rows=462 loops=1)
Index Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
Planning Time: 0.182 ms
Execution Time: 66.154 ms
Query plan 2: (full text search w/ subquery & ts_headline()
)
Subquery Scan on foo (cost=56.79..57.31 rows=1 width=302) (actual time=116.424..881.617 rows=20 loops=1)
-> Limit (cost=56.79..56.79 rows=1 width=270) (actual time=62.470..62.497 rows=20 loops=1)
-> Sort (cost=56.78..56.79 rows=1 width=270) (actual time=62.466..62.484 rows=40 loops=1)
Sort Key: (ts_rank(search_index.search_text_tsv, websearch_to_tsquery(unaccent('my search query here'::text)), 1)) DESC
Sort Method: top-N heapsort Memory: 34kB
-> Bitmap Heap Scan on search_index (cost=52.25..56.77 rows=1 width=270) (actual time=2.378..62.151 rows=462 loops=1)
Recheck Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
Heap Blocks: exact=424
-> Bitmap Index Scan on idx_fts_search (cost=0.00..52.25 rows=1 width=0) (actual time=2.154..2.154 rows=462 loops=1)
Index Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
Planning Time: 0.350 ms
Execution Time: 881.702 ms