4

I am experimenting with a full-text search system over my PostgreSQL database, where I am using tsvectors with ts_rank() to pull out relevant items to a user search query. In general this works really fantastic as a simple solution (i.e. no major overhead infrastructure). However, I am finding that the ts_headline() component (which gives users context for the relevance of the search results) is slowing down my queries significantly, by an order of about 10x. I wanted to inquire what is the best way to use ts_headline() without incurring computational expense.

To give an example, here is a very fast tsvector search that does not use ts_headline(). For context, my table has two relevant fields, search_text which has the natural-language text which is being searched against, and search_text_tsv which is a tsvector that is directly queried against (and also used to rank the item). When I use ts_headline(), it references the main search_text field in order to produce a user-readable headline. Further, the column search_text_tsv is indexed using GIN, which provides very fast lookups for @@ websearch_to_tsquery('my query here').

Again, here is query #1:

SELECT
    item_id,
    title,
    author,
    search_text,
    ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my query text here')), 1) as rank
FROM search_index
WHERE search_text_tsv @@ websearch_to_tsquery(unaccent('my query text here'))
ORDER BY rank DESC
LIMIT 20 OFFSET 20

This gives me 20 top results very fast, running on my laptop about 50ms.

Now, query #2 uses ts_headline() to produce a user-readable headline. I found that this was very slow when it ran against all possible search results, so I used a sub-query to produce the top 20 results and then calculated the ts_headline() only for those top results (as opposed to, say, 1000 possible results).

SELECT *,
    ts_headline(search_text,websearch_to_tsquery(unaccent('my query text here')),'StartSel=<b>,StopSel=</b>,MaxFragments=2,' || 'FragmentDelimiter=...,MaxWords=10,MinWords=1') AS headline
FROM (
SELECT
    item_id,
    title,
    author,
    search_text,
    ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my query text here')), 1) as rank
FROM search_index
WHERE search_text_tsv @@ websearch_to_tsquery(unaccent('my query text here'))
ORDER BY rank DESC
LIMIT 20 OFFSET 20) as foo

Basically, what this does is limits the # of results (as in the first query), and then uses that as a sub-query, returning all of the columns in the subquery (i.e. *) and also the ts_headline() calculation. However, this is very slow, by an order of magnitude of about 10, coming in at around 800ms on my laptop.

Is there anything I can do to speed up ts_headline()? It seems pretty clear that this is what is slowing down the second query.

For reference, here are the query plans being produced by Postgresql (from EXPLAIN ANALYZE):

Query plan 1: (straight full-text search)

Limit  (cost=56.79..56.79 rows=1 width=270) (actual time=66.118..66.125 rows=20 loops=1)
   ->  Sort  (cost=56.78..56.79 rows=1 width=270) (actual time=66.113..66.120 rows=40 loops=1)
         Sort Key: (ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my search query here'::text)), 1)) DESC
         Sort Method: top-N heapsort  Memory: 34kB
         ->  Bitmap Heap Scan on search_index  (cost=52.25..56.77 rows=1 width=270) (actual time=1.070..65.641 rows=462 loops=1)
               Recheck Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
               Heap Blocks: exact=424
               ->  Bitmap Index Scan on idx_fts_search  (cost=0.00..52.25 rows=1 width=0) (actual time=0.966..0.966 rows=462 loops=1)
                     Index Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
 Planning Time: 0.182 ms
 Execution Time: 66.154 ms

Query plan 2: (full text search w/ subquery & ts_headline())

 Subquery Scan on foo  (cost=56.79..57.31 rows=1 width=302) (actual time=116.424..881.617 rows=20 loops=1)
   ->  Limit  (cost=56.79..56.79 rows=1 width=270) (actual time=62.470..62.497 rows=20 loops=1)
         ->  Sort  (cost=56.78..56.79 rows=1 width=270) (actual time=62.466..62.484 rows=40 loops=1)
               Sort Key: (ts_rank(search_index.search_text_tsv, websearch_to_tsquery(unaccent('my search query here'::text)), 1)) DESC
               Sort Method: top-N heapsort  Memory: 34kB
               ->  Bitmap Heap Scan on search_index  (cost=52.25..56.77 rows=1 width=270) (actual time=2.378..62.151 rows=462 loops=1)
                     Recheck Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
                     Heap Blocks: exact=424
                     ->  Bitmap Index Scan on idx_fts_search  (cost=0.00..52.25 rows=1 width=0) (actual time=2.154..2.154 rows=462 loops=1)
                           Index Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
 Planning Time: 0.350 ms
 Execution Time: 881.702 ms

Jason
  • 14,517
  • 25
  • 92
  • 153
  • Please show `EXPLAIN (ANALYZE, BUFFERS)`, and if possible turn on track_io_timing first. – jjanes Mar 04 '22 at 23:37
  • How big is a typical search_text? – jjanes Mar 04 '22 at 23:38
  • @jjanes `search_text` can be quite large depending on the text. Sometimes 1000s of words long. I wonder if it would be best if I skipped the sub query and just return the whole search text and let my Python code parse it to display some context based on the search query. It would speed up the DB query and give me more control over how expensive the headline creation would be n – Jason Mar 05 '22 at 12:49
  • 1
    Can you play with LIMIT and OFFSET to see if all 20 are quite slow, or if one of them is just pathologically slow? If the latter, could you share search_text and query text for that one? – jjanes Mar 05 '22 at 23:43

2 Answers2

0

Just encountered exactly the same issue. When collecting a list of search results (20-30 documents) and also getting their ts_headline highlight in the same query the execution time was x10 at least.

To be fair, Postgres documentation is warning about that [1]:

ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care.

I ended up getting the list of documents first and then loading the highlights with ts_headline asynchronously one-by-one. Still slow single queries (>150ms) but better user experience then waiting multiple seconds for an initial load.

[1] https://www.postgresql.org/docs/15/textsearch-controls.html#TEXTSEARCH-HEADLINE

perelin
  • 1,367
  • 1
  • 17
  • 32
0

I think I can buy you a few more milliseconds. In your query, you're returning "SELECT *, ts_headline" which includes the full original document search_text in the return. When I limited my SELECT to everything but the "search_text" from the subquery (+ ts_headline as headline), my queries dropped from 500-800ms to 100-400ms. I'm also using AWS RDS so that might play a role on my end.

Yunnosch
  • 26,130
  • 9
  • 42
  • 54
Dan K.
  • 1