How do I fetch similar posts from database where text length can be more than 2000 characters

Question

As far as I'm aware, there is no simple, quick solution. I am trying to do a full-text keyword or semantic search, which is a very advanced topic. There are dedicated search servers created specifically for that reason, but still is there a way that I can implement for a query execution time for less than a second?

Here's what I have tried so far:

begin;

SET pg_trgm.similarity_threshold = 0.3;

select
    id, <col_name>
    similarity(<column with gin index>,
    '<text to be searched>') as sml
from
    <table> p
where
    <clauses> and
 <indexed_col> % '<text to be searched>'    
 and indexed_col <-> '<text to be searched>' < 0.5
order by
indexed_col <-> '<text to be searched>'
limit 10;
 end;

Index created is as follows: CREATE INDEX trgm_idx ON posts USING gin (post_title_combined gin_trgm_ops);

The above query takes around 6-7 secs to execute and sometimes even 200 ms which is weird to me because it changes the query plan according to the input I pass in for similarity.

I tried ts_vector @@ ts_query, but they turn out to be too strict due to & operator.

EDIT: Here's the EXPLAIN ANALYZE of the above query

  ->  Sort  (cost=463.82..463.84 rows=5 width=321) (actual time=3778.726..3778.728 rows=0 loops=1)
        Sort Key: ((post_title_combined <-> 'Test text not to be disclosed'::text))
        Sort Method: quicksort  Memory: 25kB
        ->  Bitmap Heap Scan on posts p  (cost=404.11..463.77 rows=5 width=321) (actual time=3778.722..3778.723 rows=0 loops=1)
              Recheck Cond: (post_title_combined % 'Test text not to be disclosed'::text)
              Rows Removed by Index Recheck: 36258
              Filter: ((content IS NOT NULL) AND (is_crawlable IS TRUE) AND (score IS NOT NULL) AND (status = 1) AND ((post_title_combined <-> 'Test text not to be disclosed'::text) < '0.5'::double precision))
              Heap Blocks: exact=24043
              ->  Bitmap Index Scan on trgm_idx  (cost=0.00..404.11 rows=15 width=0) (actual time=187.394..187.394 rows=36916 loops=1)
                    Index Cond: (post_title_combined % 'Test text not to be disclosed'::text)
Planning Time: 8.782 ms
Execution Time: 3778.787 ms```

Could you share the results from EXPLAIN(ANALYZE, VERBOSE, BUFFERS) for the different queries? Without it, it's hard to help you — Frank Heikens, May 10 '22 at 19:17
@FrankHeikens I have added Explain Analyze in the Description. — Arnob, May 10 '22 at 19:25
Do you have one or more indexes on all or a combination of columns in the WHERE condition? There is a lot of filtering done after the trigram has been fetched. This filter is removing almost all results, the index did most of it's work for nothing. Could try to add these columns to the trigram index? Rows Removed by Index Recheck: 36258 — Frank Heikens, May 10 '22 at 20:09
Yes, I actually have 3 more where clauses that were necessary as can be seen in Filter: ((content IS NOT NULL) AND (is_crawlable IS TRUE) AND (score IS NOT NULL) AND (status = 1) AND ((post_title_combined <-> 'Test text not to be disclosed'::text) < '0.5'::double precision)). Though near operator was not a need since I am using % but even without it execution time is quite slow. But Ill try adding columns to trigram index, not sure how, currently I have gin indexing on just one column that is being used for similarity matching. — Arnob, May 11 '22 at 02:29

score 1 · Accepted Answer · answered May 11 '22 at 00:52

1

Your redundant/overlapping query conditions aren't helpful. Setting similarity_threshold=0.3 then doing

t % q and t <-> q < 0.5

just throws away index selectivity for no reason. Set similarity_threshold to as stringent of a value as you want to use, then get rid of the unnecessary <-> condition.

You could try the GiST version of trigram indexing. I can support the ORDER BY ... <-> ... LIMIT 10 operation directly from the index. I doubt it will be very effective with 2000 char strings, but it is worth a try.

answered May 11 '22 at 00:52

jjanes

37,812
5
27
34

Agree, I thought putting it will help me get results faster as doing t <-> q > 0.5 gives faster but less similar results, but even without near operator the result is quite slow. And I agree search on 2000+ characters makes things less effective. Setting threshold of 0.1 gives faster results and 0.8 too, everything in between takes around 3 to 6 seconds. – Arnob May 11 '22 at 02:47
I was just wondering how StackOverflow or Quora does this. – Arnob May 11 '22 at 02:48
Tried Gist SQL but getting, Error [54000]: ERROR: index row requires 14800 bytes, maximum size is 8191, probably due to large size. – Arnob May 11 '22 at 03:55
I'm not sure why postgres is adding all columns size for indexing purpose – Arnob May 11 '22 at 04:25
Not sure what you mean about adding all columns. Your example only has two columns, and one is presumably an int. You could ask a new question filling in the missing details. – jjanes May 12 '22 at 15:11

How do I fetch similar posts from database where text length can be more than 2000 characters

1 Answers1