
I have a PostgreSQL table with something like 200k tuples, so not that much. What I'm trying to do is filter out some rows and then order them using full-text matching:

SELECT * FROM descriptions as d 
WHERE d.category_id = ? 
AND d.description != '' 
AND regexp_replace(d.description, '(...)', '') !~* '...' 
AND regexp_replace(d.description, '...', '') !~* '...' 
AND d.id != ? 
ORDER BY ts_rank_cd(to_tsvector('english', name), plainto_tsquery('english', 'my search words')) DESC LIMIT 5 OFFSET 0;

There is a GIN index on the description field.
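(The index definition isn't included in the question; for full-text search a GIN index is typically an expression index along these lines, with the index name here just a guess:)

CREATE INDEX descriptions_description_fts_idx
  ON descriptions
  USING gin (to_tsvector('english', description));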

Now this query works well only when there are fewer than 4,000 or so records in the category. When it's more like 5k or 6k, the query gets extremely slow.

I have tried different variations of this query. What I noticed is that when I remove either the WHERE clause or the ORDER BY clause, I get a big speedup (but of course then I get irrelevant results).

What can I do to speed up this combination? Is there any way of optimizing it, or should I look for a solution outside PostgreSQL?

Additional question:

I'm experimenting further, and this is, for example, the simplest query that I think runs too slow. Can I tell from EXPLAIN ANALYZE when it uses the GiST index and when it doesn't?

SELECT d.*, d.description <-> 'banana' as dist FROM descriptions as d ORDER BY dist DESC LIMIT 5

"Limit  (cost=16046.88..16046.89 rows=5 width=2425) (actual time=998.811..998.813 rows=5 loops=1)"
"  ->  Sort  (cost=16046.88..16561.90 rows=206010 width=2425) (actual time=998.810..998.810 rows=5 loops=1)"
"        Sort Key: (((description)::text <-> 'banana'::text))"
"        Sort Method: top-N heapsort  Memory: 27kB"
"        ->  Seq Scan on products d  (cost=0.00..12625.12 rows=206010 width=2425) (actual time=0.033..901.260 rows=206010 loops=1)"
"Total runtime: 998.866 ms"`

Answered (kgrittn): The DESC keyword is not correct for KNN-GiST, and it's actually not wanted here. Removing it fixes the problem and gives the right results.

Lubiluk
  • Do you have any indices? What does the query planner tell you? –  May 07 '12 at 16:16
  • I have experimented with GIN and GiST indices, but it looks like they are not being used or something. How do I see what the planner tells me? EXPLAIN, you mean? – Lubiluk May 07 '12 at 19:55

2 Answers


For this type of application, we have been moving from the tsearch feature to the trigram feature; when you want to pick a small number of best matches, it is much faster. People here often prefer the semantics of the trigram similarity matching over the text-search ranking, anyway.

http://www.postgresql.org/docs/current/interactive/pgtrgm.html

"Borrowing" the later query from the edited question, formatting it, and including the index creation statement, to make the answer self-contained without a raft of comments:

CREATE INDEX descriptions_description_trgm
  ON descriptions
  USING gist (description gist_trgm_ops);

SELECT d.*, d.description <-> 'banana' as dist
  FROM descriptions as d
  ORDER BY dist LIMIT 5;

This should return rows from the GiST index in "distance" sequence until it hits the LIMIT.
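To the EXPLAIN ANALYZE part of the question: whether the index is actually used shows up directly in the plan. A quick check, as a sketch:

EXPLAIN ANALYZE
SELECT d.*, d.description <-> 'banana' as dist
  FROM descriptions as d
  ORDER BY dist LIMIT 5;

If the KNN-GiST index is used, the plan contains an "Index Scan using descriptions_description_trgm" node with no separate sort; the "Seq Scan" plus top-N heapsort shown in the question means it was not used.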

kgrittn
  • I've updated the answer based on this stream of comments and am deleting my prior comments, since they are now incorporated in the answer. (@Lubiluk, you might want to do so, too.) – kgrittn May 07 '12 at 23:14

An output of EXPLAIN ANALYZE for your query would be helpful. But my guess is that these regexp_replace lines are your problem. The Postgres planner just cannot know how many rows will match these two lines, so it guesses and plans the query based on that flawed guess.

I'd recommend creating a function like this:

create function good_description(text) returns boolean as $$
  select
   regexp_replace($1, '(...)', '') !~* '...' 
   and
   regexp_replace($1, '...', '') !~* '...'
$$ language sql immutable strict;

And then creating a partial index on an expression using this function:

create index descriptions_good_description_idx
  on descriptions (good_description(description))
  where description != '';

And then query in a way that allows Postgres to use this index:

SELECT * FROM descriptions as d 
WHERE d.category_id = ? 
AND d.description != '' 
AND good_description(d.description)
AND d.id != ? 
ORDER BY ts_rank_cd(
  to_tsvector('english', name),
  plainto_tsquery('english', 'my search words')
) DESC
LIMIT 5 OFFSET 0;
Tometzky
  • Looks like you're right here. Actually, EXPLAIN ANALYZE gives me different outputs depending on the number of descriptions in a category. I think it's impossible to create an index for my regexps because the expressions are parametrized and different versions are used, so I think I just need to create another table solely for keeping cached results of my regexp filters and update them each time I change the filter rules. – Lubiluk May 07 '12 at 22:58
  • The problem with regular expression filters in WHERE clauses is that it is completely impossible to predict, even roughly, how many rows will match before executing. If you have a limited number of these regexps you can, for example, add additional boolean columns to this table that say whether each regexp matches or not. These columns could be easily maintained using triggers. And they should not need indexes - the column statistics maintained automatically by the database should allow the planner to estimate the number of matching rows correctly if you filter by their values. This should be even faster than an index. – Tometzky May 08 '12 at 10:09
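A minimal sketch of the boolean-column-plus-trigger idea from the comment above; the column, function, and trigger names are made up, and the regexps are the placeholders from the question:

ALTER TABLE descriptions ADD COLUMN description_ok boolean;

CREATE FUNCTION descriptions_set_description_ok() RETURNS trigger AS $$
BEGIN
  -- precompute the regexp filters so the query only has to check a boolean
  NEW.description_ok :=
    regexp_replace(NEW.description, '(...)', '') !~* '...'
    AND regexp_replace(NEW.description, '...', '') !~* '...';
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER descriptions_description_ok_trg
  BEFORE INSERT OR UPDATE OF description ON descriptions
  FOR EACH ROW EXECUTE PROCEDURE descriptions_set_description_ok();

-- Existing rows need a one-off UPDATE to populate the new column. The query's
-- WHERE clause then becomes "AND d.description_ok", which the planner can
-- estimate from ordinary column statistics.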