
I have a table with about 100 million rows and a text field that I'd like to search over. I've come up with two methods for doing this and I'd like to know the performance implications of each method.

Method 1: This is the method recommended by every blog post I've seen online (e.g. 1 and 2). The idea is to augment the table with a tsvector column and index that new column.
A simple example is:

CREATE TABLE articles (
    id_articles BIGSERIAL PRIMARY KEY,
    text TEXT,
    text_tsv TSVECTOR
);
CREATE INDEX articles_index ON articles USING gin(text_tsv);

and then a trigger is used to keep the text_tsv column up to date with the text column.
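For concreteness, a minimal version of such a trigger (a sketch using PostgreSQL's built-in tsvector_update_trigger function; the trigger name and the 'english' configuration are just the choices I'd make) might look like:

CREATE TRIGGER articles_tsv_update
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(text_tsv, 'pg_catalog.english', text);
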
This seems wasteful to me, however, because the tsvector data must now be stored in both the table and the index, and the schema becomes more complicated. So I've come up with a second method.

Method 2: My idea is to eliminate the extra column and change the index to include the to_tsvector function directly, like so:

CREATE TABLE articles (
    id_articles BIGSERIAL PRIMARY KEY,
    text TEXT
);
CREATE INDEX articles_index ON articles USING gin(to_tsvector('english', text));
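
Note that the two-argument form of to_tsvector seems to be required in the index expression, since the one-argument form is not IMMUTABLE. My understanding is that, as with any expression index, a query then has to use the same expression (and configuration) for the index to be usable; the search term below is just an example:

SELECT id_articles
FROM articles
WHERE to_tsvector('english', text) @@ to_tsquery('english', 'example');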

Question: Are there any downsides to using Method 2 over Method 1?

For my particular database, I've used Method 2, and I appear to get a reasonable speedup for simple single-word queries (a search takes ~1 second). But when I run complex queries with several & and | operators in to_tsquery (and only ~10 matching rows in the table), the search takes forever (many hours). If I switch to Method 1, am I likely to see much faster query times for some reason?
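
For reference, the slow queries look roughly like this (the real terms are different, but the shape is the same):

SELECT id_articles
FROM articles
WHERE to_tsvector('english', text) @@ to_tsquery('english', '(foo | bar) & baz & qux');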

If the slow performance of my queries is not due to my choice of Method 2, is there anything else I might be able to do to speed up complex queries built with to_tsquery?

I'm using PostgreSQL 10.10.

Mike Izbicki
  • Check the explain output, maybe include that in the question if suitable. Indices can be less effective when used in the context of multiple predicates. As per the documentation: [With the extra column approach] searches will be faster, since it will not be necessary to redo the to_tsvector calls to verify index matches – w08r Jan 15 '20 at 06:32
  • 1
    Your method is just as good, or even better, since it avoids redundancy. Your slow query probably cannot use the index. Use `EXPLAIN (ANALYZE, BUFFERS)` to debug this. – Laurenz Albe Jan 15 '20 at 06:48
  • What if I add the index on the text column instead of text_tsv in Method 1? Will that work, or is the tsvector data type required for the GIN index? – Santanu Jul 14 '22 at 13:16

1 Answer


The downside of not storing the tsvector is that PostgreSQL has to recompute the tsvector from the raw text in order to "recheck" that a row actually matches the query. This can be very slow.

Rechecks are necessary if the bitmap of candidate matches grows larger than work_mem and becomes lossy. For some operators rechecks are always required, for example the phrase-search operators <->, <2>, etc.
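
To check whether this is what is costing the time, look at the bitmap heap scan node in the plan (the query below is only an illustration): "Rows Removed by Index Recheck" and lossy "Heap Blocks" counts are the signs that the tsvector is being recomputed for many rows.

EXPLAIN (ANALYZE, BUFFERS)
SELECT id_articles
FROM articles
WHERE to_tsvector('english', text) @@ to_tsquery('english', 'foo & bar');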

jjanes