
I'm developing a multitenant web application, and I want to add full text search, so that people will be able to:

1) search only the site they are currently visiting (but not all sites), and
2) search only a section of that site (e.g. restrict search to a blog or a forum on the site), and
3) search a single forum thread only.

I wonder which indexes I should add.

Please assume the database is huge (so that e.g. index-scanning-by-site-ID and then filtering-by-full-text-search is too slow).

I can think of three approaches:

  1. Create three indexes: 1) one that indexes everything on a per-site basis, 2) one that indexes everything on a per-site plus site-section basis, and 3) one that indexes everything on a per-site plus page-ID basis.

  2. Create one single index, and insert into [the text to index] magic words like "site_<site-id>", "section_<section-id>" and "page_<page-id>". Then, when I search section XX on site YYY, I could prefix the search query like so: "site_YYY AND section_XX AND ..." (see the sketch just after this list).

  3. Dynamically add database indexes when a new site or site section is created:

     create index dw1_posts__search_site_YYY
       on dw1_posts using gin(to_tsvector('english', approved_text))
       where site_id = 'YYY';
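
For concreteness, here is a minimal sketch of approach 2 (see also item 2 above), assuming dw1_posts has text columns site_id and section_id (section_id is a made-up column name here). To reduce the risk that the parser splits "site_YYY" at the underscore (the concern raised in the update at the end of this question), the sketch glues the ID straight onto the word, and the WHERE clause repeats the index expression so the expression index can be used:

    -- One expression index over the scoping words plus the page text:
    create index dw1_posts__search_magic
      on dw1_posts using gin (to_tsvector('english',
        'site' || site_id || ' section' || coalesce(section_id, 'none')
        || ' ' || coalesce(approved_text, '')));

    -- Search section XX on site YYY for "hello"; the WHERE expression has to
    -- match the indexed expression for the index to be usable:
    select *
      from dw1_posts
     where to_tsvector('english',
             'site' || site_id || ' section' || coalesce(section_id, 'none')
             || ' ' || coalesce(approved_text, ''))
       @@ to_tsquery('english', 'siteYYY & sectionXX & hello');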
    

Does any of these three approaches above make sense? Are there better alternatives?


(Details: However, perhaps approach 1 is impossible? Attempting to index a column and also index for full-text search at the same time results in syntax errors:

> create index dw1_posts__search_site
    on dw1_posts (site_id)
    using gin(to_tsvector('english', approved_text));
ERROR:  syntax error at or near "using"
LINE 1: ...dex dw1_posts__search_site on dw1_posts(site_id) using gin(...
                                                             ^

> create index dw1_posts__search_site
    on dw1_posts
    using gin(to_tsvector('english', approved_text))
    (site_id);
ERROR:  syntax error at or near "("
LINE 1: ... using gin(to_tsvector('english', approved_text)) (site_id);

(If approach 1 were possible, then I could do queries like:

select ... from ... where site_id = ... and <full-text-search-column> @@ <query>;

and have PostgreSQL first check site_id and then the full-text-search column, using one single index.)
End of details.)


Update, one week later: I'm using ElasticSearch instead. I got the impression that no scalable solution exists for faceted search with relational databases / PostgreSQL, and integrating with ElasticSearch seems to be roughly as simple as implementing, testing and tweaking the approaches suggested here. (For example, PostgreSQL's stemmer/whatever-it's-called might split "section_NNN" into two words, "section" and "NNN", and thus index words that don't exist on the page! Tricky to fix such small annoying issues.)
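
(For anyone who still wants to try the magic-words route: how the parser and dictionaries actually handle such a word can be checked with ts_debug; the lexemes column shows whether "section_123" is kept as one lexeme or split apart.)

    -- Inspect how the 'english' configuration tokenizes a magic word:
    select alias, token, lexemes
      from ts_debug('english', 'section_123 some page text');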

KajMagnus

2 Answers


The normal approach would be to create:

  • one full text index:

    CREATE INDEX idx1 ON dw1_posts USING gin(to_tsvector('english', approved_text));

  • a simple index on site_id:

    CREATE INDEX idx2 ON dw1_posts(site_id);

  • another simple index on page_id:

    CREATE INDEX idx3 ON dw1_posts(page_id);

Then it's the SQL planner's business to decide which ones to use, if any, and in what order, depending on the queries and the distribution of values in the columns. There is no point in trying to outsmart the planner before you've actually witnessed slow queries.
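
As a quick check (a sketch with placeholder values), EXPLAIN ANALYZE shows whether the planner uses idx1, idx2, or a bitmap AND of several indexes for the combined condition:

    explain analyze
    select *
      from dw1_posts
     where site_id = 'YYY'
       and to_tsvector('english', approved_text)
           @@ to_tsquery('english', 'hello');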

Daniel Vérité
  • Thanks, that's a good approach as long as there aren't very many pages in the database. But (am I correct?) it breaks down when there are really many pages: Postgres would either 1) do the full-text search and then filter by page/site ID, or 2) scan the page/site-ID index and then filter by the full-text search. — I'll update the question to clarify that I'm looking for something that would also work when there are really many pages in the database. – KajMagnus Jul 13 '13 at 17:47
  • You may implement [partitioning](http://www.postgresql.org/docs/current/static/ddl-partitioning.html) if the number of rows is too large. In which case indexes are local to each partition. But your question is centered on combining FTS with other indexes whereas partitioning is a more general topic. – Daniel Vérité Jul 13 '13 at 18:38
  • Partitioning could be a good approach. In my case, I think it'd require fairly much work to setup and test though. Also, since it's DDL, I wonder how it would work together with [Play Framework's automatic database evolutions module](http://www.playframework.com/documentation/2.1.2/Evolutions). – KajMagnus Jul 14 '13 at 03:18
  • Partitioning doesn't scale totally well, if the database becomes really huge — from [the docs](http://www.postgresql.org/docs/current/static/ddl-partitioning.html#DDL-PARTITIONING-CAVEATS): *"All constraints on all partitions ... are examined during constraint exclusion, so large numbers of partitions are likely to increase query planning time considerably. Partitioning ... will work well with up to perhaps a hundred partitions; don't try to use many thousands of partitions."* – KajMagnus Jul 14 '13 at 03:20
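
For reference, a minimal sketch of the partitioning idea from the comment above, using declarative list partitioning (PostgreSQL 10+) and indexes on partitioned tables (PostgreSQL 11+); at the time of this thread, the inheritance-based setup described in the linked docs would be needed instead. The table below is a hypothetical restructured copy of dw1_posts, not the author's actual schema:

    -- Partition by site_id so each site's rows, and its full-text index,
    -- live in their own partition.
    create table dw1_posts_part (
      site_id       text not null,
      page_id       text not null,
      approved_text text
    ) partition by list (site_id);

    create table dw1_posts_site_yyy
      partition of dw1_posts_part
      for values in ('YYY');

    -- An index created on the partitioned table is created on each partition:
    create index dw1_posts_part__search
      on dw1_posts_part using gin (to_tsvector('english', approved_text));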

Another alternative, which is similar to the "site_<site-id>" and "section_<section-id>" and "page_<page-id>" alternative, would be to prefix the text to index with:

SiteSectionPage_<site-id>_<section-id>_<subsection-id>_<page-id>

And then use prefix matching (i.e. :*) when searching:

select ... from ... where ... @@ 'SiteSectionPage_NN_MMM:* & (the search phrase)'

where NN is the site ID and MMM is the section ID.
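
For reference, here is how the prefix-match operator itself behaves, in a minimal self-contained sketch that leaves aside whether a long token like SiteSectionPage_NN_MMM survives the parser as a single lexeme (the same caveat as in the question's update):

    -- ':*' turns the last lexeme into a prefix match, and '&' is tsquery's AND,
    -- so 'sect:*' matches 'section' and the whole query evaluates to true:
    select to_tsvector('english', 'section 123 of the site')
           @@ to_tsquery('english', 'sect:* & site');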

But will this work with Chinese? I think trigrams are appropriate when indexing Chinese, but then SiteSectionPage... would be split into trigrams like Sit, ite, teS, eSe, which makes no sense.

KajMagnus