Optimizing fulltext search across multiple tables

Question

I want to search the requested term ($q) in my content table on the title & the keywords but also for the models, which are in another table and linked by a table in between. Also, I need to get the number of views in another table.

This is the query that I have been working on so far, the result is fine but it's way too slow (0.6s on average when I run it in PhpMyAdmin... We have millions of visitors per month)

SELECT DISTINCT SQL_CALC_FOUND_ROWS
    c.*,
    cv.views,
    (MATCH (c.title) AGAINST ('{$q}') * 3) Relevance1,
    MATCH (c.keywords) AGAINST ('{$q}') Relevance2,
    (MATCH (a.`name`) AGAINST ('{$q}') * 2) Relevance3
FROM
    content AS c
LEFT JOIN
    content_actors AS ca ON ca.content = c.record_num
LEFT JOIN
    actors AS a ON a.record_num = cm.actor
LEFT JOIN
    content_views AS cv ON cv.content = c.record_num
WHERE
    c.enabled = 1
GROUP BY c.title, c.length
HAVING (Relevance1 + Relevance2 + Relevance3) > 0
ORDER BY (Relevance1 + Relevance2 + Relevance3) DESC

The tables architecture looks like this:

content
record_num     title     keywords
1              Video1    Comedy, Action, Supercool
2              Video2    Comet

content_actors
content     model
1           1
1           2
2           1

actors
record_num     name
1              Jennifer Lopez
2              Bruce Willis

content_views
content     views
1           160
2           312

Here are the indexes I found by doing SHOW INDEX FROM tablename:

Table              Column_Name     Seq_in_index     Key_name     Index_type
---------------------------------------------------------------------------
content            record_num      1                PRIMARY      BTREE
content            keywords        1                keywords     FULLTEXT
content            keywords        2                title        FULLTEXT
content            title           1                title        FULLTEXT
content            description     1                description  FULLTEXT
content            keywords        1                keywords_2   FULLTEXT

content_actors     content         1                content      BTREE
content_actors     actor           2                content      BTREE
content_actor      actor           1                actor        BTREE

actors             record_num      1                PRIMARY      BTREE
actors             name            1                name         BTREE
actors             name            1                name_2       FULLTEXT

content_views      content         1                PRIMARY      BTREE
content_views      views           1                views        BTREE

Here is the EXPLAIN of the query:

ID     SELECT_TYPE     TABLE     TYPE       POSSIBLE_KEYS          KEY         ROWS      EXTRA
1      SIMPLE          c         ref        enabled_2, enabled     enabled     29210     Using where; Using temporary; Using filesort
1      SIMPLE          ca        ref        content                content     1         Using index
1      SIMPLE          a         eq_ref     PRIMARY                PRIMARY     1
1      SIMPLE          cv        eq_ref     PRIMARY                PRIMARY     1

I am using the GROUP BY to avoid duplicate content, but this group by alone seems to double the time required to process the query.

EDIT Well after playing with the query a bit, I realized that if I remove the GROUP BY I get duplicates, if I let the GROUP BY there, it doesn't take the proper Relevance3 value (without the GROUP BY, one is returning a value for Relevance3 while the other is not...)

basic rule of thumb: any field(s) used in a decision context (equality comparison, where clause, join, group by, etc...) should have an index on it. since you haven't shown any details how you've indexed your data, that's about all we can do. — Marc B, Feb 10 '16 at 14:16
do an `explain` of the query. if it's showing filesort, then you're bound by disk performance, and indexing will have very little effect. — Marc B, Feb 10 '16 at 14:50
I added the EXPLAIN to my original post, it does mention filesort... — Dacramash, Feb 10 '16 at 14:59
so, yeah. using filesort. the intermediate results are being cached to disk, and that's slow. — Marc B, Feb 10 '16 at 15:02
revamping the query, sometimes. but that's usually not possible. sometimes there's just nothing you can do - except throw "bigger" hardware at it. if the db decides that the intermediate results are too big to keep in memory, they'll be dumped to disk. in other words, "moar rams, plz" — Marc B, Feb 10 '16 at 15:05
Okay but SQL-wise would there be a better, more efficient way to do my query maybe ? — Dacramash, Feb 10 '16 at 15:07
other than moving `sql_calc_found_rows` (you don't have a `limit`, so it's a useless bit of overhead), I can't really see how the query could be revamped to become simpler. — Marc B, Feb 10 '16 at 15:10
I removed the LIMIT for stackoverflow, but there is one in the real thing ;) — Dacramash, Feb 10 '16 at 15:12

Rick James · Answer 1 · 2016-02-11T23:48:28.590

Add the MATCHes (OR'd together) to the WHERE -- this will cut back significantly on the number of rows to handle in SQL_CALC_FOUND_ROWS and eliminate the need for HAVING....

Instead of

cv.views,
...
LEFT JOIN  content_views AS cv ON cv.content = c.record_num

do

( SELECT views FROM content_views ON content = c.record_num ) AS views,

Edit

The LEFT and GROUP BY are needed because the actors are optional and there could be multiple multiple actors. Since you don't need the actor name at all, you can probably get rid of it by doing

WHERE ... AND ( EXISTS SELECT * 
                    FROM content_actors
                    JOIN actors AS a ON ...
                    WHERE MATCH (a.`name`) AGAINST ('{$q}')
                      AND ca...
              )

but that does not let you include the relevance in the ORDER BY.

So, you need to build a subquery with a UNION DISTINCT. There will be 2 SELECTs:

SELECT #1:

SELECT c.id,
       3 * MATCH(c.title) AGAINST ('{$q}')
       +   MATCH(c.keywords) AGAINST ('{$q}')  AS relevance
    FROM Content AS c
    WHERE MATCH(c.title, c.keywords) AGAINST ('{$q}')

(and have FULLTEXT(title, keywords)) This will efficiently fetch the ids forcontent` rows that are useful.

SELECT #2:

SELECT c.id,
       2*MAX(MATCH(a.actor) AGAINST ('{$q}') AS actor_rel) AS relevance
    FROM content AS c
    JOIN content_actors ca  ON ca.content = c.record_num
    JOIN actors a  ON a.record_num = ca.actor
    WHERE MATCH(a.actor) AGAINST ('{$q}')
    GROUP BY c.id;

Be sure to have content_actors: INDEX(actor) and content: INDEX(record_num). This SELECT will efficiently start with actors and work back to content. And note that it does something different than your code when two actors MATCH; hopefully my MAX is a better solution.

Now, let's put things together...

SELECT #3:

SELECT id, SUM(rel) AS relevance
    FROM ( ... select #1 ... )
         UNION ALL
         ( ... select #2 ... )
    GROUP BY id

But that is not quite all...

SELECT #4:

SELECT c.*,
       ( ... views ... ) AS views
    FROM ( ... select #3 ... ) AS u
    JOIN content c  ON c.id = u.id

I suggest you run each of these steps by hand to validate them, gradually putting all the pieces together. Yes, it is complex, but it should be quite fast.

Optimizing fulltext search across multiple tables

1 Answers1