0

In attempting to pull a large series of columns (~15-20) from several joined tables, I put together 2 views that would pull the necessary information. In my local DB (only ~1k posts rows), joining these views worked fine, however; when I created those same views on our production DB (~30k posts rows) and attempted to join the view, I realized that that solution wouldn't scale beyond a test dataset.

I attempted to migrate those 2 views (categories data—like categories.title—and creators' data—like users.display_name) into a CTE post_data which, in theory, would act as a keyed version of those views, and allow me to get all post data for the eligible posts.

I have put together a sample DBFiddle with some test data to explain the table structure. The actual data has many more columns, but this is representative of the joins necessary to build the query.

table : posts
+-----+-----------+------------+------------------------------------------+----------------------------------------+
| id  | parent_id | created_by |                 message                  |              attachments               |
+-----+-----------+------------+------------------------------------------+----------------------------------------+
|  8  | NULL      |          8 | laptop for sale                          | [{"media_id": 1380}]                   |
|  9  | NULL      |          4 | NEW lamp shade up for grabs              | [{"media_id": 1442}, {"link_id": 103}] |
|  10 | 1         |          7 | Oooh I could be interested               |                                        |
|  11 | 1         |          7 | DMing you now! I've been looking for one |                                        |
+-----+-----------+------------+------------------------------------------+----------------------------------------+

table : users
+----+------------------+---------------------------+
| id |   display_name   |        created_at         |
+----+------------------+---------------------------+
|  1 | John Appleseed   | 2018-02-20T00:00:00+00:00 |
|  2 | Massimo Jenkins  | 2018-05-14T00:00:00+00:00 |
|  3 | Johanna Marionna | 2018-06-05T00:00:00+00:00 |
|  4 | Jackson Creek    | 2018-11-15T00:00:00+00:00 |
|  5 | Joe Schmoe       | 2019-01-09T00:00:00+00:00 |
|  6 | John Johnson     | 2019-02-14T00:00:00+00:00 |
|  7 | Donna Madison    | 2019-05-14T00:00:00+00:00 |
|  8 | Jenna Kaplan     | 2019-06-23T00:00:00+00:00 |
+----+------------------+---------------------------+

table : categories
+----+------------+------------+-------------------------------------------------------+
| id | created_by |   title    |                      description                      |
+----+------------+------------+-------------------------------------------------------+
|  1 |          2 | Technology | Anything tech; Consumer, business or education tools! |
|  2 |          2 | Home Goods | Anything for the home                                 |
+----+------------+------------+-------------------------------------------------------+

table : categories_posts
+---------+-------------+
| post_id | category_id |
+---------+-------------+
|       8 |           1 |
|       9 |           1 |
|      10 |           1 |
|      11 |           1 |
+---------+-------------+

table : users_categories
+---------+-------------+
| user_id | category_id |
+---------+-------------+
|       1 |           1 |
|       2 |           1 |
|       3 |           1 |
|       4 |           1 |
+---------+-------------+

table : posts_removed
+---------+----------------------+------------+
| post_id |      removed_at      | removed_by |
+---------+----------------------+------------+
|      10 |  2019-01-22 09:08:14 |          7 |
+---------+----------------------+------------+

In the below query, eligible posts are determined in the base SELECT; then, the post_data CTE is joined to the result set (limited to 25 rows) and all columns from the CTE are returned.

WITH post_data AS (
    SELECT posts.id,
           posts.parent_id,
           posts.created_by,
           posts.attachments,
           categories_posts.category_id,
           categories.title,
           categories.created_by AS category_created_by,
           creator.display_name AS creator_display_name,
           creator.created_at AS creator_created_at
           /* ... And a whole bunch of other fields from posts, categories_posts, users */
    FROM posts
    LEFT OUTER JOIN categories_posts
        ON categories_posts.post_id = posts.id
    LEFT OUTER JOIN categories
        ON categories.id = categories_posts.category_id
    LEFT OUTER JOIN users creator
        ON creator.id = posts.created_by
    /* ... And a whole bunch of other joins to facilitate the selected fields */
)
SELECT post_data.*
FROM posts
        /* Set up the criteria for the posts selected before getting their data from the CTE */
    LEFT OUTER JOIN posts_removed removed ON removed.post_id = posts.id
    LEFT OUTER JOIN users user_me ON user_me.id = "1"
    LEFT OUTER JOIN users_followed ON users_followed.user_id = posts.created_by
        AND users_followed.followed_by = user_me.id
    LEFT OUTER JOIN categories_posts ON categories_posts.post_id = posts.id
    LEFT OUTER JOIN users_categories ON users_categories.category_id = categories_posts.category_id
    LEFT OUTER JOIN posts_removed pp_removed ON pp_removed.post_id = posts.parent_id
    /* Join our post_data on the post's ID */
    JOIN post_data ON post_data.id = posts.id
WHERE
(
    (
        users_categories.user_id = user_me.id AND users_categories.left_at IS NULL
    ) OR categories_posts.category_id IS NULL
) AND (
    posts.created_by = user_me.id
    OR users_followed.followed_by = user_me.id
    OR categories_posts.category_id IS NOT NULL
) AND removed.removed_at IS NULL
    AND pp_removed.removed_at IS NULL
    AND (post_data.id = posts.id OR post_data.id = posts.parent_id)
ORDER BY posts.id DESC
LIMIT 25

In theory, I thought this would work by selecting the rows based on the base select criteria, then doing an index scan for the CTE based on the Post ID; however, it seems that the query optimizer chooses instead to do a full table scan of the posts table.

The EXPLAIN SELECT gave me this information:

+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
| id | select_type |         table          |  type  |         possible_keys         |     key     | key_len |                     ref                     |  rows  | filtered |                       extra                        |
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
|  1 | PRIMARY     | posts                  | ALL    | PRIMARY,parent_id,created_by  |             |         |                                             |  33870 |      100 | Using temporary; Using filesort                    |
|  1 | PRIMARY     | removed                | eq_ref | PRIMARY                       | PRIMARY     |       8 | posts.id                                    |      1 |       19 | Using where                                        |
|  1 | PRIMARY     | user_me                | const  | PRIMARY                       | PRIMARY     |       8 | const                                       |      1 |      100 | Using where; Using index                           |
|  1 | PRIMARY     | categories_posts       | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.id                                |      1 |      100 |                                                    |
|  1 | PRIMARY     | categories             | eq_ref | PRIMARY                       | PRIMARY     |       8 | categories_posts.category_id                |      1 |      100 | Using index                                        |
|  1 | PRIMARY     | users_categories       | eq_ref | user_id_2,user_id,category_id | user_id_2   |      16 | user_me.id,api.categories_posts.category_id |      1 |      100 | Using where                                        |
|  1 | PRIMARY     | users_followed         | eq_ref | user_id,followed_by           | user_id     |      16 | posts.created_by,api.user_me.id             |      1 |      100 | Using where; Using index                           |
|  1 | PRIMARY     | pp_removed             | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.parent_id                         |      1 |       19 | Using where                                        |
|  1 | PRIMARY     | <derived2>             | ALL    |                               |             |         |                                             | 493911 |       19 | Using where; Using join buffer (Block Nested Loop) |
|  2 | DERIVED     | posts                  | ALL    |                               |             |         |                                             |  33870 |      100 | Using temporary                                    |
|  2 | DERIVED     | categories_posts       | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.id                                |      1 |      100 |                                                    |
|  2 | DERIVED     | categories             | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.categories_posts.category_id            |      1 |      100 |                                                    |
|  2 | DERIVED     | posts_votes            | ref    | post_id                       | post_id     |       8 | api.posts.id                                |      1 |      100 | Using index                                        |
|  2 | DERIVED     | pp                     | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.parent_id                         |      1 |      100 |                                                    |
|  2 | DERIVED     | pp_removed             | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.pp.id                                   |      1 |      100 | Using index                                        |
|  2 | DERIVED     | removed                | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.id                                |      1 |      100 | Using index                                        |
|  2 | DERIVED     | creator                | eq_ref | PRIMARY                       | PRIMARY     |       8 | api.posts.created_by                        |      1 |      100 |                                                    |
|  2 | DERIVED     | usernames              | ref    | user_id                       | user_id     |       8 | api.creator.id                              |      1 |      100 |                                                    |
|  2 | DERIVED     | verifications          | ALL    |                               |             |         |                                             |      4 |      100 | Using where; Using join buffer (Block Nested Loop) |
|  2 | DERIVED     | categories_identifiers | ref    | category_id                   | category_id |       8 | api.categories.id                           |      1 |      100 |                                                    |
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+

Beyond this, I tried refactoring my query to try and force key usage in the posts table, such as using FORCE INDEX(PRIMARY) in the select, and moving the CTE be the base query and adding a filter WHERE id IN ({the original base query}), but it seems the optimizer still does a full table scan.

In case it's helpful to decode what's happening in the query plan:

  • At time of writing, there are 33,387 posts rows, but the query plan shows
  • The query plan shows a full table scan which returns 33,870 rows
  • The query plan also shows the derived table (<derived2>) as having 493,911 rows

My core questions are:

  1. Am I correct when I say that subqueries should only be executed once per result row from the base select query? If so, then the CTE should also use the JOIN on posts.id and likely use the table index?

  2. Why does the query plan show that it selects 33,870 rows when there are only 33,387? And where do the 493,911 rows come from?

  3. How do you prevent a full table scan in this case?

Hugo Pakula
  • 385
  • 2
  • 4
  • 17
  • My core question would have been 'how do I optimize this query?' – Strawberry Jan 26 '20 at 22:54
  • @Strawberry that too! ;) – Hugo Pakula Jan 26 '20 at 22:55
  • 1
    One part of your query begins `SELECT post_data.* FROM posts`, and is then succeeded by a series of LEFT JOINs. Why are you LEFT JOINing tables froom which you select no data? – Strawberry Jan 26 '20 at 22:58
  • @Strawberry I don’t need the qualifying data, only to filter from those tables. If you look at the base `SELECT` query the `WHERE` removes non-qualifying posts. I also tried refactoring this query to get all the data at once AND filter the non-qualifying posts. There is still a full table scan present in that request! – Hugo Pakula Jan 26 '20 at 23:05
  • So, you are turning your left joins into inner joins with your where clause. However, the left joins may make mysql think that a full table scan is more optimal, since you seem to indicate that you want all records from your posts table. – Shadow Jan 26 '20 at 23:24
  • @Shadow are you saying it would do the full table scan for the CTE since there are no filters in it? Doesn't the `JOIN post_data ON post_data.id = posts.id` filter the CTE on a row-by-row basis? I.e. only an index search should be required – Hugo Pakula Jan 27 '20 at 00:39
  • Since post_data is in effect the posts table left joined by a bunch of other tables and you join this again on the posts table, I'm not sure why the optimiser should think to use an index in this self-join. – Shadow Jan 27 '20 at 01:01
  • How fast was the query without `JOINing` to the view (`WITH`)? – Rick James Jan 27 '20 at 01:44

1 Answers1

0

Give this a try... Do the LIMIT 25 before JOINing to the WITH:

SELECT * FROM
    ( SELECT ... FROM posts
               JOIN categories_posts ...
        ORDER BY posts.id DESC
        LIMIT 25 ) AS x
    JOIN post_data
       ON post_data.id IN (x.id, x.parent_id)
    ORDER BY posts.id DESC
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • I attempted this solution but it seems that as soon as I join the CTE, the query optimizer does a full table scan. Any thoughts on why it isn't using the merge strategy and is opting for a temptable, even when a `JOIN post_data ON post_data.id = posts.id` (or in your solution, the `JOIN ON post_data.id IN (x.id, x.parent_id)`) is present? – Hugo Pakula Jan 27 '20 at 15:48