3

I have ran into the age-old problem of MySQL refusing to use an index for seemingly basic stuff. The query in question:

SELECT c.*
FROM app_comments c
LEFT JOIN app_comments reply_c ON c.reply_to = reply_c.id
WHERE (c.external_id = '840774' AND c.external_context = 'deals')
 OR (reply_c.external_id = '840774' AND reply_c.external_context = 'deals')
ORDER BY c.reply_to ASC, c.date ASC

EXPLAIN:

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  c   ALL external_context,external_id,idx_app_comments_externals NULL    NULL    NULL    903507  Using filesort
1   SIMPLE  reply_c eq_ref  PRIMARY PRIMARY 4   altero_full.c.reply_to  1   Using where

There are indexes on external_id and external_context separately, and I also tried adding a composite index (idx_app_comments_externals), but that did not help at all.

The query executes in 4-6 seconds in production (>1m records), but removing the OR part of the WHERE condition decreases that to 0.05s (it still uses filesort though). Clearly indexes don't work here, but I have no idea why. Can anyone explain this?

P.S. We're using MariaDB 10.3.18, could that be at fault here?

jurchiks
  • 1,354
  • 4
  • 25
  • 55
  • 3
    I always refer to http://mysql.rjweb.org/doc.php/index_cookbook_mysql for indexing issues. About halfway down the page there's an `OR` section that describes how indexes using `or` are utilized. It looks like a solution is to instead use `UNION`. Also see https://stackoverflow.com/questions/52043444/query-takes-too-long-to-query-with-or-clause-but-their-parts-are-very-quick for an example. – WOUNDEDStevenJones Oct 21 '19 at 16:11
  • yes @WOUNDEDStevenJones UNION can be a good optimisation trick assuming the unioned data fits into the memory and non disk i/o is required.. Consider [Internal Temporary Table Use in MySQL](https://dev.mysql.com/doc/refman/8.0/en/internal-temporary-tables.html) – Raymond Nijland Oct 21 '19 at 16:34
  • `OR` has always been an Optimization problem, and will continue to be for the foreseeable future. One answer focuses on UNION as a workaround; another focuses on CTE, which is available starting in MySQL 8.0 and MariaDB 10.2. – Rick James Oct 21 '19 at 19:06

3 Answers3

2

With the equality predicates on external_id and external_context columns in the WHERE clause, MySQL could make effective use of an index... when those predicates specify the subset of rows that can possibly satisfy the query.

But with the OR added to the WHERE clause, now the rows to be returned from c are not limited by external_id and external_content values. It's now possible that rows with other values of those columns could be returned; rows with any values of those columns.

And that negates the big benefit of using an index range scan operation... very quickly eliminating vast swaths of rows from being considered. Yes, an index range scan is used to quickly locate rows. That is true. But the meat of the matter is that the range scan operation uses the index to quickly to bypass millions and millions of rows that can't possibly be returned.


This is not behavior specific to MariaDB 10.3. We are going to observe the same behavior in MariaDB 10.2, MySQL 5.7, MySQL 5.6.


I'm questioning the join operation: Is it necessary to return multiple copies of rows from c when there are multiple matching rows from reply_c ? Or is the specification to just return distinct rows from c ?


We can look at the required resultset as two parts.

1) the rows from app_contents with equality predicates on external_id and external_context

  SELECT c.*
    FROM app_comments c
   WHERE c.external_id       = '840774'
     AND c.external_context  = 'deals'
   ORDER
      BY c.external_id
       , c.external_context
       , c.reply_to
       , c.date

For optimal performance (excluding considering a covering index because of the * in the SELECT list), an index like this could be used to satisfy both the range scan operation and the order by (eliminating a Using filesort operation)

   ... ON app_comments (external_id, external_context, reply_to, date)

2) The second part of the result is the reply_to rows related to matching rows

  SELECT d.*
    FROM app_comments d
    JOIN app_comments e
      ON e.id = d.reply_to
   WHERE e.external_id       = '840774'
     AND e.external_context  = 'deals'
   ORDER
      BY d.reply_to
       , d.date

The same index recommended before can be used to accessing rows in e (range scan operation). Ideally, that index would also include the id column. Our best option is probably to modify the index to include id column following date

   ... ON app_comments (external_id, external_context, reply_to, date, id)

Or, for equivalent performance, at the expense of an extra index, we could define an index like this:

   ... ON app_comments (external_id, external_context, id)

For accessing rows from d with a range scan, we likely want an index:

   ... ON app_comments (reply_to, date)

We can combine the two sets with a UNION ALL set operator; but there's potential for the same row being returned by both queries. A UNION operator would force a unique sort to eliminate duplicate rows. Or we could add a condition to the second query to eliminate rows that will be returned by the first query.

  SELECT d.*
    FROM app_comments d
    JOIN app_comments e
      ON e.id = d.reply_to
   WHERE e.external_id       = '840774'
     AND e.external_context  = 'deals'
  HAVING NOT ( d.external_id      <=> '840774'
           AND d.external_context <=> 'deals'
             )
   ORDER
      BY d.reply_to
       , d.date

Combining the two parts, wrap each part in a set of parens add the UNION ALL set operator and an ORDER BY operator at the end (outside the parens), something like this:

(
  SELECT c.*
    FROM app_comments c
   WHERE c.external_id       = '840774'
     AND c.external_context  = 'deals'
   ORDER
      BY c.external_id
       , c.external_context
       , c.reply_to
       , c.date
)
UNION ALL
(
  SELECT d.*
    FROM app_comments d
    JOIN app_comments e
      ON e.id = d.reply_to
   WHERE e.external_id       = '840774'
     AND e.external_context  = 'deals'
  HAVING NOT ( d.external_id      <=> '840774'
           AND d.external_context <=> 'deals'
             )
   ORDER
      BY d.reply_to
       , d.date
)
ORDER BY `reply_to`, `date`

This will need a "Using filesort" operation over the combined set, but now we've got a really good shot at getting good execution plan for each part.


There's still my question of how many rows we should return when there are multiple matching reply_to rows.

spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • Skip the inner `ORDER BYs`; they are canceled out by the outer one. – Rick James Oct 21 '19 at 18:56
  • Is there any reason for having `HAVING` instead of folding the conditions into `WHERE`? – Rick James Oct 21 '19 at 18:57
  • I suggest the only useful indexes are `(external_id, external_context)` (in either order) and `(reply_to)`. – Rick James Oct 21 '19 at 19:01
  • @RickJames Either of `external_id` or `external_content` could be leading column. By including `id` column, that can avoid lookups to underlying data pages (to retrieve the value if `id`. If we eliminate the `ORDER BY` in the first SELECT, then there's no reason to include `reply_to` and `date` in the first proposed index. Similarly, if we remove the `ORDER BY` in the second SELECT, then there's no reason to include the `date` column in the index. But allowing the optimizer the option to use an index to avoid a Using filesort operation can be a performance improvement. – spencer7593 Oct 21 '19 at 19:09
  • Yes, tack on `e.id` (even though implicit in InnoDB); that would make it obvious that it "covers" `e`. – Rick James Oct 21 '19 at 19:15
  • The _outer_ `ORDER BY` requires a filesort. (Note: filesort _may_ be done in RAM.) My suggestion avoids any chance of filesorts for internal `SELECTs`. – Rick James Oct 21 '19 at 19:17
  • Thanks for the explanation, but I'm gonna go with Paul's answer since it avoids duplication, and I'm a big fan of DRY. – jurchiks Oct 22 '19 at 07:45
  • Versions of MySQL before 8.0 do not support CTE (Common Table Expression) `WITH` syntax. Similarly, earlier versions of MariaDB do not support CTE. Support for Common Table Expressions was on the list of requested enhancements for a long time. This answer first addressed the question that was asked, why MySQL wasn't making use of an index; and then extended into a technique for splitting the query into parts, suitable indexes for each part, and combining the result. – spencer7593 Oct 24 '19 at 23:15
2

MySQL (and MariaDB) cannot optimize OR conditions on different columns or tables. Note that in the context of the query plan c and reply_c are considered different tables. These queries are usually optimized "by hand" with UNION statements, which often contain a lot of code duplication. But in your case and with a quite recent version, which supports CTEs (Common Table Expressions) you can avoid most of it:

WITH p AS (
    SELECT *
    FROM app_comments
    WHERE external_id      = '840774'
      AND external_context = 'deals'
)
SELECT * FROM p
UNION DISTINCT
SELECT c.* FROM p JOIN app_comments c ON c.reply_to = p.id
ORDER BY reply_to ASC, date ASC

Good indices for this query would be a composite one on (external_id, external_context) (in any order) and a separate one on (reply_to).

You will though not avoid a "filesort", but that shouldn't be a problem, when the data are filtered to a small set.

Paul Spiegel
  • 30,925
  • 5
  • 44
  • 53
  • Note that using the `UNION` operation will eliminate duplicate rows, which is likely the actual specification. But this does differ from OP query which returns duplicate rows, multiple copies of a row when it is matched (joined) to more than one reply_to rows +10. – spencer7593 Oct 21 '19 at 21:02
  • 1
    @spencer7593 I'm assuming that `id` is primary key in the table. In the original query the primary key of the right table (`reply_c`) is used in the ON clause. That means there can only be at most one match. So no rows from the left table (`c`) are duplicated. In other words: A comment cannot be a reply to more than one other comment. – Paul Spiegel Oct 21 '19 at 23:27
  • 1
    with the guarantee that `id` is primary key or unique key, yes, I follow. Where the behavior of my solution relies on an assumption about uniqueness, I'm going to note that assumption, for the benefit of future readers. – spencer7593 Oct 22 '19 at 02:24
  • That looks like it'll do the trick, and a neat new thing to learn, thanks! – jurchiks Oct 22 '19 at 07:44
-2

However, the name index is not used for lookups in the following queries:

SELECT * FROM test
WHERE last_name='Jones' OR first_name='John';

enter link description here

snieguu
  • 2,073
  • 2
  • 20
  • 39