With the equality predicates on external_id
and external_context
columns in the WHERE clause, MySQL could make effective use of an index... when those predicates specify the subset of rows that can possibly satisfy the query.
But with the OR
added to the WHERE
clause, now the rows to be returned from c
are not limited by external_id
and external_content
values. It's now possible that rows with other values of those columns could be returned; rows with any values of those columns.
And that negates the big benefit of using an index range scan operation... very quickly eliminating vast swaths of rows from being considered. Yes, an index range scan is used to quickly locate rows. That is true. But the meat of the matter is that the range scan operation uses the index to quickly to bypass millions and millions of rows that can't possibly be returned.
This is not behavior specific to MariaDB 10.3. We are going to observe the same behavior in MariaDB 10.2, MySQL 5.7, MySQL 5.6.
I'm questioning the join operation: Is it necessary to return multiple copies of rows from c
when there are multiple matching rows from reply_c
? Or is the specification to just return distinct rows from c
?
We can look at the required resultset as two parts.
1) the rows from app_contents
with equality predicates on external_id
and external_context
SELECT c.*
FROM app_comments c
WHERE c.external_id = '840774'
AND c.external_context = 'deals'
ORDER
BY c.external_id
, c.external_context
, c.reply_to
, c.date
For optimal performance (excluding considering a covering index because of the *
in the SELECT list), an index like this could be used to satisfy both the range scan operation and the order by (eliminating a Using filesort operation)
... ON app_comments (external_id, external_context, reply_to, date)
2) The second part of the result is the reply_to
rows related to matching rows
SELECT d.*
FROM app_comments d
JOIN app_comments e
ON e.id = d.reply_to
WHERE e.external_id = '840774'
AND e.external_context = 'deals'
ORDER
BY d.reply_to
, d.date
The same index recommended before can be used to accessing rows in e
(range scan operation). Ideally, that index would also include the id
column. Our best option is probably to modify the index to include id
column following date
... ON app_comments (external_id, external_context, reply_to, date, id)
Or, for equivalent performance, at the expense of an extra index, we could define an index like this:
... ON app_comments (external_id, external_context, id)
For accessing rows from d
with a range scan, we likely want an index:
... ON app_comments (reply_to, date)
We can combine the two sets with a UNION ALL
set operator; but there's potential for the same row being returned by both queries. A UNION
operator would force a unique sort to eliminate duplicate rows. Or we could add a condition to the second query to eliminate rows that will be returned by the first query.
SELECT d.*
FROM app_comments d
JOIN app_comments e
ON e.id = d.reply_to
WHERE e.external_id = '840774'
AND e.external_context = 'deals'
HAVING NOT ( d.external_id <=> '840774'
AND d.external_context <=> 'deals'
)
ORDER
BY d.reply_to
, d.date
Combining the two parts, wrap each part in a set of parens add the UNION ALL set operator and an ORDER BY operator at the end (outside the parens), something like this:
(
SELECT c.*
FROM app_comments c
WHERE c.external_id = '840774'
AND c.external_context = 'deals'
ORDER
BY c.external_id
, c.external_context
, c.reply_to
, c.date
)
UNION ALL
(
SELECT d.*
FROM app_comments d
JOIN app_comments e
ON e.id = d.reply_to
WHERE e.external_id = '840774'
AND e.external_context = 'deals'
HAVING NOT ( d.external_id <=> '840774'
AND d.external_context <=> 'deals'
)
ORDER
BY d.reply_to
, d.date
)
ORDER BY `reply_to`, `date`
This will need a "Using filesort" operation over the combined set, but now we've got a really good shot at getting good execution plan for each part.
There's still my question of how many rows we should return when there are multiple matching reply_to rows.