0

I have a MYSQL query of this form:

SELECT
    employee.name,
    totalpayments.totalpaid
FROM
    employee
    JOIN (
        SELECT
            paychecks.employee_id,
            SUM(paychecks.amount) totalpaid
        FROM
            paychecks
        GROUP BY
            paychecks.employee_id
         ) totalpayments on totalpayments.employee_id = employee.id

I've recently found that this returns MUCH faster in this form:

SELECT
    employee.name,
    (
        SELECT
            SUM(paychecks.amount)
        FROM
            paychecks
        WHERE
            paychecks.employee_id = employee.id
    ) totalpaid
FROM
    employee

It surprises me that there would be a difference in speed, and that the lower query would be faster. I prefer the upper form for development, because I can run the subquery independently.

Is there a way to get the "best of both worlds": speedy results return AND being able to run the subquery in isolation?

2 Answers2

0

Likely, the correlated subquery is able to make effective use of an index, which is why it's fast, even though that subquery has to be executed multiple times.

For the first query with the inline view, that causing MySQL to create a derived table, and for large sets, that's effectively a MyISAM table.

In MySQL 5.6.x and later, the optimizer may choose to add an index on the derived table, if that would allow a ref operation and the estimated cost of the ref operation is lower than the nested loops scan.

I recommend you try using EXPLAIN to see the access plan. (Based on your report of performance, I suspect you are running on MySQL version 5.5 or earlier.)


The two statements are not entirely equivalent, in the case where there are rows in employees for which there are no matching rows in paychecks.

An equivalent result could be obtained entirely avoiding a subquery:

SELECT e.name
     , SUM(p.amount) AS total_paid
  FROM employee e 
  JOIN paychecks p
    ON p.employee_id = e.id
 GROUP BY e.id

(Use an inner join to get a result equivalent to the first query, use a LEFT outer join to be equivalent to the second query. Wrap the SUM() aggregate in an IFNULL function if you want to return a zero rather than a NULL value when no matching row with a non-null value of amount is found in paychecks.)

spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • Thanks so much, spencer7593! Can I try forcing an index onto the derived table? (I am not very familiar with indices or the syntax for creating them.) – Yossi Fendel Feb 11 '15 at 17:26
  • @YossiFendel: I don't believe there's any hint for creating an index on a derived table. That's only going to happen in version 5.6 and later. The EXPLAIN output should show what kind of join operation is being used. The best performance is likely to be obtained by entirely avoiding a subquery. I've provided an example in an edit to my answer. – spencer7593 Feb 11 '15 at 17:31
  • @YossiFendel: Another option would be create a temporary table (with an index) as a result of the subquery, and then reference the temporary table in a second query. That's a hassle, but it can improve performance, especially when that same inline view will be referenced in multiple queries... because we avoid materializing it multiple times. – spencer7593 Feb 11 '15 at 17:36
  • thanks for your continuing help. For the purposes of this forum question I simplified the query into something that seemed easier to follow. I wish I could avoid subqueries entirely, but in this case (there are several of these aggregations) it isn't possible. You're probably right that indices are speeding things up in the bottom version. I probably need to learn more about how to read EXPLAIN outputs before going further with this. – Yossi Fendel Feb 11 '15 at 20:55
  • @YossiFendel: For MySQL 5.5, [8.8 Understanding the Query Execution Plan](http://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html). (There are changes with each release (5.1, 5.5, 5,6) so be sure to look at the documentation for the right version. The EXPLAIN output doesn't really make sense unless you understand the operations available to MySQL, and how those are represented in the EXPLAIN output.) – spencer7593 Feb 11 '15 at 22:06
0

Join is basically Cartesian product that means all the records of table A will be combined with all the records of table B. The output will be

number of records of table A * number of records of table b =rows in the new table
10 * 10 = 100

and out of those 100 records, the ones that match the filters will be returned in the query.

In the nested queries, there is a sample inner query and whatever is the total size of records of the inner query will be the input to the outter query that is why nested queries are faster than joins.

Danyal Sandeelo
  • 12,196
  • 10
  • 47
  • 78
  • 1
    Yes, sometimes, a correlated subquery in the SELECT list is faster than a join operation, but that's not usually the case. The description of a join operation creating a Cartesian product (of m * n rows), and then filtering out rows is not an entirely accurate description of how the JOIN operations actually operate. To actually get the behavior you describe, you'd need to write something that causes a Cartesian product to be created, for example: `SELECT a.id, b.id FROM a JOIN b HAVING a.id = b.id`. But with the join predicates in a `WHERE` clause or `ON` clause, that just doesn't happen. – spencer7593 Feb 11 '15 at 17:43