Subquery without additional column takes longer than with column

Question

I'm trying to get a running total using a Subquery. (I'm using Metabase, which doesn't seem to accept/process variables in queries)

My Query:

SELECT date_format(t.`session_stop`, '%d') AS `session_stop`, 
    sum(t.`energy_used` / 1000) AS `csum`,
    (
      SELECT (SUM(a.`energy_used`) / 1000)
      FROM `sessions` a 
      WHERE date_format(a.`session_stop`, '%Y-%m-%d') <=  date_format(t.`session_stop`, '%Y-%m-%d') 
      AND str_to_date(concat(date_format(a.`session_stop`, '%Y-%m'), '-01'), '%Y-%m-%d') = str_to_date(concat(date_format(now(), '%Y-%m'), '-01'), '%Y-%m-%d')
      ORDER BY str_to_date(date_format(a.`session_stop`, '%e'), '%d') ASC
    ) AS `sum`
    FROM `sessions` t
    WHERE str_to_date(concat(date_format(t.`session_stop`, '%Y-%m'), '-01'), '%Y-%m-%d') = str_to_date(concat(date_format(now(), '%Y-%m'), '-01'), '%Y-%m-%d')
    GROUP BY date_format(t.`session_stop`, '%e')
    ORDER BY str_to_date(date_format(t.`session_stop`, '%d'), '%d') ASC;

This takes about 1.29secs to run. (43K rows in total, returns 14)

If I remove the sum(t.`energy_used` / 1000) AS `csum`, line, the query takes up 8 mins and 40 secs.

Why is this? I'd rather not have that line, but I also can't wait 8mins for a query to process.

(I know I can create a cumulative column, but I'm especially interested why this additional sum() speeds the whole query up)

ps. tested this on both the MySQL console and the Metabase interface.

EXPLAIN query:

+----+--------------------+-------+------+---------------+------+---------+------+-------+---------------------------
| id | select_type        | table | type | possible_keys | key  | key_len | ref  | rows  | Extra
+----+--------------------+-------+------+---------------+------+---------+------+-------+---------------------------
|  1 | PRIMARY            | t     | ALL  | NULL          | NULL | NULL    | NULL | 42055 | Using where; Using tempora
|  2 | DEPENDENT SUBQUERY | a     | ALL  | NULL          | NULL | NULL    | NULL | 42055 | Using where
+----+--------------------+-------+------+---------------+------+---------+------+-------+---------------------------
2 rows in set (0.00 sec)

Without the extra sum():

+----+--------------------+-------+------+---------------+------+---------+------+-------+----------------------------------------------+
| id | select_type        | table | type | possible_keys | key  | key_len | ref  | rows  | Extra                                        |
+----+--------------------+-------+------+---------------+------+---------+------+-------+----------------------------------------------+
|  1 | PRIMARY            | t     | ALL  | NULL          | NULL | NULL    | NULL | 44976 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | a     | ALL  | NULL          | NULL | NULL    | NULL | 44976 | Using where                                  |
+----+--------------------+-------+------+---------------+------+---------+------+-------+----------------------------------------------+
2 rows in set (0.00 sec)

Schema is not much more than a table with:

session_id (INT, auto incr., prim.key) | session_stop (datetime) | energy_used (INT) |
 1                           | 1-1-2016 10:00:00       | 123456            |
 2                           | 1-1-2016 10:05:00       | 123456            |
 3                           | 1-2-2016 10:10:00       | 123456            |
 4                           | 1-2-2016 12:00:00       | 123456            |
 5                           | 3-3-2016 14:05:00       | 123456            |

Some examples on the internets show using the ID for the WHERE-clause, but I had some poor results with this.

a high-level review of running it thru `Explain` might help. Also, your schema. — Drew, Jul 14 '16 at 16:22
Forgot to add :( It's now there. (I assumed I could do something on indexes, but the results don't really show that) — puredevotion, Jul 14 '16 at 16:42
The explain output from the second one. Ok, if I have time, I will create a dataset and try to recreate it. This all takes time. You wanted info on the difference or reason, and you only gave 1 explain output. — Drew, Jul 15 '16 at 09:37
Added, but it's (as expected) the same. In the end, the same tables and records are used. (the difference in # of rows is due to an addition of new records this morning) — puredevotion, Jul 15 '16 at 11:04
"Schema is not much more than a table with:" but what are the indexes?? — e4c5, Jul 15 '16 at 12:45

e4c5 · Answer 1 · 2016-07-17T13:17:25.903

Your queries are not similar at all. In fact, they are poles apart.

If I remove the sum(t.energy_used / 1000) AS csum, line, the query takes up 8 mins and 40 secs.

When you use SUM, it's an aggregation. sum(t.energy_used/ 1000) will produce an entirely different result from just selecting t.energy_used that's why there is such a huge difference in the query timings.

It is also very unclear why you are comparing dates in this manner:

WHERE date_format(a.`session_stop`, '%Y-%m-%d') <=      date_format(t.`session_stop`, '%Y-%m-%d')

Why are you converting them both with date_format before comparision? Since both tables apparently contain the same data type, you should be able to do a.session_stop <= t.session_stop this will be much faster for both cases.

Since it's an inequality comparison, it's not a good candidate for indexes but you can still try creating an index on that column to see if it has any effect.

So to recap, the performance difference is because you are not merely adding/removing an extra column but adding/removing an aggregation.

Subquery without additional column takes longer than with column

1 Answers1