-1

I'm trying to get a running total using a Subquery. (I'm using Metabase, which doesn't seem to accept/process variables in queries)

My Query:

SELECT date_format(t.`session_stop`, '%d') AS `session_stop`, 
    sum(t.`energy_used` / 1000) AS `csum`,
    (
      SELECT (SUM(a.`energy_used`) / 1000)
      FROM `sessions` a 
      WHERE date_format(a.`session_stop`, '%Y-%m-%d') <=  date_format(t.`session_stop`, '%Y-%m-%d') 
      AND str_to_date(concat(date_format(a.`session_stop`, '%Y-%m'), '-01'), '%Y-%m-%d') = str_to_date(concat(date_format(now(), '%Y-%m'), '-01'), '%Y-%m-%d')
      ORDER BY str_to_date(date_format(a.`session_stop`, '%e'), '%d') ASC
    ) AS `sum`
    FROM `sessions` t
    WHERE str_to_date(concat(date_format(t.`session_stop`, '%Y-%m'), '-01'), '%Y-%m-%d') = str_to_date(concat(date_format(now(), '%Y-%m'), '-01'), '%Y-%m-%d')
    GROUP BY date_format(t.`session_stop`, '%e')
    ORDER BY str_to_date(date_format(t.`session_stop`, '%d'), '%d') ASC;

This takes about 1.29secs to run. (43K rows in total, returns 14)

If I remove the sum(t.`energy_used` / 1000) AS `csum`, line, the query takes up 8 mins and 40 secs.

Why is this? I'd rather not have that line, but I also can't wait 8mins for a query to process.

(I know I can create a cumulative column, but I'm especially interested why this additional sum() speeds the whole query up)

ps. tested this on both the MySQL console and the Metabase interface.

EXPLAIN query:

+----+--------------------+-------+------+---------------+------+---------+------+-------+---------------------------
| id | select_type        | table | type | possible_keys | key  | key_len | ref  | rows  | Extra
+----+--------------------+-------+------+---------------+------+---------+------+-------+---------------------------
|  1 | PRIMARY            | t     | ALL  | NULL          | NULL | NULL    | NULL | 42055 | Using where; Using tempora
|  2 | DEPENDENT SUBQUERY | a     | ALL  | NULL          | NULL | NULL    | NULL | 42055 | Using where
+----+--------------------+-------+------+---------------+------+---------+------+-------+---------------------------
2 rows in set (0.00 sec)

Without the extra sum():

+----+--------------------+-------+------+---------------+------+---------+------+-------+----------------------------------------------+
| id | select_type        | table | type | possible_keys | key  | key_len | ref  | rows  | Extra                                        |
+----+--------------------+-------+------+---------------+------+---------+------+-------+----------------------------------------------+
|  1 | PRIMARY            | t     | ALL  | NULL          | NULL | NULL    | NULL | 44976 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | a     | ALL  | NULL          | NULL | NULL    | NULL | 44976 | Using where                                  |
+----+--------------------+-------+------+---------------+------+---------+------+-------+----------------------------------------------+
2 rows in set (0.00 sec)

Schema is not much more than a table with:

session_id (INT, auto incr., prim.key) | session_stop (datetime) | energy_used (INT) |
 1                           | 1-1-2016 10:00:00       | 123456            |
 2                           | 1-1-2016 10:05:00       | 123456            |
 3                           | 1-2-2016 10:10:00       | 123456            |
 4                           | 1-2-2016 12:00:00       | 123456            |
 5                           | 3-3-2016 14:05:00       | 123456            |

Some examples on the internets show using the ID for the WHERE-clause, but I had some poor results with this.

puredevotion
  • 1,135
  • 1
  • 11
  • 27

1 Answers1

1

Your queries are not similar at all. In fact, they are poles apart.

If I remove the sum(t.energy_used / 1000) AS csum, line, the query takes up 8 mins and 40 secs.

When you use SUM, it's an aggregation. sum(t.energy_used/ 1000) will produce an entirely different result from just selecting t.energy_used that's why there is such a huge difference in the query timings.

It is also very unclear why you are comparing dates in this manner:

WHERE date_format(a.`session_stop`, '%Y-%m-%d') <=      date_format(t.`session_stop`, '%Y-%m-%d') 

Why are you converting them both with date_format before comparision? Since both tables apparently contain the same data type, you should be able to do a.session_stop <= t.session_stop this will be much faster for both cases.

Since it's an inequality comparison, it's not a good candidate for indexes but you can still try creating an index on that column to see if it has any effect.

So to recap, the performance difference is because you are not merely adding/removing an extra column but adding/removing an aggregation.

e4c5
  • 52,766
  • 11
  • 101
  • 134