
I have a database table looking like the following:

id | macaddr | load | timestamp
=========================================
 1 | 0011111 |   17 | 2012-02-07 10:00:00
 2 | 0011111 |    6 | 2012-02-07 12:00:00
 3 | 0022222 |    3 | 2012-02-07 12:00:03
 4 | 0033333 |    9 | 2012-02-07 12:00:04
 5 | 0022222 |    4 | 2012-02-07 12:00:06
 6 | 0033333 |    8 | 2012-02-07 12:00:10
...

Now, I would like to calculate the average load over all devices (= mac addresses) for different time periods, e.g., for today, yesterday, this week, this month.

An average load can be calculated by first finding the total load at several points in time (sample dates) and then averaging those load sums. For example, if I wanted the average load of the last ten seconds (and now is 2012-02-07 12:00:10), I could choose my sample dates to be 12:00:02, 12:00:04, 12:00:06, 12:00:08, and 12:00:10. I would then calculate the load sums by adding up the most recent load value of each device:

2012-02-07 12:00:02 |  6  [= load(id=2)]
2012-02-07 12:00:04 | 18  [= load(id=2) + load(id=3) + load(id=4)]
2012-02-07 12:00:06 | 19  [= load(id=2) + load(id=4) + load(id=5)]
2012-02-07 12:00:08 | 19  [= load(id=2) + load(id=4) + load(id=5)]
2012-02-07 12:00:10 | 18  [= load(id=2) + load(id=5) + load(id=6)]

A device's load value is ignored if it is older than, say, an hour (as happens here to id=1). The average in this case would be (6 + 18 + 19 + 19 + 18) / 5 = 16.
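For illustration, the load sum at a single sample date can be computed with, e.g., a groupwise-maximum query. This is only a sketch against the table above; the WHERE clause is the one-hour window ending at the sample date 2012-02-07 12:00:10:

SELECT SUM(s.`load`) AS load_sum
FROM sensor_data s
JOIN (
    -- most recent reading per device within the one-hour window
    SELECT macaddr, MAX(`timestamp`) AS latest
    FROM sensor_data
    WHERE `timestamp` > '2012-02-07 11:00:10' AND `timestamp` <= '2012-02-07 12:00:10'
    GROUP BY macaddr
) m ON m.macaddr = s.macaddr AND m.latest = s.`timestamp`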

Currently, I generate a rather complex query with many UNION ALL statements, which is really slow:

SELECT avg(l.load_sum) AS avg_load
FROM (
    -- one subquery like this per sample date:
    SELECT sum(so.load) AS load_sum
    FROM (
        SELECT *
        FROM (
            -- most recent load per device in the hour before the sample date
            -- (relies on MySQL returning the first row per group after the ordered subquery)
            SELECT si.macaddr, si.load
            FROM sensor_data si
            WHERE si.timestamp > '2012-02-07 11:00:10' AND si.timestamp < '2012-02-07 12:00:10'
            ORDER BY si.timestamp DESC
        ) AS sm
        GROUP BY macaddr
    ) so
    UNION ALL
    [THE SAME THING AGAIN WITH OTHER TIMESTAMPS]
    UNION ALL
    [AND AGAIN]
    UNION ALL
    [AND AGAIN]
    ...
) l

Now imagine I would like to calculate the average load for a whole month. With hourly sample dates, I need to combine 30×24 = 720 subqueries with UNION ALL. The overall query takes nearly a minute to complete on my machine. I am sure there is a much better solution without UNION ALL, but I did not find anything useful on the Web. I would therefore be very thankful for your help!

mikey

3 Answers


Work with the Unix timestamp divided into fixed-length buckets. First we formulate the hourly (3600-second) averages:

SELECT
  macaddr,
  AVG(`load`) AS loadavg,
  FLOOR(UNIX_TIMESTAMP(`timestamp`)/3600) AS hourbase
FROM sensor_data
GROUP BY macaddr, FLOOR(UNIX_TIMESTAMP(`timestamp`)/3600)

Then we average those hourlies over the month:

SELECT
  macaddr,
  AVG(loadavg) AS monthlyavg
FROM (
    SELECT
      macaddr,
      AVG(`load`) AS loadavg,
      FLOOR(UNIX_TIMESTAMP(`timestamp`)/3600) AS hourbase
    FROM sensor_data
    WHERE `timestamp` BETWEEN '2012-01-07 12:00:00' AND '2012-02-07 11:59:59'
    GROUP BY macaddr, FLOOR(UNIX_TIMESTAMP(`timestamp`)/3600)
) AS hourlies
GROUP BY macaddr
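If what you actually need is the average of the per-hour load sums over all devices (rather than per-device monthly averages), the same hour buckets can be nested one level deeper. An untested sketch; the subquery aliases (per_device, per_hour, hour_sum) are only illustrative:

SELECT AVG(hour_sum) AS monthly_avg_load
FROM (
    -- total load over all devices within each hour bucket
    SELECT hourbase, SUM(loadavg) AS hour_sum
    FROM (
        -- per-device average load within each hour bucket
        SELECT
          macaddr,
          AVG(`load`) AS loadavg,
          FLOOR(UNIX_TIMESTAMP(`timestamp`)/3600) AS hourbase
        FROM sensor_data
        WHERE `timestamp` BETWEEN '2012-01-07 12:00:00' AND '2012-02-07 11:59:59'
        GROUP BY macaddr, hourbase
    ) AS per_device
    GROUP BY hourbase
) AS per_hour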
Eugen Rieck
  • Hi Eugen! Thanks for your post. If I understand it right, you first calculate the hourly average load of each device, then calculate the monthly average load of each device based on the hourlies. What I need is the average of the load sums: every hour I want to sum up the most recent loads of all devices (every second would be even better, just too costly) and then calculate the average of all these hourly load sums for one month. – mikey Feb 08 '12 at 09:28

To make things easier for yourself, you should create an "hour" function that returns a datetime truncated to the hour, i.e., with nothing significant after the hour part. So right now (2012-02-07 5:05 PM) would become 2012-02-07 17:00. Here's the code for your hour function:

select dateadd(hh, datepart(hh, current_timestamp), dateadd(dd, 0, datediff(dd, 0, current_timestamp)))

(Replace current_timestamp in the above code with the datetime parameter of your hour function. I'll assume you created it as dbo.fnHour() and that it takes a datetime parameter.)
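For reference, the full function might look like this (an untested T-SQL sketch that just wraps the expression above; the parameter name @dt is arbitrary):

CREATE FUNCTION dbo.fnHour (@dt datetime)
RETURNS datetime
AS
BEGIN
    -- floor @dt to midnight, then add back the hour component
    RETURN dateadd(hh, datepart(hh, @dt), dateadd(dd, 0, datediff(dd, 0, @dt)));
END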

You can then use dbo.fnHour() as a grouping function to query what you want. Your SQL will look something like this:

select
    avg([load]) as avg_load
from (
    -- per-device load sum for each hour of the last month
    select dbo.fnHour(si.timestamp) as [hour], macaddr, sum(load) as [load]
    from
        sensor_data si
    where
        si.timestamp >= dateadd(mm, -1, current_timestamp)
    group by
        dbo.fnHour(si.timestamp), macaddr
) as f

I haven't tested it, so there may be some typos, but this should be enough to get you going.

Giscard Biamby
  • To use sample windows other than hours, just create similar functions, e.g., dbo.fnWeek(), fnDay(), fnMinute(), fn10MinuteStep(). – Giscard Biamby Feb 07 '12 at 22:13

I may be misunderstanding what you are trying to do, but it looks like you are making things a lot more complicated than they need to be with the sampling. Perhaps showing what the result should look like would allow people to provide better solutions for your particular case.

mysql> SELECT * FROM `test`;
+----+-----+------+------------+
| id | mac | load | when       |
+----+-----+------+------------+
|  1 |   1 |   10 | 2012-02-01 |
|  2 |   1 |   20 | 2012-01-01 |
|  3 |   2 |   60 | 2011-09-01 |
+----+-----+------+------------+

mysql> SELECT avg(`sum_load`)
    -> FROM 
    -> (
    ->    SELECT sum( `load` ) as sum_load
    ->    FROM `test`
    ->    WHERE `when` > '2011-01-15'
    ->    GROUP BY `mac`
    -> ) as t1;
+-----------------+
| avg(`sum_load`) |
+-----------------+
|         45.0000 |
+-----------------+

mysql> SELECT avg(`sum_load`)
    -> FROM 
    -> (
    ->    SELECT sum( `load` ) as sum_load
    ->    FROM `test`
    ->    WHERE `when` > '2011-01-15' AND `when` < '2012-01-15'
    ->    GROUP BY `mac`
    -> ) as t1;
+-----------------+
| avg(`sum_load`) |
+-----------------+
|         40.0000 |
+-----------------+
evan
  • I understand the original question to mean that the sample rate is not constant, so a "normal" `avg()` will overrepresent times with dense recording. Pregrouping into fixed-length samples filters this out. – Eugen Rieck Feb 07 '12 at 22:18
  • @EugenRieck Could you explain that in a bit more depth? I'm not sure I completely understand. Do you mean that sometimes entries will be erroneous duplicates? – evan Feb 07 '12 at 22:23
  • Assume you have 1 row each for the hours 0-1,1-2,...22-23 of a day, all with load=7. You have 23 entries for the hour 23-0 of the same day, all with load=13. A "normal" avg() would come up with 10. If you pregroup by hour, you have 23 groups with avg=7 and 1 group with avg=13. The average of these groups=7.25. The latter value is what you want. – Eugen Rieck Feb 07 '12 at 22:33
  • Thanks Evan. What Eugen said is right. Just one thing, @EugenRieck: by sample rate I mean a rate that I define; in this case, it would be one sample per hour. What you mean is the update rate of the devices, which is, as you said, unknown and may vary (devices offline, jitter, ...). – mikey Feb 08 '12 at 09:52