MySQL: select top 10 and aggregate others

Question

I have a monitoring app that stores metrics in MySQL DB and uses RRDtool to draw time series graphs. Tables in the DB have 3 columns: timestamp, RRDKeyID and value, ie. there's a table with data like:

ts1 | user2 | some_value
ts1 | user4 | some_value
ts1 | user5 | some_value
ts1 | user7 | some_value
ts2 | user1 | some_value
ts2 | user2 | some_value
ts2 | user3 | some_value
ts2 | user5 | some_value
ts2 | user8 | some_value
ts3 | user3 | some_value
ts3 | user4 | some_value
ts3 | user5 | some_value
ts3 | user7 | some_value
...

Data is collected in 1-minute intervals and users (as you know them) connect freely to the system and generate metrics, so at any given moment there is a number of users in the system that is lower than the total. The total number of users is about 1k and there's always a few hundreds of them online, which then means that there are a few hundreds rows with the same timestamp in the metric table. When I generate a graph I do not want to show the series for every single user, because then the graph is unreadable due to canvas size too small, limited color pallette, legend too long etc. That is why I instead generate a graph where only top 10 users by metric value are shown individually and all others are aggregated into one black area. This is how I do it:

:1: I get top 10 from the table:

select RRDKeyID, avg(value) as avg
from metric_table
where ts between 't1' and 't2'
group by RRDKeyID
order by avg desc
limit 10

:2: Then in Perl I generate RRDtool command that draws the graph where every user (RRDKeyID) has own DEF, but only DEFs of users from the SQL query results are plotted directly, all the rest is aggregated:

CDEF:others=0,user11,ADDNAN,user12,ADDNAN,user13,ADDNAN,...,userN

and then "AREA:others#000000:OTHERS:STACK" is displayed on the graph.

Now I'm working on the next version of the app where RRDtool will be replaced by D3.js due to changes in the DB schema, so I need another way to generate the data of my top10 + others. I don't want to send raw data to the client and do the processing on their side, because the data sets may be very big and processing may take much time and CPU, so I'd rather do it on the server side. I suspect that it may be possible to get the results I want in the following format:

 ts   | user   | value 
------+--------+-------
 t1   | u1     | v 
 t1+1 | u1     | v 
 t1+2 | u1     | v 
...
 t2   | u1     | v 
 t1   | u2     | v 
...
 t2   | u2     | v 
 t1   | u3     | v 
...
 t2   | u3     | v 
...
...
 t1   | u10    | v 
...
 t2   | u10    | v 
 t1   | others | v 
...
 t2   | others | v

at one go (order not important) using a nested query where there is a subquery that selects user names (RRDKeyIDs) and avg(value) over a given range of time, and then the top query prints the results for user names if they are found in the subquery results or otherwise add them to others. I don't know exactly how to express this idea in SQL, so I'll be grateful if someone could suggest a solution.

In Step 1, don't you want `ORDER BY avg(value) DESC` ? – Rick James Mar 23 '15 at 18:05 — Rick James, Mar 23 '15 at 18:05
@RickJames Of course, my bad. Edit done. – mac13k Mar 24 '15 at 04:49 — mac13k, Mar 24 '15 at 04:49

score 1 · Answer 1 · answered Mar 23 '15 at 18:21

1

How about enumerating the values in MySQL and then re-aggregating?

select (case when rn <= 10 then RRDKeyId end) as RRDKeyId,
       sum(sumv) / sum(cnt) as avgvalue
from (select (@rn := @rn + 1) as rn, mt.*
      from (select RRDKeyID, sum(value) as sumv, count(*) as cnt
            from metric_table
            where ts between 't1' and 't2'
            group by RRDKeyID 
            order by avg(value) desc
           ) mt cross join
           (select @rn := 0) vars
      ) mt
group by (case when rn <= 10 then RRDKeyId end);

This uses NULL for the extra group. You can put in another name if you like.

answered Mar 23 '15 at 18:21

Gordon Linoff

1,242,037
58
646
786

That query almost does it. I'm not sure if I was clear enough, but I actually wanted the time series for each of the top 10 and others in the results, so there should be (ts, rrdkeyid, value) in each row. The query you provided can be definitely used to determine whether the key is going to have its own series or land in the others, but what about the query that produces the series? – mac13k Mar 24 '15 at 15:20
@mac13k . . . I assume it is some form of conditional aggregation, but I look over the question and don't know what form you want the result in. – Gordon Linoff Mar 24 '15 at 19:41

score 0 · Answer 2 · answered Mar 23 '15 at 18:14

As I understand you, you want to

Discover the top 10
Get data to plot them
Get average the rest, to make an 11th plot.

Something like:

$list = 
SELECT  GROUP_CONCAT(RRDKeyID)
    FROM  
      ( SELECT  RRDKeyID, AVG(value)
            FROM  metric_table
            WHERE  ts between 't1' AND 't2'
            group by  RRDKeyID
            order by  AVG(value) desc
            limit  10 
      ) x; 
SELECT  ...              WHERE  RRDKeyID     IN ( $list ) ...; -- top 10
SELECT  ... AVG(...) ... WHERE  RRDKeyID NOT IN ( $list ) ...; -- the rest

It would be hard to combine those three statements into a single query. The IN would turn into a JOIN / LEFT JOIN, and the original query would need to be repeated.

MySQL: select top 10 and aggregate others

2 Answers2