I have a monitoring app that stores metrics in MySQL DB and uses RRDtool to draw time series graphs. Tables in the DB have 3 columns: timestamp, RRDKeyID and value, ie. there's a table with data like:
ts1 | user2 | some_value
ts1 | user4 | some_value
ts1 | user5 | some_value
ts1 | user7 | some_value
ts2 | user1 | some_value
ts2 | user2 | some_value
ts2 | user3 | some_value
ts2 | user5 | some_value
ts2 | user8 | some_value
ts3 | user3 | some_value
ts3 | user4 | some_value
ts3 | user5 | some_value
ts3 | user7 | some_value
...
Data is collected in 1-minute intervals and users (as you know them) connect freely to the system and generate metrics, so at any given moment there is a number of users in the system that is lower than the total. The total number of users is about 1k and there's always a few hundreds of them online, which then means that there are a few hundreds rows with the same timestamp in the metric table. When I generate a graph I do not want to show the series for every single user, because then the graph is unreadable due to canvas size too small, limited color pallette, legend too long etc. That is why I instead generate a graph where only top 10 users by metric value are shown individually and all others are aggregated into one black area. This is how I do it:
:1: I get top 10 from the table:
select RRDKeyID, avg(value) as avg
from metric_table
where ts between 't1' and 't2'
group by RRDKeyID
order by avg desc
limit 10
:2: Then in Perl I generate RRDtool command that draws the graph where every user (RRDKeyID) has own DEF, but only DEFs of users from the SQL query results are plotted directly, all the rest is aggregated:
CDEF:others=0,user11,ADDNAN,user12,ADDNAN,user13,ADDNAN,...,userN
and then "AREA:others#000000:OTHERS:STACK"
is displayed on the graph.
Now I'm working on the next version of the app where RRDtool will be replaced by D3.js due to changes in the DB schema, so I need another way to generate the data of my top10 + others. I don't want to send raw data to the client and do the processing on their side, because the data sets may be very big and processing may take much time and CPU, so I'd rather do it on the server side. I suspect that it may be possible to get the results I want in the following format:
ts | user | value
------+--------+-------
t1 | u1 | v
t1+1 | u1 | v
t1+2 | u1 | v
...
t2 | u1 | v
t1 | u2 | v
...
t2 | u2 | v
t1 | u3 | v
...
t2 | u3 | v
...
...
t1 | u10 | v
...
t2 | u10 | v
t1 | others | v
...
t2 | others | v
at one go (order not important) using a nested query where there is a subquery that selects user names (RRDKeyIDs) and avg(value) over a given range of time, and then the top query prints the results for user names if they are found in the subquery results or otherwise add them to others. I don't know exactly how to express this idea in SQL, so I'll be grateful if someone could suggest a solution.