0

I have a table recording the start time and end time of events of interest:

CREATE TABLE event_log (start_time DATETIME, end_time DATETIME);
INSERT INTO event_log VALUES ("2013-06-03 09:00:00","2013-06-03 09:00:05"), ("2013-06-03 09:00:03","2013-06-03 09:00:07"), ("2013-06-03 09:00:10","2013-06-03 09:00:12");

+---------------------+---------------------+
| start_time          | end_time            |
+---------------------+---------------------+
| 2013-06-03 09:00:00 | 2013-06-03 09:00:05 |
| 2013-06-03 09:00:03 | 2013-06-03 09:00:07 |
| 2013-06-03 09:00:10 | 2013-06-03 09:00:12 |
+---------------------+---------------------+

I am looking for a way to create a "time series" table where one column is a time index and another column is the count of events in progress at that time. I can do it with a subquery and a generator:

SET @first_time := (SELECT MIN(start_time) FROM event_log);
SET @last_time := (SELECT MAX(end_time) FROM event_log);

CREATE OR REPLACE VIEW generator_16
AS SELECT 0 n UNION ALL SELECT 1  UNION ALL SELECT 2  UNION ALL 
   SELECT 3   UNION ALL SELECT 4  UNION ALL SELECT 5  UNION ALL
   SELECT 6   UNION ALL SELECT 7  UNION ALL SELECT 8  UNION ALL
   SELECT 9   UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL
   SELECT 12  UNION ALL SELECT 13 UNION ALL SELECT 14 UNION ALL 
   SELECT 15;

CREATE TABLE time_series (t DATETIME, event_count INT(11))
SELECT @first_time + INTERVAL n SECOND t, NULL AS event_count
  FROM generator_16
  WHERE @first_time + INTERVAL n SECOND <= @last_time;

UPDATE time_series 
  SET event_count= (SELECT COUNT(*) FROM event_log 
  WHERE start_time<=t AND end_time>=t);

+---------------------+-------------+
| t                   | event_count |
+---------------------+-------------+
| 2013-06-03 09:00:00 |           1 |
| 2013-06-03 09:00:01 |           1 |
| 2013-06-03 09:00:02 |           1 |
| 2013-06-03 09:00:03 |           2 |
| 2013-06-03 09:00:04 |           2 |
| 2013-06-03 09:00:05 |           2 |
| 2013-06-03 09:00:06 |           1 |
| 2013-06-03 09:00:07 |           1 |
| 2013-06-03 09:00:08 |           0 |
| 2013-06-03 09:00:09 |           0 |
| 2013-06-03 09:00:10 |           1 |
| 2013-06-03 09:00:11 |           1 |
| 2013-06-03 09:00:12 |           1 |
+---------------------+-------------+

Is there a more efficient way to do it? This method requires a subquery for every time index. Would there, for example, be a way to do it that requires one subquery per "event_log" record? My real problem has 500k time index entries and 1k events; it's taking a little longer than I would like (about 90 seconds).

The "generator" snippet came from http://use-the-index-luke.com/blog/2011-07-30/mysql-row-generator . Clearly one of the larger generators, like the 64k version or the 1M version, would be needed for larger problems.

andrewtinka
  • 593
  • 4
  • 10

1 Answers1

0

The only changes happen at start_time and end_time. So, if you were to

select distinct start_time As time_point from event_log 
UNION 
select distinct   end_time As time_point from event_log

... that would give you all the "points" at which you need a snapshot.

If you create that in a temporary table (say TEMP_POINTS), and join if back to event_log, you should be able to count the number of events at each "point".

CREATE TABLE NON_ZERO_POINTS (t DATETIME, event_count INT(11))
    select time_point, count(*)
    from TEMP_POINTS 
    join event_log on time_point between start_time and end_time
    group by time_point

Might be worth creating an index on NON_ZERO_POINTS

Then, you could use NON_ZERO_POINTS in your update thus:

UPDATE time_series 
SET event_count= (SELECT event_count FROM NON_ZERO_POINTS
WHERE t=time_point);

Also, do you need to update time_series? If not, you could just use it in a query:

select t, coalesce(event_count)
from time_series 
left join FROM NON_ZERO_POINTS
on t=time_point
Darius X.
  • 2,886
  • 4
  • 24
  • 51
  • Makes sense so far. Do you have any suggestions about the second part (filling in the missing time points?) I can't figure out any way to do it that isn't as hard as my original method (ie. requiring a subquery for every missing time point). – andrewtinka Jun 04 '13 at 12:02
  • Does that part need to be in non-procedural SQL? If not, looping over the result and sending out more rows than come in would be the way to go. If it has to be non-procedural SQL, perhaps TEMP_POINTS should start off by including every possible point. (For instance a table that contains 0 thru 59 can be cross-joined to itself to generate all 3600 possible mm:ss possibilities. – Darius X. Jun 04 '13 at 15:21
  • It does need to be in non-procedural SQL. I implemented the "join" approach that you suggested in such a way as to produce values at all times (added it to your answer). While this approach uses a "join" instead of a subquery, it unfortunately doesn't seem to speed things up at all. I tried putting various indexes on the "event log" table and it's still slow. – andrewtinka Jun 10 '13 at 19:21
  • I don't know if it would help, but perhaps you can use my earlier apporoach to summarize the event_log, but still use your generator after that. I updated the answer to show the possibilities. All the best. – Darius X. Jun 10 '13 at 20:34