0

I have a call log for my Widget sales staff. Each click in a customer record is logged. Staff may access the same customer account several times a day, so there could be dozens of consecutive clicks grouped together under the same recordID at different times of the day.

Example:

recordID    userID  date_event
33450   321     2013-06-20 16:22:02
33450   321     2013-06-20 16:22:02
33450   321     2013-06-20 16:22:24
33450   321     2013-06-20 16:22:24
22222   321     2013-06-20 16:22:53
22222   321     2013-06-20 16:22:54
12345   321     2013-06-20 16:23:43
12345   321     2013-06-20 16:23:44
12345   321     2013-06-20 16:24:00
12345   321     2013-06-20 16:24:05
12345   321     2013-06-20 16:24:05
12345   321     2013-06-20 18:16:09
12345   321     2013-06-20 18:16:09
33450   321     2013-06-20 18:33:24
33450   321     2013-06-20 18:35:11
33450   321     2013-06-20 18:36:55
12345   321     2013-06-20 19:01:14
98765   321     2013-06-20 19:02:43

In the data set above, I would have 6 groups of access.

        first               last                    duration(seconds)
33450   2013-06-20 16:22:02 2013-06-20 16:22:24     22
22222   2013-06-20 16:22:30 2013-06-20 16:22:54     24
12345   2013-06-20 16:23:43 2013-06-20 18:16:09     6746
33450   2013-06-20 18:33:24 2013-06-20 18:36:55     211
12345   2013-06-20 19:01:14 2013-06-20 19:01:14     0
98765   2013-06-20 19:02:43 2013-06-20 19:02:43     0

The duration is meant to be an estimate and not actual time where eyes are on a record. I can't detect when Staff actively using this app, or when they are using a client side tool, looking up data in another application or website.

The table structure is:

CREATE TABLE IF NOT EXISTS `record_log` (
  `event_id` int(11) NOT NULL AUTO_INCREMENT,
  `userID` int(5) DEFAULT NULL,
  `recordID` int(10) DEFAULT NULL,
  `date_event` datetime DEFAULT NULL,
  PRIMARY KEY (`event_id`),
  KEY `userID` (`userID`),
  KEY `date_event` (`date_event`),
  KEY `recordID` (`recordID`),
) ENGINE=InnoDB  DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

What would be the best way to structure a query to return the second dataset? Can this be done in one query without using too many cycles? I would potentially have thousands of accessed records on a given day.

a coder
  • 7,530
  • 20
  • 84
  • 131
  • do you have an auto_increment ID? how are your records grouped together? a group changes when there's a change in recordID? – fthiella Jun 27 '13 at 18:21
  • Yes - the auto_increment is "event_id". Records would be grouped by recordID, but the catch is that I need more than a simple total of all duration for each recordID for the given time period. The same recordID can be accessed multiple times across the course of a day. Typically there will be groups of consecutive activity. Or, a recordID may have only been accessed at one time during the day. I've tried to explain and show this in my data above. – a coder Jun 27 '13 at 18:40

3 Answers3

1

Yes, it's possible to return the resultset specified, but it's not pretty. In particular, the use of inline views (MySQL calls them "derived tables") means that the rows are going to be written to a temporary MyISAM table, and that operation can be expensive for large sets, so to get performance using this approach will likely require some predicates on the innermost query to get a reasonably sized set of data... e.g.

    WHERE q.userID = 321
      AND q.date_event >= '2012-01-01'
      AND q.date_event < '2012-01-02'

NOTE: It seems like the value of "duration" should be limited... i.e. would you want a click on Friday at 6PM to be "matched" to a click on Monday at 8AM, should that be considered part of the "duration". In the query below, I specified a maximum duration value to be 6800 seconds, such that 6800 is the largest duration that could be returned, anything that would be larger than this, gets "split" into two durations.

Here's an example of a query that returns the specified resultset:

SELECT recordID
  -- , s.userID
     , s.first
     , MAX(s.date_event) AS `last`
     , MAX(TIMESTAMPDIFF(SECOND,s.first,s.date_event)) AS duration
  -- , MAX(s.cnt) AS `cnt`
  FROM ( 
         SELECT IF(r.recordID = @record_id AND r.userID = @user_id AND r.date_event < (@date_event + INTERVAL 6800 SECOND),
                @cnt := @cnt + 1, @cnt := 1) AS `cnt`
              , IF(r.recordID = @record_id AND r.userID = @user_id AND r.date_event < (@date_event + INTERVAL 6800 SECOND),
                @first, @first := r.date_event) + INTERVAL 0 SECOND AS `first`
              , @record_id := r.recordID AS recordID
              , @user_id := r.userID AS userID
              , @date_event := r.date_event  AS date_event 
           FROM ( SELECT @record_id := NULL, @user_id := NULL, @date_event := NULL, @cnt := 0, @first := NULL) i
           JOIN ( SELECT q.recordID, q.userID, q.date_event
                    FROM record_log q
                   ORDER BY q.userID, q.date_event, q.recordID
                 ) r
       ) s
 GROUP
    BY s.first
     , s.userID
     , s.recordID
 ORDER
    BY s.first
     , s.userID
     , s.recordID

NOTE: this query assumes that a "duration" on one record will "broken up" by a "duration" on another record. (If a user clicks on a record, then clicks on a different record, and then comes back to the original record for some more clicks, the clicks on the original record would get counted as two separate durations.


sample data:

INSERT INTO record_log (recordID, userID, date_event) VALUES
 ('33450','321','2013-06-20 16:22:02')
,('33450','321','2013-06-20 16:22:02')
,('33450','321','2013-06-20 16:22:24')
,('33450','321','2013-06-20 16:22:24')
,('22222','321','2013-06-20 16:22:53')
,('22222','321','2013-06-20 16:22:54')
,('12345','321','2013-06-20 16:23:43')
,('12345','321','2013-06-20 16:23:44')
,('12345','321','2013-06-20 16:24:00')
,('12345','321','2013-06-20 16:24:05')
,('12345','321','2013-06-20 16:24:05')
,('12345','321','2013-06-20 18:16:09')
,('12345','321','2013-06-20 18:16:09')
,('33450','321','2013-06-20 18:33:24')
,('33450','321','2013-06-20 18:35:11')
,('33450','321','2013-06-20 18:36:55')
,('12345','321','2013-06-20 19:01:14')
,('98765','321','2013-06-20 19:02:43')
spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • Quick follow-up. Why use `q.date_event >= '2012-01-01' AND q.date_event < '2012-01-02'` instead of `DATE(q.date_event) = '2012-01-01'`? – a coder Jun 27 '13 at 19:10
  • 1
    Good question. In a predicate (WHERE clause), we don't want to wrap a column in a function, because doing that disables an index on the column from being used to satisfy the query. With the range on the bare column, that can use an index. But if we use `DATE(q.date_event)`, then that effectively forces MySQL to evaluate the expression (i.e. call the DATE function) for EVERY row in the table. (Actually, that's the worst case; if rows are filtered out by another predicate first, then MySQL can skip evaluating the expression for the rows that are already filtered out.) – spencer7593 Jun 27 '13 at 20:02
0

I would use a SQL query with variables:

SELECT
  recordID,
  userID,
  MIN(date_event) first,
  MAX(date_event) last,
  TIME_TO_SEC(TIMEDIFF(MAX(date_event), MIN(date_event))) sec
FROM (
  SELECT
    events.*,
    CASE WHEN @last_recordID=recordID THEN @grp ELSE @grp:=@grp+1 END groupID,
    @last_recordID := recordID
  FROM
    events, (SELECT @grp:=0, @last_recordID:=NULL) r
  ORDER BY
    event_ID
  ) s
GROUP BY
  recordID,
  userID,
  groupID
ORDER BY
  groupID

Please see fiddle here.

fthiella
  • 48,073
  • 15
  • 90
  • 106
0

The simplest query to return your dataset would be:

SELECT recordID, MIN(date_event) AS `first`, MAX(date_event) AS `last`
, TIMESTAMPDIFF(SECOND, MIN(date_event), MAX(date_event)) AS `duration(seconds)`
FROM `record_log`
GROUP BY recordID

Another option that would probably be faster is to just figure out a reasonable estimate of duration for each time a customer record is accessed. The query below uses 30 seconds duration per access:

SELECT recordID, COUNT(*) AS staff_clicks, 30*COUNT(*) AS `estimated duration(seconds)`
FROM `record_log`
GROUP BY recordID

These are both very basic, but they do answer your original question. There are a lot of options but it's hard to know what to prescribe without more info (time constraints, level of accuracy needed, etc)

Andrew G
  • 349
  • 1
  • 6
  • This is indeed the simpler query. Unfortunately, it doesn't return the specified resultset. For recordID 33450, for example, it would return a single row, with a duration `2013-06-20 16:22:02` - `2013-06-20 18:36:55`, rather than two separate durations as shown in the specification. – spencer7593 Jun 27 '13 at 20:10