1

The code below does what I need, but I don't understand why, hence I don't understand if it is reliable/correct.

I am trying to calculate the duration of each Action. Each action has the starting time marked by the action name in the Action column and the ending time marked by the word 'Done'. Two actions can overlap in time if they have different users, but not with the same user.

I wanted to search for each Action = 'Done' and the record with the same user and the highest time before that one. I tried to use Max() in different positions to find the highest Start.Time such that Start.Time < End.Time, but it never worked. I tried with Max() in a subquery, but the Start in the subquery and the Start in the Join were not the same. So I removed Max() and I got what I was looking for.

Why are the Start and End records matching correctly instead of for example matching the first and the last record of each user?

Does it reliably always pick the last record of the group according to the Order by clause? Or was I just lucky?

Here is the code that creates the table, inserts some test data and does the query:

CREATE TABLE Log (
    Time     DATETIME,
    User     CHAR,
    [Action] CHAR
);

insert into Log values('2017-01-01 10:00:00', 'Joe', 'Play');
insert into Log values('2017-01-01 10:01:00', 'Joe', 'Done');
insert into Log values('2017-01-01 10:02:00', 'Joe', 'Sing');
insert into Log values('2017-01-01 10:03:00', 'Joe', 'Done');
insert into Log values('2017-01-01 10:04:00', 'Ann', 'Play');
insert into Log values('2017-01-01 10:05:00', 'Joe', 'Play');
insert into Log values('2017-01-01 10:06:00', 'Ann', 'Done');
insert into Log values('2017-01-01 10:07:00', 'Joe', 'Done');
insert into Log values('2017-01-01 10:08:00', 'Ann', 'Play');
insert into Log values('2017-01-01 10:09:00', 'Ann', 'Done');

SELECT Start.*,
       [End].*,
       strftime('%s', [End].Time) - strftime('%s', Start.Time) AS Duration
  FROM Log AS Start
       JOIN
       Log AS [End] ON Start.User = [End].User AND 
                       Start.Time < [End].Time
 WHERE [End].[Action] = 'Done'
 GROUP BY [End].Time
 ORDER BY Duration DESC,
          Start.Time;
stenci
  • 8,290
  • 14
  • 64
  • 104
  • I *think* you are just lucky, but SQLite has some strange exceptions to SQL processing that are actually documented to work. – Gordon Linoff Mar 10 '17 at 22:35

2 Answers2

3

From the official SQLite documentation for the SELECT statement:

Special processing occurs occurs when the aggregate function is either min() or max(). Example:

SELECT a, b, max(c) FROM tab1 GROUP BY a;

When the min() or max() aggregate functions are used in an aggregate query, all bare columns in the result set take values from the input row which also contains the minimum or maximum. So in the query above, the value of the "b" column in the output will be the value of the "b" column in the input row that has the largest "c" value. There is still an ambiguity if two or more of the input rows have the same minimum or maximum value or if the query contains more than one min() and/or max() aggregate function. Only the built-in min() and max() functions work this way.

So, if you have exactly one MIN or MAX call in your query, then all other selected columns that are neither aggregate functions nor GROUP BY columns will use the row (or more accurately, a row, since that column need not necessarily be UNIQUE) where the minimum or maximum value occurs.

Otherwise, it will just return the values from some arbitrary row that's part of the group.

This, of course, is an SQLite-specific behavior and is not part of standard SQL. For example, Microsoft SQL Server gives the error:

Column 'b' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

To avoid confusion and to maximize portability, I would recommend avoiding this “feature”.

Edit: Is this what you need?

SELECT
    User,
    Start.Action AS Action,
    MAX(Start.Time) AS StartTime,
    End.Time AS EndTime,
    STRFTIME('%s', End.Time) - STRFTIME('%s', MAX(Start.Time)) AS Duration
FROM Log Start INNER JOIN Log End USING (User)
WHERE Start.Action != 'Done' AND End.Action = 'Done' AND Start.Time <= End.Time
GROUP BY User, Start.Action, End.Time
dan04
  • 87,747
  • 23
  • 163
  • 198
  • Thanks, but I tried to use the max function, with and without a subquery, but I wasn't able to figure out how to use it. Can you please show me how to make my query reliable using Max? – stenci Mar 10 '17 at 23:40
2

Without max(), you get values from some random row. In the current SQLite implementation, this is the row that the database happend to see last when handling a group, so with the rows being stored with ascending timestamps, this happens to result in the row you want. However, this behaviour is not guaranteed, and even in the current version, any change to the query could make the database read the table in a different order.

The most reliable way to look up some value from the row with the next largest timestamp is with a correlated subquery:

SELECT *,
       strftime('%s', EndTime) - strftime('%s', StartTime) AS Duration
FROM (SELECT Time AS StartTime,
             (SELECT Time
              FROM Log AS L2
              WHERE L2.User   = Log.User
                AND L2.Time   > Log.Time
                AND L2.Action = 'Done'
              ORDER BY L2.Time ASC
              LIMIT 1
             ) AS EndTime,
             User,
             Action
      FROM Log
      WHERE Action != 'Done');
CL.
  • 173,858
  • 17
  • 217
  • 259
  • I like using the limit instead of the max. I find this very easy to read. – stenci Mar 13 '17 at 14:41
  • I just asked another question, followup of this one. Can you please have a look at it? http://stackoverflow.com/questions/42766634/how-to-return-two-fields-from-a-subquery – stenci Mar 13 '17 at 14:53