
I have the following columns in my logs table in Postgres:

  1. logid => int (auto increment)
  2. user_id => int
  3. start_time => bigint (stores epoch value)
  4. inserted_value => int

Following is the data stored in the table ("start time actual" is not a column; it just shows the start_time value in UTC, 24-hour format):

logid   user_id    start_time       inserted_value       start time actual
 1      1          1518416562       15                   12-Feb-2018 06:22:42
 2      1          1518416622       8                    12-Feb-2018 06:23:42 
 3      1          1518417342       9                    12-Feb-2018 06:35:42 
 4      1          1518417402       12                   12-Feb-2018 06:36:42 
 5      1          1518417462       18                   12-Feb-2018 06:37:42
 6      1          1518418757       6                    12-Feb-2018 06:59:17 
 7      1          1518418808       11                   12-Feb-2018 07:00:08

I want to group the rows and sum inserted_value according to the difference in start_time.

For the above data, the sum should be calculated in three groups:

  user_id         sum
   1              15 + 8
   1              9 + 12 + 18
   1              6 + 11

So, consecutive values in each group are 1 minute apart. This 1 minute could be any x-minute difference.
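The grouping rule can be sketched outside of SQL. This is a minimal Python illustration (not Postgres), assuming rows are already ordered by start_time: a new group starts whenever the gap to the previous row exceeds x seconds (here x = 60).

```python
# (start_time, inserted_value) pairs from the table above, ordered by start_time
rows = [
    (1518416562, 15), (1518416622, 8), (1518417342, 9),
    (1518417402, 12), (1518417462, 18), (1518418757, 6), (1518418808, 11),
]

def group_sums(rows, gap=60):
    sums = []
    prev = None
    for t, v in rows:
        if prev is None or t - prev > gap:
            sums.append(0)   # gap larger than the threshold: open a new group
        sums[-1] += v        # add the value to the current group's sum
        prev = t
    return sums

print(group_sums(rows))  # [23, 39, 17] — i.e. 15+8, 9+12+18, 6+11
```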

I also tried the LAG function but could not understand it fully. I hope I was able to explain my question.

Salman A
Ashutosh

2 Answers


You can use a plain GROUP BY to achieve what you want: just make all start_time values that belong to the same minute equal. For example:

select user_id, start_time/60, sum(inserted_value)
from log_table
group by user_id, start_time/60

I assume your start_time column contains integers representing epoch seconds, so /60 (integer division) will properly truncate them to minutes. If the values are floats, use floor(start_time/60) instead.

If you also want to select a human-readable timestamp for the minute you're grouping by, add to_timestamp((start_time/60)*60) to the select list.
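A quick Python check (not Postgres) of what this fixed-bucket division does to the sample data: integer-dividing epoch seconds by 60 assigns each row to its calendar minute.

```python
# The first five start_time values from the question
times = [1518416562, 1518416622, 1518417342, 1518417402, 1518417462]
buckets = [t // 60 for t in times]
print(buckets)  # five distinct minute numbers — each sample lands in its own bucket
```

Note that with this data every row falls into a different minute bucket, which is why this approach produces per-minute sums rather than the gap-based groups the question asks for.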

dnswlt
  • start_time stores epoch value in milliseconds, means number of milliseconds passed since 1-Jan-1970 – Ashutosh Feb 12 '18 at 07:47
  • I am afraid it is not that simple. `1518416562` is `2018-02-12T06:22:42` and `1518416622` is `2018-02-12T06:23:42`. I _think_ `06:22:42` and `06:23:42` must be grouped together but `06:22:42` and `06:23:43` should not. – Salman A Feb 12 '18 at 07:49
  • If 06:22:42 and 06:23:42 must be grouped together, you span more than one minute. By transitivity, 06:23:42 and 06:24:42 must then also be grouped together, which would lump all times into a single group :-) – dnswlt Feb 12 '18 at 08:07
  • @dnswlt Edited my question, added 2 more log entries to make question more clear – Ashutosh Feb 12 '18 at 08:27
  • OK, now I see! You want to sum values as long as the difference to the previous entry isn't greater than a minute. Then @SalmanA's solution using window functions is probably the way to go. – dnswlt Feb 12 '18 at 08:35

You can use LAG to check whether the current row's start_time is more than 60 seconds after the previous row's, and set group_changed (a computed column) to 1 each time that happens.

In the next step, compute a running sum over that column. This produces a group_number that you can use to group the results in the third step.

WITH cte1 AS (
    SELECT
        testdata.*,
        CASE WHEN start_time - LAG(start_time, 1, start_time) OVER (PARTITION BY user_id ORDER BY start_time) > 60 THEN 1 ELSE 0 END AS group_changed
    FROM testdata
), cte2 AS (
    SELECT
        cte1.*,
        SUM(group_changed) OVER (PARTITION BY user_id ORDER BY start_time) AS group_number
    FROM cte1
)
SELECT user_id, SUM(inserted_value)
FROM cte2
GROUP BY user_id, group_number

SQL Fiddle
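As a sanity check, the query above can be run outside Postgres too. This sketch uses Python's sqlite3 module (SQLite 3.25+ supports LAG and windowed SUM); "testdata" is the answer's placeholder table name, loaded with the question's rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Load the sample data from the question
conn.executescript("""
CREATE TABLE testdata (logid INTEGER, user_id INTEGER,
                       start_time INTEGER, inserted_value INTEGER);
INSERT INTO testdata VALUES
  (1,1,1518416562,15),(2,1,1518416622,8),(3,1,1518417342,9),
  (4,1,1518417402,12),(5,1,1518417462,18),(6,1,1518418757,6),
  (7,1,1518418808,11);
""")
# The three-step query from the answer: flag gaps > 60 s, number the groups
# with a running sum, then aggregate per group.
rows = conn.execute("""
WITH cte1 AS (
    SELECT testdata.*,
        CASE WHEN start_time - LAG(start_time, 1, start_time)
                 OVER (PARTITION BY user_id ORDER BY start_time) > 60
             THEN 1 ELSE 0 END AS group_changed
    FROM testdata
), cte2 AS (
    SELECT cte1.*,
        SUM(group_changed) OVER (PARTITION BY user_id ORDER BY start_time)
            AS group_number
    FROM cte1
)
SELECT user_id, SUM(inserted_value)
FROM cte2
GROUP BY user_id, group_number
ORDER BY group_number
""").fetchall()
print(rows)  # [(1, 23), (1, 39), (1, 17)]
```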

Salman A