9

I have a problem with grouping my dataset in MS SQL Server.

My table looks like

 # | CustomerID | SalesDate           | Turnover
---| ---------- | ------------------- | ---------
 1 | 1          | 2016-08-09 12:15:00 |  22.50
 2 | 1          | 2016-08-09 12:17:00 |  10.00
 3 | 1          | 2016-08-09 12:58:00 |  12.00
 4 | 1          | 2016-08-09 13:01:00 |  55.00
 5 | 1          | 2016-08-09 23:59:00 |  10.00
 6 | 1          | 2016-08-10 00:02:00 |   5.00

Now I want to group the rows where the SalesDate difference to the next row is of a maximum of 5 minutes. So that row 1 & 2, 3 & 4 and 5 & 6 are each one group.

My approach was getting the minutes with the DATEPART() function and divide the result by 5:

(DATEPART(MINUTE, SalesDate) / 5)

For row 1 and 2 the result would be 3 and grouping here would work perfectly. But for the other rows where there is a change in the hour or even in the day part of the SalesDate, the result cannot be used for grouping.

So this is where I'm stuck. I would really appreciate, if someone could point me in the right direction.

Felix
  • 2,670
  • 3
  • 13
  • 21
  • check [this](http://stackoverflow.com/questions/17560829/how-can-i-subtract-a-previous-row-in-sql) or [this](http://stackoverflow.com/questions/13734976/sql-best-way-to-subtract-a-value-of-the-previous-row-in-the-query) – techspider Aug 09 '16 at 13:58
  • You could write a query that uses LEAD or LAG to check the next row and find the time difference, then use DENSE_RANK to assign a grouping number, partitioning on difference > 5 minutes. Then group on that grouping number. This is an intensive process if you are going to write it in one query against many rows. – Nick.Mc Aug 09 '16 at 13:59
  • I think this is a good first question. – Gordon Linoff Aug 09 '16 at 14:08

2 Answers2

5

You want to group adjacent transactions based on the timing between them. The idea is to assign some sort of grouping identifier, and then use that for aggregation.

Here is an approach:

  • Identify group starts using lag() and date arithmetic.
  • Do a cumulative sum of the group starts to identify each group.
  • Aggregate

The query looks like this:

select customerid, min(salesdate), max(saledate), sum(turnover)
from (select t.*,
             sum(case when salesdate > dateadd(minute, 5, prev_salesdate)
                      then 1 else 0
                 end) over (partition by customerid order by salesdate) as grp
      from (select t.*,
                   lag(salesdate) over (partition by customerid order by salesdate) as prev_salesdate
            from t
           ) t
     ) t
group by customerid, grp;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
2

EDIT

Thanks to @JoeFarrell for pointing out I have answered the wrong question. The OP is looking for dynamic time differences between rows, but this approach creates fixed boundaries.

Original Answer

You could create a time table. This is a table that contains one record for each second of the day. Your table would have a second column that you can use to perform group bys on.

CREATE TABLE [Time]
    (
        TimeId      TIME(0) PRIMARY KEY,
        TimeGroup   TIME
    )
;

-- You could use a loop here instead.
INSERT INTO [Time]
    (
        TimeId,
        TimeGroup
    )
VALUES
    ('00:00:00', '00:00:00'),    -- First group starts here.
    ('00:00:01', '00:00:00'),
    ('00:00:02', '00:00:00'),
    ('00:00:03', '00:00:00'),
    ...
    ('00:04:59', '00:00:00'),
    ('00:05:00', '00:05:00'),    -- Second group starts here.
    ('00:05:01', '00:05:00')
;

The approach works best when:

  1. You need to reuse your custom grouping in several different queries.
  2. You have two or more custom groups you often use.

Once populated you can simply join to the table and output the desired result.

/* Using the time table.
 */
SELECT
    t.TimeGroup,
    SUM(Turnover) AS SumOfTurnover
FROM
    Sales AS s
        INNER JOIN [Time] AS t      ON t.TimeId = CAST(s.SalesDate AS Time(0))
GROUP BY
    t.TimeGroup
;
David Rushton
  • 4,915
  • 1
  • 17
  • 31
  • 1
    I don't think this will work. If you have two entries whose times are 00:04:59 and 00:05:00, your solution will group them separately even though they're a second apart. This is the same defect the OP reported seeing in his own first attempt. – Joe Farrell Aug 09 '16 at 14:28
  • Yes you are right. I had misunderstood the question. The OP doesn't want hard coded boundaries. Instead they are looking for a dynamic difference. – David Rushton Aug 09 '16 at 14:31