2

I have a 100K-row table representing sales during a particular time period. Usually the periods are at least a few hours long, but occasionally we get a period that's only a few minutes long. These tiny periods mess up downstream reporting, so I'd like to merge them with the preceding period. Any period that's 30 minutes or less should get merged with the previous period, with sales data summed across periods. There may be zero, one, or many multiple subsequent short periods between long periods. There are no time gaps in the data-- the start of one period is always the same as the end of the previous one.

What's a good set-based way (no cursors!) to perform this merging?

Existing data (simplified) looks like this:

UnitsSold     Start              End
---------------------------------------------------
10            06-12-2013 08:03   06-12-2013 12:07
12            06-12-2013 12:07   06-12-2013 16:05
1             06-12-2013 16:05   06-12-2013 16:09 
1             06-12-2013 16:09   06-12-2013 16:13 
7             06-12-2013 16:13   06-12-2013 20:10

Desired output would look like this:

UnitsSold     Start              End
---------------------------------------------------
10            06-12-2013 08:03   06-12-2013 12:07
14            06-12-2013 12:07   06-12-2013 16:13
7             06-12-2013 16:13   06-12-2013 20:10

Unfortunately we're still on SQL Server 2008 R2, so we can't leverage the cool new window functions in SQL Server 2012, which might make this problem easier to solve efficiently.

There's a good discussion of a similar problem in Merge adjacent rows in SQL?. I particularly like the PIVOT/UNPIVOT solution, but I'm stumped for how to adapt it to my problem.

Community
  • 1
  • 1
Justin Grant
  • 44,807
  • 15
  • 124
  • 208

3 Answers3

1

My idea is

  1. create list only with long periods
  2. find start of next long period with "outer apply"
  3. sum units with subquery

Something like this

declare @t table (UnitsSold int, start datetime, finish datetime)

insert into @t values (10, '20130612 08:03',   '20130612 12:07')
insert into @t values (12, '20130612 12:07',   '20130612 16:05')
insert into @t values (1, '20130612 16:05',   '20130612 16:09')
insert into @t values (1, '20130612 16:09',   '20130612 16:13')
insert into @t values (7, '20130612 16:13',   '20130612 20:10')

select
    (select SUM(UnitsSold) from @t t3 where t3.start>=t1.start and t3.finish<=ISNULL(oa.start, t1.finish)) as UnitsSold,
    t1.start,
    ISNULL(oa.start, t1.finish) as finish
from @t t1
outer apply (
    select top(1) start
    from @t t2
    where datediff(minute,t2.start, t2.finish)>30 
    and t2.start >= t1.finish
    order by t2.start
) oa
where datediff(minute, t1.start, t1.finish)>30 
Alexander Sigachov
  • 1,541
  • 11
  • 16
0

Using recursive CTE:

DECLARE @t TABLE (UnitsSold INT, Start DATETIME, Finish DATETIME)
INSERT INTO @t VALUES
    (10, '06-12-2013 08:03', '06-12-2013 12:07'),
    (12, '06-12-2013 12:07', '06-12-2013 16:05'),
    (1, '06-12-2013 16:05', '06-12-2013 16:09'),
    (1, '06-12-2013 16:09', '06-12-2013 16:13'),
    (7, '06-12-2013 16:13', '06-12-2013 20:10')

;WITH rec AS (
    -- Returns periods > 30 minutes
    SELECT u.UnitsSold, u.Start, u.Finish
    FROM @t u WHERE DATEDIFF(MINUTE, u.Start, u.Finish) > 30
    UNION ALL
    -- Adds on adjoining periods <= 30 minutes
    SELECT
        u.UnitsSold + r.UnitsSold,
        r.Start,
        u.Finish
    FROM rec r
    INNER JOIN @t u ON r.Finish = u.Start
    AND DATEDIFF(MINUTE, u.Start, u.Finish) <= 30)

-- Since the CTE also returns incomplete periods we need
-- to filter out the relevant periods, in this case the
-- last/max values for each start value.
SELECT
    MAX(r.UnitsSold) AS UnitsSold,
    r.Start AS Start,
    MAX(r.Finish) AS Finish
FROM rec r
GROUP BY r.Start
Kristofer
  • 675
  • 7
  • 15
0

Using CTE and cumulative sum:

DECLARE @t TABLE (UnitsSold INT, Start DATETIME, Finish DATETIME)
INSERT INTO @t VALUES
    (10, '06-12-2013 08:03', '06-12-2013 12:07'),
    (12, '06-12-2013 12:07', '06-12-2013 16:05'),
    (1, '06-12-2013 16:05', '06-12-2013 16:09'),
    (1, '06-12-2013 16:09', '06-12-2013 16:13'),
    (7, '06-12-2013 16:13', '06-12-2013 20:10')

;WITH groups AS (
    SELECT UnitsSold, Start, Finish,
        -- Cumulative sum, IIF returns 1 for each row that
        -- should generate a new row in the final result.
        SUM(IIF(DATEDIFF(MINUTE, Start, Finish) <= 30, 0, 1)) OVER (ORDER BY Start) csum
    FROM @t)

SELECT 
    SUM(UnitsSold) UnitsSold,
    MIN(Start) Start,
    MAX(Finish) Finish
FROM groups
GROUP BY csum
Kristofer
  • 675
  • 7
  • 15