Identifying the boundaries of N-groups

Question

I have spent quite some time dealing with the following:

Imagine that you have N number of groups with multiple records each and every record has unique starting and ending points.

In other words:

ID|GroupName|StartingPoint|EndingPoint|seq(row_number)|desired_seq
__|_________|_____________|___________|_______________|____________
1 | Grp1    |2014-01-06   |2014-01-07 |1              |1
__|_________|_____________|___________|_______________|____________
2 | Grp1    |2014-01-07   | 2014-01-08|2              |2
__|_________|_____________|___________|_______________|____________
3 | Grp2    |2014-01-08   | 2014-01-09|1              |1
__|_________|_____________|___________|_______________|____________
4 | Grp1    |2014-01-09   | 2014-01-10|3              |1
__|_________|_____________|___________|_______________|____________
5 | Grp2    |2014-01-10   | 2014-01-11|2              |1
__|_________|_____________|___________|_______________|____________

As you can see, the starting point for every consecutive record is the same as the ending point of the previous.

Basically, I would like to obtain the minimumS and maximumS for each group based on the dates. Once a record with new group name appears, then consider it as a new group and reset the sequencing.

Single row_number() function is not sufficient enough for this task since it doesnt reflect the change in the group names.(I have included a seq column in the sample data which represents the values generated by row number)

Desired result based on the sample data:

1  Grp1    |2014-01-06   |  2014-01-08  
2  Grp2    |2014-01-08   |  2014-01-09
3  Grp1    |2014-01-09   |  2014-01-10
4  Grp2    |2014-01-10   |  2014-01-11

What I have tried:

;with cte as(
select *
, row_number() over (partition by GroupName order by startingpoint) as seq
from table1
)
select * 
into #temp2
from cte t1
left join cte t2 on t1.id=t2.id and t1.seq= t2.seq-1

select * 
,(select startingPoint from #temp2 t2 where t1.id=t2.id and t2.seq= (select MIN(seq) from #temp2) as Oldest
(select startingPoint from #temp2 t2 where t1.id=t2.id and t2.seq= (select MAX(seq) from #temp2) as MostRecent
from #temp2 t1

Judging from the table it seems like you could just use `MIN` and `MAX` unless I'm missing something. — Zane, Jan 09 '14 at 15:59

Anon · Accepted Answer · 2014-01-09T18:25:59.407

This is a gaps-and-islands problem with subgrouping. The trick is grouping by the difference between two ROW_NUMBER() values, one partitioned and one unpartitioned.

WITH t AS (
  SELECT
    GroupName,
    StartingPoint,
    EndingPoint,
    ROW_NUMBER() OVER(PARTITION BY GroupName ORDER BY StartingPoint)
      - ROW_NUMBER() OVER(ORDER BY StartingPoint) AS SubGroupId
  FROM #test
)
SELECT
  ROW_NUMBER() OVER (ORDER BY MIN(StartingPoint)) AS SortOrderId,
  GroupName                                       AS GroupName,
  MIN(StartingPoint)                              AS GroupStartingPoint,
  MAX(EndingPoint)                                AS GroupEndingPoint
FROM t
GROUP BY GroupName, SubGroupId
ORDER BY SortOrderId

Oleg Sklyar · Answer 2 · 2014-01-09T15:52:21.280

0

Not sure, but maybe:

SELECT DISTINCT 
    GroupName, 
    MIN(StartingPoint) OVER (PARTITION BY GroupName ORDER BY Id), 
    MAX(EndingPoint) OVER (PARTITION BY GroupName ORDER BY Id)
FROM table1

Because partition does not lead to the reduction of number of rows there will be originally duplicated entries, which are removed with distinct.

edited Jan 09 '14 at 15:52

answered Jan 09 '14 at 15:42

Oleg Sklyar

9,834
6
39
62

score 0 · Answer 3 · answered Jan 09 '14 at 15:48

0

This is so much easier with the lag() functionality in SQL Server 2012. The way I approach these problems is to find where groups start, assigning a flag of 1 or 0 to each row. Then take a cumulative sum of the 1s to get a new group id.

In SQL Server 2008, you can do this with correlated subqueries (or joins):

with table1_flag as (
      select t1.*,
             isnull((select top 1 1
                     from table1 t2
                     where t2.groupname = t1.groupname and
                           t2.endingpoint = t1.startingpoint
                    ), 0) as groupstartflag
      from table1 t1
     ),
     table1_flag_cum as (
      select tf.*,
             (select sum(groupstartflag)
              from table1_flag tf2
              where tf2.groupname = tf.groupname and
                    tf2.startingpoint <= tf.startingpoint
             ) as groupnum
      from table1_flag tf
     )
select groupnum, groupname,
       min(startingpoint) as startingpoint, max(endingpoint) as endingpoint
from table1_flag_cum
group by groupnum, groupname;

answered Jan 09 '14 at 15:48

Gordon Linoff

1,242,037
58
646
786

Thanks for you help. I have tested the query [SQLFiddle](http://sqlfiddle.com/#!3/87a45/2), however I am not able to adjust it as per my reuirements. Your query returns 07-10 for Grp1 and 08-11 for Grp2, which means that grps 2 is included in grp1 – Kiril Rusev Jan 09 '14 at 16:24
@Kiril . . . It is including the `groupname` in every comparison, including the final `group by`. The groups should not be interfering with each other. – Gordon Linoff Jan 09 '14 at 16:30
Well.This is what I would expect but, I am still geting multiple Groups associated with the same date period. – Kiril Rusev Jan 09 '14 at 16:36
I guess, my question is a bit missleadnig. Essentially what I want is to flag the row as a new group every time when the previous row contains different GroupName. Also, if there is a new row containing the exact same GroupName as any of the previous rows, consider it as new group. – Kiril Rusev Jan 09 '14 at 16:47

Identifying the boundaries of N-groups

3 Answers3