Consolidate adjacent, overlapping and embedded ranges into mutually exclusive ranges

Question

The environment is SQL Server 2014.

I am dealing with reducing many insurance enrollment details (small ranges of first and last) into much larger mutually exclusive (ME) ranges of continuous enrollment.

For clarity the problem is reduced to sample data sorted by id, first, last. F(n) and L(n) are the first and last values in record n within id.

Most detail ranges are typical

adjacent, F(n) = L(n-1) + 1

But there is devil in the details -- welcome to real world data.

connected non-adjacent, F(n) <= L(n-1)
- embedded, L(n) <= L(n-1)
- overlap, L(n) > L(n-1)
disconnected non-adjacent
- gap defines boundaries of consolidated ranges that are mutually exclusive
- ME(i).last = max of prior L

This picture demonstrates most cases

Have
  1      30       60       90      120
  +-------+--------+--------+--------+
1 +-------+                             (1:30)
2          +-------+                    (31:60) adjacent
3             +--+                      (40:50) embedded
4                   +                   (61:61) adjacent some earlier
5                   +-+                 (61:65) adjacent some earlier
6                   +--+                (61:75) adjacent some earlier
7                     +--+              (65:80) overlap
8                          +---------+  (85:120) gap, boundaries of ME ranges located
9                            +-------+  (91:120)
10                                +--+  (110:120)

Want

  1      30       60       90      120
  +-------+--------+--------+--------+
1 +----------------------+              (1:80)
2                          +---------+  (85:120)

There are other unusual cases, such as embed followed by gap

.....
  ..
      ....
AAAAA BBBB


DROP TABLE #Details
CREATE TABLE #Details (id int, first int, last int);

insert into #Details values (1,   1, 30);
insert into #Details values (1,  31, 60);
insert into #Details values (1,  40, 50);
insert into #Details values (1,  61, 75);
insert into #Details values (1,  65, 80);
insert into #Details values (1,  85, 120);
insert into #Details values (1,  91, 120);
insert into #Details values (1, 110, 120);

I read some answers on stack and Refactoring Ranges but couldn't make the leap to my data arrangement.

--For jpw--

Typical analysis may involve 20,000 ids with 200 detail records. These cases have been handled by downloading to a local machine and processed (in a cursor like manner) in a SAS Data step. Worst case is order of 650K ids and 150M detail -- too much data for downloading way, and leads to other resources issues. I believe all details might be in range of 1.2B rows. Regardless, if it can all be done in SQL server the whole process is simplified.

Approximately how many rows are there in the table? Hundreds, thousands, millions or more? — jpw, Mar 15 '15 at 23:18
Typical analysis may involve 20,000 ids with 200 detail records. — Richard, Mar 17 '15 at 06:02
I tried my approach on a larger data set and it turned out to not work at all... going to think a bit more about it. — jpw, Mar 17 '15 at 14:36
I am working a couple examples for you Richard. I've dealt with something similar. Just trying to find my code. Also with 2014 the LEAD window function is helpful. — Matt, Mar 25 '15 at 23:16

score 1 · Accepted Answer · answered Mar 26 '15 at 19:04

All right this answer will get you close. Feels a little over baked to me, but definitely is on the right track. I am sure you can tailor this to your needs. The crux of the problem is establishing overlap families. I used a recursive cte after establishing parent list starts. Please see my explanation below for more detail.

Initial Data

USERID      RangeStart  RangeEnd
----------- ----------- -----------
1           1           2
1           2           4
1           3           5
1           6           7
2           1           3
2           5           9
2           11          14
2           14          15

Query

DECLARE @USERID TABLE (USERID INT, RangeStart INT, RangeEnd INT)
INSERT INTO @USERID (USERID, RangeStart,RangeEnd) VALUES
(1,1,2),(1,2,4),(1,3,5),(1,6,7),
(2,1,3),(2,5,9),(2,11,14),(2,14,15)

;WITH Data AS (
    SELECT  ROW_NUMBER() OVER (ORDER BY USERID, RangeStart) AS MasterOrdering,
            USERID,
            RangeStart,
            RangeEnd,
            LAG(RangeStart) OVER (PARTITION BY USERID ORDER BY RangeStart ASC) AS PreviousStart,
            LAG(RangeEnd) OVER (PARTITION BY USERID ORDER BY RangeStart ASC) AS PreviousEnd
    FROM    @USERID
), ParentChild AS (
    SELECT  *,
            Parent  =   CASE
                            WHEN PreviousStart IS NULL AND PreviousEnd IS NULL THEN MasterOrdering
                            WHEN PreviousEnd NOT BETWEEN RangeStart AND RangeEnd THEN MasterOrdering
                            ELSE 0
                        END
    FROM    Data
), Family AS (
    SELECT  MasterOrdering,
            USERID,
            RangeStart,
            RangeEnd,
            PreviousStart,
            PreviousEnd,
            Parent
    FROM    ParentChild
    WHERE   Parent > 0
    UNION   ALL
    SELECT  A.MasterOrdering,
            A.USERID,
            A.RangeStart,
            A.RangeEnd,
            A.PreviousStart,
            A.PreviousEnd,
            F.Parent
    FROM    ParentChild AS A
            INNER JOIN Family AS F ON ( A.MasterOrdering = F.MasterOrdering + 1 
                                        AND A.parent = 0)
)
SELECT  USERID, 
        MIN(RangeStart) AS RangeStart, 
        MAX(RangeEnd) AS RangeEnd,
        MIN(MasterOrdering) AS MasterOrdering
FROM    Family
GROUP   BY UserID,Parent
ORDER   BY MIN(MasterOrdering)

Results

USERID      RangeStart  RangeEnd    MasterOrdering
----------- ----------- ----------- --------------------
1           1           5           1
1           6           7           4
2           1           3           5
2           5           9           6
2           11          15          7

Query Walk Through

Assumptions

SQL Server 2014
A reasonable understanding of window functions
Firm grasp of CTE's and Recursive CTE's in particular.

Step By Step

Begins by querying for Data. This uses the ROW_NUMBER function to establish a sequential ORDER based on the USERID and Ascending RangeStarts. Its used later to organize the list back to this order. The LAG functions retrieve the previous rows PreviousStart and PreviousEnd dates. This also gets used when establishing parents and the number gets used as a family identifier based on that parent id.
ParentChild populates the Parent column based on 2 rules. If the previousstart and previousend values are NULL it means that within the partion they are the first item. We assign them automatically as a parent row. Then if the PreviousEnd is not between the start and end range we then use that as a parent as well.
Family is where the real magic is. Using a recursive CTE we query for all the parents and then union all the non parents to their associated master order + 1. The plus + 1 ensures we populate all the 0's and the A.parent = 0 predicate ensures we are only joining non assigned family members to the parent range.
In the final out we simply have a min and max group by the userid and the parent column (which now is a unique number per overlapping family).

Have a look. Great question and a bit of a fun brain teaser.

Cheers

Matt

Consolidate adjacent, overlapping and embedded ranges into mutually exclusive ranges

1 Answers1