2

Sample data below

    id   start  end
    a     1      3
    a     5      6
    a     8      9
    b     2      4
    b     6      7
    b     9      10
    c     2      4
    c     6      7
    c     9      10

I'm trying to come up with a query that will return all the overlap start-end inclusive between a, b, and c (but extendable to more). So the expected data will look like the following

    start  end
     2      3
     6      6
     9      9

The only way I can picture this is with a custom aggregate function that tracks the current valid intervals then computes the new intervals during the iterate phase. However I can't see this approach being practical when working with large datasets. So if some bright mind out there have a query or some innate function that I'm not aware of I would greatly appreciate the help.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
user1111111
  • 403
  • 2
  • 4
  • 13

2 Answers2

2

You can do this using aggregation and a join. Assuming no internal overlaps for "a" and "b":

select greatest(ta.start, tb.start) as start,
       least(ta.end, tb.end) as end
from t ta join
     t tb
     on ta.start <= tb.end and ta.end >= tb.start and
        ta.id = 'a' and tb.id = 'b';
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks for your kind words above and your solution here. I fliped your last condition from ta.id = tb.id to <> and it works so well. Thank you very much! P.S. However the only concern I have with this is when working with very large datasets wouldn't joining a table onto itself drastically hinder performance? – user1111111 Apr 12 '16 at 19:23
  • @user3758659 Should not be ta.id < tb.id to exclude repeating and include also the ta.id and tb.id in the select? – Mottor Apr 12 '16 at 20:09
  • It doesn't seem like this works generally for cases where there are more than 2 different ids. – Mike Apr 12 '16 at 20:16
  • After reading some of the comments here and testing it out for myself I find that this does indeed not work well for increased number of ids. However I do still very much appreciate the answer! – user1111111 Apr 13 '16 at 00:25
  • @user3758659 . . . Ooops, I have the wrong condition on the ids. – Gordon Linoff Apr 13 '16 at 00:46
  • I have tested it and I do not see anything wrong here: SELECT ta.id,tb.id,GREATEST (ta.r_start, tb.r_start) as r_start, least(ta.r_end, tb.r_end) as r_end from t ta join t tb on ta.r_start <= tb.r_end and ta.r_end >= tb.r_start and ta.id < tb.id – Mottor Apr 13 '16 at 07:19
1

This is a lot uglier and more complex than Gordon's solution, but I think it gives the expected answer better and should extend to work with more ids:

WITH NUMS(N) AS (  --GENERATE NUMBERS N FROM THE SMALLEST START VALUE TO THE LARGEST END VALUE
  SELECT MIN("START") N FROM T
  UNION ALL
  SELECT N+1 FROM NUMS WHERE N < (SELECT MAX("END") FROM T)
),
SEQS(N,START_RANK,END_RANK) AS (
  SELECT N,
    CASE WHEN IS_START=1 THEN ROW_NUMBER() OVER (PARTITION BY IS_START ORDER BY N) ELSE 0 END START_RANK, --ASSIGN A RANK TO EACH RANGE START
    CASE WHEN IS_END=1 THEN ROW_NUMBER() OVER (PARTITION BY IS_END ORDER BY N) ELSE 0 END END_RANK --ASSIGN A RANK TO EACH RANGE END
  FROM (
          SELECT N,
              CASE WHEN NVL(LAG(N) OVER (ORDER BY N),N) + 1 <> N THEN 1 ELSE 0 END IS_START, --MARK N AS A RANGE START
              CASE WHEN NVL(LEAD(N) OVER (ORDER BY N),N) -1 <> N THEN 1 ELSE 0 END IS_END /* MARK N AS A RANGE END */ 
              FROM (
                SELECT DISTINCT N FROM ( --GET THE SET OF NUMBERS N THAT ARE INCLUDED IN ALL ID RANGES
                  SELECT NUMS.*,T.*,COUNT(*) OVER (PARTITION BY N) N_CNT,COUNT(DISTINCT "ID") OVER () ID_CNT 
                  FROM NUMS
                  JOIN T ON (NUMS.N >= T."START" AND NUMS.N <= T."END")
                  ) WHERE N_CNT=ID_CNT
              )
    ) WHERE IS_START + IS_END > 0
)
SELECT STARTS.N "START",ENDS.N "END" FROM SEQS STARTS
JOIN SEQS ENDS ON (STARTS.START_RANK=ENDS.END_RANK AND STARTS.N <= ENDS.N) ORDER BY "START"; --MATCH CORRESPONDING RANGE START/END VALUES 

First we generate all the numbers between the smallest start value and the largest end value.

Then we find the numbers that are included in all the provided "id" ranges by joining our generated numbers to the ranges, and selecting each number "n" that appears once for each "id".

Then we determine whether each of these values "n" starts or ends a range. To determine that, for each N we say: If the previous value of N does not exist or is not 1 less than current N, current N starts a range. If the next value of N does not exist or is not 1 greater than current N, current N ends a range.

Next, we assign a "rank" to each start and end value so we can match them up.

Finally, we self-join where the ranks match (and where the start <= the end) to get our result.

EDIT: After some searching, I came across this question which shows a better way to find the start/ends and refactored the query to:

WITH NUMS(N) AS (  --GENERATE NUMBERS N FROM THE SMALLEST START VALUE TO THE LARGEST END VALUE
  SELECT MIN("START") N FROM T
  UNION ALL
  SELECT N+1 FROM NUMS WHERE N < (SELECT MAX("END") FROM T)
)
SELECT MIN(N) "START",MAX(N) "END" FROM (
  SELECT N,ROW_NUMBER() OVER (ORDER BY N)-N GRP_ID
                FROM (
                  SELECT DISTINCT N FROM ( --GET THE SET OF NUMBERS N THAT ARE INCLUDED IN ALL ID RANGES
                    SELECT NUMS.*,T.*,COUNT(*) OVER (PARTITION BY N) N_CNT,COUNT(DISTINCT "ID") OVER () ID_CNT 
                    FROM NUMS
                    JOIN T ON (NUMS.N >= T."START" AND NUMS.N <= T."END")
                    ) WHERE N_CNT=ID_CNT
  )
) 
GROUP BY GRP_ID ORDER BY "START";
Community
  • 1
  • 1
Mike
  • 1,039
  • 1
  • 12
  • 20
  • Thanks for the answer Mike, I've tested your solution and it works like a charm. I will do some analysis myself and get a good understanding of your solution. Thank you very much. – user1111111 Apr 13 '16 at 00:26