0

I'm am trying to find a way to compare a serives of data ranges to find gaps, however i need to exclude date ranges that are wholly within another range. Some example data:

PERSON_ID  START_DATE  END_DATE 
0001       01/05/2014  30/11/2014 
0001       01/06/2014  01/08/2014 
0001       01/07/2014  01/11/2014 
0001       01/12/2014  31/03/2015 

I know that i can use the LEAD function to compare one line to the next to see where the gap is if there is one, e.g.:

SELECT END_DATE 
FROM 
   (SELECT t.*, 
      lead(START_DATE,1) OVER (ORDER BY START_DATE) AS next_date 
    FROM table t 
   ) 
WHERE END_DATE+1<>next_date; 

The issue is that this would bring back a false positive. The second and third rows of date ranges are wholly contained within the first and therefore should not be included in the gap calculations. I know i need to amend the offset argument in the LEAD function but i'm not sure of an efficient way of doing this for hundreds of person ids. Any thoughts?

2 Answers2

2

You could try something like:

SELECT person_id
     , start_date + 1 start_date
     , end_date - 1 end_date
FROM
  (SELECT person_id
        , end_date start_date
        , lead(start_date) OVER
            (PARTITION BY person_id
             ORDER BY start_date) end_date
   FROM
     (SELECT person_id
           , start_date
           , max(end_date) KEEP
               (DENSE_RANK LAST
                ORDER BY end_date
                       , start_date
                NULLS LAST) end_date
      FROM
        (SELECT person_id
              , CONNECT_BY_ROOT start_date start_date
              , end_date
         FROM
           (SELECT person_id
                 , start_date
                 , end_date
                 , min(start_date) OVER
                     (PARTITION BY person_id) min_start_date
                 , lag(end_date) OVER
                     (PARTITION BY person_id
                      ORDER BY end_date
                             , start_date) lag_end_date
            FROM mytable)
         START WITH
            (  start_date = min_start_date
            OR start_date > lag_end_date + 1)
         CONNECT BY
                person_id = PRIOR person_id
            AND start_date > PRIOR start_date
            AND (  start_date <= PRIOR end_date + 1
                OR PRIOR end_date IS NULL))
      GROUP BY person_id
             , start_date))
WHERE end_date IS NOT NULL

This will merge overlapping ranges e.g. 01-Apr-2014 to 31-May-2014 and 01-May-2014 to 30-Jun-2014 will be treated as a single range of 01-Apr-2014 to 30-Jun-2014. It will also merge abutting ranges e.g. 01-Apr-2014 to 30-Apr-2014 and 01-May-2014 to 31-May-2014 will be treated as a single range of 01-Apr-2014 to 31-May-2014. You will need to alter this query if this is not how you wish to treat these conditions.

DrabJay
  • 2,989
  • 2
  • 13
  • 12
  • Hi DrabJay, thanks for the suggestion. I have implemented your code as written but it does seem to not work correctly. I have a case with 2 ranges, 31/03/14 - 15/08/14 and 15/08/14 - 31/03/15. I expect 31/03/14 - 31/03/15 but the code returns nothing for that person id. Any idea what could be wrong? – Chizo Ejindu Sep 23 '14 at 12:05
  • @ChizoEjindu You have two ranges 31-Mar-2014 to 15-Aug-2014 and 15-Aug-2014 to 31-Mar-2015. These overlapping ranges will, as per my comment, be merged into a single range of 31-Mar-2014 to 31-Mar-2015. Your requirement was "to find gaps". There are no gaps in this single range and hence no records are returned. – DrabJay Sep 23 '14 at 13:06
  • Ah yes I see your point, apologies for my lack of understanding! Now I understand what this is doing I can implement the rest of the code around it. Thanks for your help! – Chizo Ejindu Sep 23 '14 at 13:22
0

Since you said you are getting a false positive, you could turn it into positive by doing something like this:

SELECT * FROM table t where END_DATE
NOT IN (SELECT END_DATE 
FROM 
(SELECT t.*, 
  lead(START_DATE,1) OVER (ORDER BY START_DATE) AS next_date 
FROM table t 
) 
WHERE END_DATE+1<>next_date);

I hope it gives you a clue to get what you want without changing your offset arguments.

Neels
  • 2,547
  • 6
  • 33
  • 40
  • Hi Neels, thanks for the comment but i think i haven't explained my issue properly. Looking at the table on a line by line basis the LEAD function would pick up the gap between line 3 and 4 which in isolation is correct. However for the person id this is incorrect as line 1 wholly covers the dates in line 2 and 3 and therefore line 2 and 3 should be excluded from the calculations. The LEAD function needs to calculate only on line 1 and line 4 for this particular example – Chizo Ejindu Sep 23 '14 at 09:25
  • Could you please edit your question to give us all a clearer picture. – Neels Sep 23 '14 at 09:40