1

I need to group data by id and find max/min (date_from, date_to). But if there is date gap it should be new row.

I have following data:

SYS_ID  ITEM_ID DATE_FROM   DATE_TO
1       1       01.01.2019  20.01.2019
1       1       15.01.2019  10.02.2019
1       1       15.02.2019  20.02.2019
1       1       18.02.2019  10.03.2019
1       1       10.03.2019  22.03.2019
1       2       01.01.2019  10.01.2019
1       2       15.01.2019  25.01.2019

Result should be:

SYS_ID  ITEM_ID DATE_FROM   DATE_TO
1       1       01.01.2019  10.02.2019
1       1       15.02.2019  22.03.2019
1       2       01.01.2019  10.01.2019
1       2       15.01.2019  25.01.2019

Is there a way to do this without using cursor?

Michael Buen
  • 38,643
  • 9
  • 94
  • 118
N.Grave
  • 11
  • 1
  • Your intervals do not have gaps. They are overlapping. – ceving Apr 29 '19 at 15:49
  • 1
    Possible duplicate of [Merge overlapping time intervals, how?](https://stackoverflow.com/questions/19259519/merge-overlapping-time-intervals-how) – ceving Apr 29 '19 at 15:50
  • Can two periods ever start on the same date for the same date for the same sys/item pair? – Gordon Linoff Apr 29 '19 at 16:41
  • @ceiving They are indeed overlapping, but they still do have gaps. See the answer: https://stackoverflow.com/questions/55906748/group-rows-by-id-and-find-max-mindate-from-date-to-with-date-gaps/55907142#55907142 – Michael Buen Apr 29 '19 at 17:11

1 Answers1

3

Use gaps and islands approach

Live test: http://sqlfiddle.com/#!18/0174b/3

with gap_detector as
(
     select
        sys_id, item_id,
        date_from, date_to,
        case when 
            lag(date_to) 
            over(partition by sys_id, item_id order by date_from) >= date_from
        then
            0
        else
            1
        end as gap
     from tbl
 )
 , grouper as
 (
     select
         sys_id, item_id,
         date_from, date_to,
         sum(gap) over(partition by sys_id, item_id order by date_from) as grp
     from gap_detector   
)  
select
    sys_id, item_id,
    min(date_from) as date_from,
    max(date_to) as date_to
from grouper
group by sys_id, item_id, grp

Output:

| sys_id | item_id |  date_from |    date_to |
|--------|---------|------------|------------|
|      1 |       1 | 2019-01-01 | 2019-02-10 |
|      1 |       1 | 2019-02-15 | 2019-03-22 |
|      1 |       2 | 2019-01-01 | 2019-01-10 |
|      1 |       2 | 2019-01-15 | 2019-01-25 |

How it works

First we need to detect if date_to from previous row (using lag) overlaps with current date_from.

Note that we have independent sets of date_from, that is, the previous row of sys_id + item_id combo (e.g., 1,1) does not overlap with another sys_id + item_id combo (e.g., 1,2). So the first previous date_to of 1,2 is not March 22, 2019, it's NULL instead. We can properly identify the previous row of each sys_id + item_id combo by partitioning them, i.e., partition by sys_id, item_id.

With that said here is how we can identify if date_to from previous row overlaps with current date_from:

  • If the current date_from overlaps with previous date_to, don't isolate the current date_from from the previous row, we can do this by giving the current row a value of 0.
  • Otherwise, if the current date_from does not overlap with previous date_to, isolate (in other word gap) the current row from previous row, by marking the current row as a gap, we can do this by giving it a value of 1. It will come later why we need 1 and 0.

Live test: http://sqlfiddle.com/#!18/0174b/7

with gap_detector as
(
     select
        sys_id, item_id,
        date_from, date_to,
        case when 
            lag(date_to) 
            over(partition by sys_id, item_id order by date_from) >= date_from
        then
            0
        else
            1
        end as gap
     from tbl
)
select * 
from gap_detector
order by sys_id, item_id, date_from

Output:

| sys_id | item_id |  date_from |    date_to | gap |
|--------|---------|------------|------------|-----|
|      1 |       1 | 2019-01-01 | 2019-01-20 |   1 |
|      1 |       1 | 2019-01-15 | 2019-02-10 |   0 |
|      1 |       1 | 2019-02-15 | 2019-02-20 |   1 |
|      1 |       1 | 2019-02-18 | 2019-03-10 |   0 |
|      1 |       1 | 2019-03-10 | 2019-03-22 |   0 |
|      1 |       2 | 2019-01-01 | 2019-01-10 |   1 |
|      1 |       2 | 2019-01-15 | 2019-01-25 |   1 |        

Next step is to group the the islands that belong to each other by doing a running total over the gap markers (1 and 0). Running total is done by doing a sum(gap) over the window of sys_id + item_id combo.

Each window of sys_id + item_id combo can be operated on independently by doing a partition on them, i.e., partition by sys_id, item_id

Live test: http://sqlfiddle.com/#!18/0174b/12

with gap_detector as
(
     select
        sys_id, item_id,
        date_from, date_to,
        case when 
            lag(date_to) 
            over(partition by sys_id, item_id order by date_from) >= date_from
        then
            0
        else
            1
        end as gap
     from tbl
 )
 , grouper as
 (
     select
         sys_id, item_id,
         date_from, date_to,
         gap,
         sum(gap) over(partition by sys_id, item_id order by date_from) as grp
     from gap_detector   
)  
select sys_id, item_id, date_from, date_to, gap, grp
from grouper

Output:

| sys_id | item_id |  date_from |    date_to | gap | grp |
|--------|---------|------------|------------|-----|-----|
|      1 |       1 | 2019-01-01 | 2019-01-20 |   1 |   1 |
|      1 |       1 | 2019-01-15 | 2019-02-10 |   0 |   1 |
|      1 |       1 | 2019-02-15 | 2019-02-20 |   1 |   2 |
|      1 |       1 | 2019-02-18 | 2019-03-10 |   0 |   2 |
|      1 |       1 | 2019-03-10 | 2019-03-22 |   0 |   2 |
|      1 |       2 | 2019-01-01 | 2019-01-10 |   1 |   1 |
|      1 |       2 | 2019-01-15 | 2019-01-25 |   1 |   2 |

Finally, now that we are able to identify which islands belong to each other (denoted by grp), it's just a matter of doing a group by on those grp markers to identify when date_from and date_to started on each group (grp) of islands.

Live test: http://sqlfiddle.com/#!18/0174b/13

select
    sys_id, item_id,
    min(date_from) as date_from,
    max(date_to) as date_to
from grouper
group by sys_id, item_id, grp

Output:

| sys_id | item_id |  date_from |    date_to |
|--------|---------|------------|------------|
|      1 |       1 | 2019-01-01 | 2019-02-10 |
|      1 |       1 | 2019-02-15 | 2019-03-22 |
|      1 |       2 | 2019-01-01 | 2019-01-10 |
|      1 |       2 | 2019-01-15 | 2019-01-25 |
Community
  • 1
  • 1
Michael Buen
  • 38,643
  • 9
  • 94
  • 118
  • I'm not a SQL Server guy, so sorry if this is a silly question. Is `sum(x) over (...` a running total? Like `sum(gap) over(partition by sys_id, item_id order by date_from) as grp`. It sure looks like it in your output. – Andrew Apr 30 '19 at 20:59
  • @Andrew Indeed, it's doing a running total. You just need a field that is always increasing in value (e.g., `date_from`) to `order by` upon, Otherwise, the `sum(x) over(...` running total idiom would not work. See the non-working running total on the first query of this answer, it's using an `order by` on a field that is non-increasing: https://stackoverflow.com/questions/55877663/postgresql-get-relative-average-with-group-by/55878800#55878800 – Michael Buen May 01 '19 at 00:03
  • Excellent, thank you. That same SQL doesn't work as a running sum in Teradata, where I was testing this.It requires `rows unbounded preceding` to work as running sum. – Andrew May 01 '19 at 14:16
  • @Andrew Tried `rows unbounded preceding`, it works on Sql Server, same output. Looks like `rows unbounded preceding` is the default windows frame if no windows frame is specified – Michael Buen May 01 '19 at 22:54