-2

I need help to get the desired results in SQL or Spark. My input is below Input Table

Key   Rate  Invoice Date
Key1    10  1/1/2017
key1    10  1/5/2017
key1    20  1/20/2017
key1    10  1/25/2017
Key2    30  2/1/2017

My desired output is below:

Key   Rate  start_date  end_date
key1    10  1/1/2017    1/5/2017
key1    20  1/20/2017   1/20/2017
key1    10  1/25/2017   1/25/2017
key2    30  2/1/2017    2/1/2017
zero323
  • 322,348
  • 103
  • 959
  • 935

1 Answers1

2

This is tricky. It is called a gap-and-islands problem and one way to solve it uses the difference of row numbers:

select key, rate, min(invoice_date), max(invoice_date)
from (select t.*,
             row_number() over (order by invoice_date) as seqnum,
             row_number() over (partition by key, rate order by invoice_date) as seqnum_kr,
      from t
     ) t
group by key, rate, (seqnum - seqnum_kr);

To understand how this works, run the subquery and you'll see how the difference in row numbers defines each group.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786