0

I have a tricky gaps and islands problem. Islands of dates must be identified across a hierarchy of fields.

This gaps and islands problem differs from the traditional type in a few ways:

  1. The dates are in a range style format in two fields (StartDate and EndDate)
  2. The grouping of islands must take into account a hierarchy of fields
  3. These data are in Cloudera 6.2x which limits some of the potential solutions (recursive cte for example)

If the EndDate of the hierarchy is on the same day or day before the StartDate of the next occurence, this should be an island. How can I accomplish this given my constraints and the sample data?

I have attempted a few solutions, including the follow pseudocode:

  1. create a row_number over a partition of the hierarchy fields ordered by those hierarchies and the StartDate.

  2. identify gaps by using case logic when days_add(end_date,1) >= lead(start_date,1) and row_number < lead(row_number,1) then 0 else 1

  3. identify islands by using case logic: when row_number = 1 then 1 when lag(gaps,1) = 1 then lag(islands,1) + 1 else lag(islands,1)

This should logically work to create the islands I can use to group and take the min of the startDate and the max of the EndDate grouped by the hierarchy. When I put the data in excel and apply this algorithm, I get the correct results. Apparently, in Cloudera there is no lag(islands,1) value during the evaluation of each record?

CREATE TABLE sampleInput (
  person int, 
  level1 int, 
  level2 int, 
  level3 int, 
  StartDate DATETIME, 
  EndDate DATETIME,
  rowNumber int)

INSERT INTO sampleInput
VALUES
(1,1,17,101,'2001-09-16','2001-09-19',1),
(1,1,17,102,'2001-09-20','2001-09-24',2),
(1,1,17,103,'2001-04-15','2001-04-25',3),
(1,1,17,104,'2001-08-02','2001-08-15',4),
(1,1,20,105,'2001-03-10','2001-03-18',5),
(1,1,20,105,'2001-04-01','2001-04-08',6),
(1,1,20,105,'2001-07-20','2001-07-25',7),
(1,1,20,106,'2001-02-20','2001-02-08',8),
(1,1,31,107,'2001-04-25','2001-04-30',9),
(1,1,31,107,'2001-05-01','2001-05-29',10),
(2,3,42,111,'2002-04-05','2002-04-05',11),
(2,3,42,111,'2002-04-06','2002-04-06',12),
(2,3,42,111,'2002-04-07','2002-04-30',13),
(2,3,42,111,'2002-05-01','2002-05-25',14),
(2,3,42,111,'2002-05-28','2002-06-01',15),
(2,3,42,111,'2002-06-04','2002-06-06',16),
(2,3,42,111,'2002-06-08','2002-06-20',17)

And:

CREATE TABLE sampleOutput (
  person int, level1 int, level2 int, level3 int, 
  StartDate DATETIME, EndDate DATETIME ,rowNumberConcat varchar(max)
)

INSERT INTO sampleOutput
VALUES
(1,1,17,101,'2001-09-16','2001-09-19','1'),
(1,1,17,102,'2001-09-20','2001-09-24','2'),
(1,1,17,103,'2001-04-15','2001-04-25','3'),
(1,1,17,104,'2001-08-02','2001-08-15','4'),
(1,1,20,105,'2001-03-10','2001-03-18','5'),
(1,1,20,105,'2001-04-01','2001-04-08','6'),
(1,1,20,105,'2001-07-20','2001-07-25','7'),
(1,1,20,106,'2001-02-20','2001-02-08','/'),
(1,1,31,107,'2001-04-25','2001-05-29','9,10'),
(2,3,42,111,'2002-04-05','2002-05-25','11,12,13,14'),
(2,3,42,111,'2002-05-28','2002-06-01','15'),
(2,3,42,111,'2002-06-04','2002-06-06','16'),
(2,3,42,111,'2002-06-08','2002-06-20','17')

I have included input and output in the following SQL fiddle: http://www.sqlfiddle.com/#!18/770768/1 The included rowNumber and rowNumberConcat fields are included only to show lineage of the input and output. The fiddle is set up for MSSQL only because there is no Cloudera option. As a side note, if I should have used a different flavour or a completely different site, please let me know.

The Impaler
  • 45,731
  • 9
  • 39
  • 76

0 Answers0