Calculating consecutive range of dates with a value in Hive

Question

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of Id's and return the calculated value(s) of each one. Given the following data:

+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
|  1 |     8091 |    0.9 |
|  1 |     8092 |     20 |
|  1 |     8095 |   0.22 |
|  1 |     8096 |   0.23 |
|  1 |     8098 |   0.23 |
|  2 |     8095 |     12 |
|  2 |     8096 |     18 |
|  2 |     8097 |      3 |
|  2 |     8098 |   0.25 |
+----+----------+--------+

I want the following output:

+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
|  1 |                             1 |
|  1 |                             2 |
|  1 |                             1 |
|  2 |                             1 |
+----+-------------------------------+

In this case, the ranges are the consecutive days with credit less than 1. If there is a gap between date_key column, then the range won't have to take the next value, like in ID 1 between 8096 and 8098 date key. Is it possible to do this with windowing functions in Hive?

Thanks in advance!

score 0 · Answer 1 · answered Apr 18 '18 at 21:37

0

You can do this with a running sum classifying rows into groups, incrementing by 1 every time a credit<1 row is found(in the date_key order). Thereafter it is just a group by.

select id,count(*) as range_days_credit_lt_1
from (select t.*
      ,sum(case when credit<1 then 0 else 1 end) over(partition by id order by date_key) as grp
      from tbl t
     ) t
where credit<1
group by id

answered Apr 18 '18 at 21:37

Vamsi Prabhala

48,685
4
36
58

The problem with this solution is that returns a different output that the expected, which is one row per **consecutive** days with credit less than 1... In the example, for ID 1, this will be the rows, with value 1, 2, 1 – Lucas Mignone Apr 19 '18 at 11:58

score 0 · Answer 2 · answered Nov 27 '20 at 10:11

The key is to collapse all the consecutive sequence and compute their length, I struggled to achieve this in a relatively clumsy way:

with t_test as 
(
select num,row_number()over(order by num) as rn
from
(
select explode(array(1,3,4,5,6,9,10,15)) as num
)
)
select length(sign)+1 from
(
select explode(continue_sign) as sign
from 
(
select split(concat_ws('',collect_list(if(d>1,'v',d))), 'v') as continue_sign
from 
(
select t0.num-t1.num as d from t_test t0 
join t_test t1 on t0.rn=t1.rn+1
)
)
)

Get the previous number b in the seq for each original a;
Check if a-b == 1, which shows if there is a "gap", marked as 'v';
Merge all a-b to a string, and then split using 'v', and compute length.

To get the ID column out, another string which encode id should be considered.

Calculating consecutive range of dates with a value in Hive

2 Answers2