0

I have a hive table with the following structure:

id1, id2, year, value 
1, 1, 2000, 20
1, 1, 2002, 23
1, 1, 2003, 24
1, 2, 1999, 34
1, 2, 2000, 35
1, 2, 2001, 37
2, 3, 2005, 50
2, 3, 2006, 56
2, 3, 2008, 60

I have 2 ids which identify the 'user', and for each user and year I have a value, but there are years with no values which do not appear in the table. I would like to add for each id [id1,id2] and year (considering all the years between the minimum and maximum year) a value, using the previous year value in case a year does not exists. So the table should become:

id1, id2, year, value 
1, 1, 2000, 20
1, 1, 2001, 20
1, 1, 2002, 23
1, 1, 2003, 24
1, 2, 1999, 34
1, 2, 2000, 35
1, 2, 2001, 37
2, 3, 2005, 50
2, 3, 2006, 56
2, 3, 2007, 56
2, 3, 2008, 60

I need to do that in hive or pig, or in the worst case I could go with spark

thanks,

Marc G
  • 1
  • 1

2 Answers2

0

This is best achieved if years can be stored as a table.

create table dbname.years 
location 'hdfs_location' as
select 2000 as yr union all select 2001 as yr --include as many years as possible

1) With this table in place, the id's can be cross joined to generate all year combinations and then left join ing the original table.

2) Then classify rows into groups, so a null value from the previous step (year missing from the original table for the id's) gets assigned the same group as previous non-null value. This is accomplished with a running sum. Run the sub-query to see how groups are assigned.

3) Thereafter, select the max for each id1,id2,group combination.

select id1,id2,yr,max(val) over(partition by id1,id2,grp) as val
from (select i.id1,i.id2,y.yr,t.val
      ,sum(case when t.val is null then 0 else 1 end) 
       over(partition by i.id1,i.id2 order by y.yr) as grp
      from (select distinct id1,id2 from tbl) i
      cross join (select yr from years) y
      left join tbl t on i.id1=t.id1 and i.id2=t.id2 and y.yr=t.yr
     ) t
Vamsi Prabhala
  • 48,685
  • 4
  • 36
  • 58
0

I would do this on using a temporary table. The year per id1 and id2 varies so I will create a series of years per id1, id2 instead of creating a series of years for all. 1) get the min year and max year per id1, id2. Call this series_dtes table 2) do a left join to the table at hand (I call it cal_date) 3) create a temp table out of combined series_dtes table and cal_date table. This will fill in the missing years per id1, id2 say 2001 and 2007. 4) fill in the missing values for 2001 and 2007 using lag function.

create table tmp as 
with  series_dtes as (
select id1, id2, (t.min_dt+pe.idx) as series_year
from (select id1, id2, min(year) as min_dt, max(year) as max_dt from cal_date group by id1, id2) t
lateral view posexplode(split(space(t.max_dt-t.min_dt),' ')) pe as idx, dte)
select dte.id1, dte.id2, dte.series_year, t.value
from series_dtes dte
left join cal_date  t
on dte.series_year=t.year and t.id1=dte.id1 and t.id2=dte.id2
order by dte.id1, dte.id2, dte.series_year;

select id1, id2, series_year as year, 
(case when value is null then (lag(value) over (partition by id1,id2 order by series_year)) else value end) as value
from tmp;

Result:
id1     id2     year    value
1       1       2000    20
1       1       2001    20
1       1       2002    23
1       1       2003    24
1       2       1999    34
1       2       2000    35
1       2       2001    37
2       3       2005    50
2       3       2006    56
2       3       2007    56
2       3       2008    60
jose_bacoy
  • 12,227
  • 1
  • 20
  • 38
  • this will give `null`s if there is more than one year missing in the data. unfortunately hive doesn't support `lag(..ignore nulls)` which Oracle supports for example. – Vamsi Prabhala Mar 21 '18 at 22:17