I have a hive table with the following structure:
id1, id2, year, value
1, 1, 2000, 20
1, 1, 2002, 23
1, 1, 2003, 24
1, 2, 1999, 34
1, 2, 2000, 35
1, 2, 2001, 37
2, 3, 2005, 50
2, 3, 2006, 56
2, 3, 2008, 60
I have 2 ids which identify the 'user', and for each user and year I have a value, but there are years with no values which do not appear in the table. I would like to add for each id [id1,id2] and year (considering all the years between the minimum and maximum year) a value, using the previous year value in case a year does not exists. So the table should become:
id1, id2, year, value
1, 1, 2000, 20
1, 1, 2001, 20
1, 1, 2002, 23
1, 1, 2003, 24
1, 2, 1999, 34
1, 2, 2000, 35
1, 2, 2001, 37
2, 3, 2005, 50
2, 3, 2006, 56
2, 3, 2007, 56
2, 3, 2008, 60
I need to do that in hive or pig, or in the worst case I could go with spark
thanks,