0

I have a time series data in Impala that in this format.

One record get created when and only when there is a change, updated value represents the new data.

---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A        | 2017 | 1    | 5            |
|A        | 2017 | 5    | 10           |
|A        | 2017 | 20   | 80           |
|B        | 2017 | 8    | 90           |
|...      | ...  | ...  | ...          |
---------------------------------------

Assuming that our time window is 2017 full year from week one to week 52. The data above says that the value got changed for product A at the first week to be 5, 5'th week to be 10 and 20th week to be 80. I want to fill in the missing value by using the last observation carry forward logic, and if the data doesn't start from the first week, then also fill in the leading missing values with the next occurrence.

This should be the ideal output.

---------------------------------------
| Product | Year | Week | UpdatedValue |
---------------------------------------
|A        | 2017 | 1    | 5            |
|A        | 2017 | .    | 5            |
|A        | 2017 | 4    | 5            |
|A        | 2017 | 5    | 10           |
|A        | 2017 | 6    | 10           |
|A        | 2017 | .    | 10           |
|A        | 2017 | 20   | 80           |
|A        | 2017 | .    | 80           |
|A        | 2017 | 52   | 80           |
|B        | 2017 | 1    | 90           |
|B        | 2017 | .    | 90           |
|B        | 2017 | 8    | 90           |
|B        | 2017 | .    | 90           |
|B        | 2017 | 52   | 90           |
|...      | ...  | ...  | ...          |
---------------------------------------

Is there a feasible solution Hive/Impala you can certain use their advanced analytics functions if that is easier? However, if a generic SQL solution is available, that will be great.

Vamsi Prabhala
  • 48,685
  • 4
  • 36
  • 58
B.Mr.W.
  • 18,910
  • 35
  • 114
  • 178

1 Answers1

1

Step 1: Create a table with numbers.

create table if not exists tblNumbers
location 'hdfs_location' as 
select 1 as num
union all 
select 2 as num
....

Step 2: Cross join the unique products,year with numbers and left join the original table to generate missing rows for a product. Then use a running sum logic to generate groups to group successive missing value rows, after which you can use a max to generate the values for missing rows from the last found value.

select product,year,week,max(val) over(partition by product,year,grp) as new_val
from (select py.product,py.year,n.week,t.val
      ,sum(case when t.val is not null then 1 else 0 end) 
       over(partition by py.product,py.year order by n.week) as grp
      from tblNumbers n
      cross join (select distinct product,year from tbl) py
      left join tbl t on n.week = t.week and py.product = t.product and py.year = t.year
     ) t
Vamsi Prabhala
  • 48,685
  • 4
  • 36
  • 58
  • max(val) is referring to a column that doesn't exist, should the value also be included before the sum? – B.Mr.W. Nov 27 '18 at 18:40