1

I want to calculate the rolling average looking forward on "4 day window" basis. Please find the details below

Create table stock(day int, time String, cost float);

Insert into stock values(1,"8 AM",3.1);
Insert into stock values(1,"9 AM",3.2);
Insert into stock values(1,"10 AM",4.5);
Insert into stock values(1,"11 AM",5.5);
Insert into stock values(2,"8 AM",5.1);
Insert into stock values(2,"9 AM",2.2);
Insert into stock values(2,"10 AM",1.5);
Insert into stock values(2,"11 AM",6.5);
Insert into stock values(3,"8 AM",8.1);
Insert into stock values(3,"9 AM",3.2);
Insert into stock values(3,"10 AM",2.5);
Insert into stock values(3,"11 AM",4.5);
Insert into stock values(4,"8 AM",3.1);
Insert into stock values(4,"9 AM",1.2);
Insert into stock values(4,"10 AM",0.5);
Insert into stock values(4,"11 AM",1.5); 

I wrote the below query

select day, cost,sum(cost) over (order by day range between current row and 4 Following), avg(cost) over (order by day range between current row and 4 Following) 
from stock

As you can see, I get 4 records for each day and I need to calculate rolling average on 4 day window. For this, I wrote the above window query, as I have the data only for 4 days each day containing 4 records, the sum for the first day will be the total of all the 16 records. Based on this the first record will have the sum of 56.20 which is correct and the average should be 56.20/4 (as there are 4 days), but it is doing 56.20/16 as there 16 records in total. How do I fix the average part of this?

Thanks Raj

Raj
  • 401
  • 6
  • 20

1 Answers1

1

Is this what you want?

select t.*,
       avg(cost) over (order by day range between current row and 4 following)
from t;

EDIT:

What you seem to want is:

select t.*,
       (sum(cost) over (order by day range between current row and 3 following) /
        count(distinct day) over (order by day range between current row and 3 following)
       )
from t;

But, Hive does not support this. You can use a subquery for this purpose:

select t.*,
       (sum(cost) over (order by day range between current row and 3 following) /
        sum(case when seqnum = 1 then 1 else 0 end) over (order by day range between current row and 3 following)
       )
from (select t.*
             row_number() over (partition by day order by time) as seqnum
      from t
     )t
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Hi, as you can see in the query. I have already tried this. The problem with this is, it will do 56/16 (4*4), but I want it to be 56/4. I need the avg to be on per day not based on all the records of each day(4*4). Any Idea? – Raj Nov 29 '19 at 03:25
  • Thanks for quick reply Gordon, you really saved my day! – Raj Nov 29 '19 at 03:40