
My requirement is to compute aggregates over an incrementally growing window on a batch table.

For example, the first window contains 1 row, the second window contains 2 rows (the row from the 1st window plus a new row), the third window contains 3 rows (the 2 rows from the 2nd window plus a new row), and so on.

For example:

Source table:

| datetime | productId | price |
|----------|-----------|-------|
| 3-1      | p1        | 10    |
| 3-2      | p1        | 20    |
| 3-3      | p1        | 30    |
| 3-4      | p1        | 40    |

Result table:

| datetime | productId | average         |
|----------|-----------|-----------------|
| 3-1      | p1        | 10/1            |
| 3-2      | p1        | (10+20)/2       |
| 3-3      | p1        | (10+20+30)/3    |
| 3-4      | p1        | (10+20+30+40)/4 |
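In other words, the result is a running (cumulative) average per product. A minimal Python sketch of that semantics, using the column names from the tables above (this only illustrates the expected output, not a Flink implementation):

```python
# Running average per product: each output row's average covers all
# rows seen so far for that product, in datetime order.
rows = [
    ("3-1", "p1", 10),
    ("3-2", "p1", 20),
    ("3-3", "p1", 30),
    ("3-4", "p1", 40),
]

totals = {}  # productId -> (running sum, row count)
result = []
for dt, pid, price in rows:
    s, c = totals.get(pid, (0, 0))
    s, c = s + price, c + 1
    totals[pid] = (s, c)
    result.append((dt, pid, s / c))

for r in result:
    print(r)  # averages: 10.0, 15.0, 20.0, 25.0
```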

I am trying to find a way to implement this requirement in SQL. It seems that the OVER clause could do this, but it is not yet implemented in Flink for batch tables, so I need an alternative.
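For reference, in standard SQL this is an OVER clause with an unbounded preceding frame. The sketch below runs that query against SQLite (which supports window functions since 3.25) purely to illustrate the intended semantics; it says nothing about Flink's SQL support:

```python
import sqlite3

# In-memory table mirroring the source table above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (datetime TEXT, productId TEXT, price REAL)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [("3-1", "p1", 10), ("3-2", "p1", 20), ("3-3", "p1", 30), ("3-4", "p1", 40)],
)

# Cumulative average per product, ordered by datetime.
rows = conn.execute("""
    SELECT datetime, productId,
           AVG(price) OVER (
               PARTITION BY productId
               ORDER BY datetime
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS average
    FROM source
    ORDER BY datetime
""").fetchall()

for r in rows:
    print(r)  # averages: 10.0, 15.0, 20.0, 25.0
```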

BTW:

I tried using a TUMBLE window of 1 day and storing the previous value in a user-defined aggregation object, but this failed because the aggregation object is reused across all products rather than being a separate object per product.
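One way around that reuse problem is to key any carried state by productId inside the accumulator itself, instead of keeping a single scalar. A hypothetical sketch in plain Python (not Flink's AggregateFunction API), just to show the keyed-state idea:

```python
# Hypothetical accumulator whose state is a dict keyed by productId,
# so products never overwrite each other even if one object is reused.
class RunningAverage:
    def __init__(self):
        self.state = {}  # productId -> [running sum, row count]

    def accumulate(self, product_id, price):
        s = self.state.setdefault(product_id, [0, 0])
        s[0] += price
        s[1] += 1

    def get_value(self, product_id):
        total, count = self.state[product_id]
        return total / count

acc = RunningAverage()
for pid, price in [("p1", 10), ("p1", 20), ("p2", 5)]:
    acc.accumulate(pid, price)

print(acc.get_value("p1"))  # 15.0
print(acc.get_value("p2"))  # 5.0
```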

yinhua

1 Answer


The OVER clause on a batch table is not supported by Flink SQL yet. You can track the status of this effort here.

However, did you consider implementing this behavior on a streaming table instead? Streaming tables can also read from static files such as CSV files, and many operations are supported there as well. It depends on the other operations you want to use in your query, though.

twalthr
  • Thanks, I will try to see if streaming has all the functionality for my application. – yinhua Mar 27 '18 at 01:35
  • It looks infeasible; I have to do a join on two tables – yinhua Mar 27 '18 at 01:44
  • Is there any other workaround, such as a user-defined function, to solve it? – yinhua Mar 27 '18 at 02:17
  • Joins will be available in 1.5 for streaming but maybe not as performant as on batch. You could try an aggregation function (did you group by product id?). Otherwise I think you have to do it with the DataSet API for now. – twalthr Mar 27 '18 at 08:48
  • Yes, I group by productId. I tried to store the previous calculation result in the aggregation function, but it doesn't work because the accumulator object is reused; I also saw that windows are not processed in time order when parallelism is greater than 1 – yinhua Mar 27 '18 at 10:06