
Recently I have been trying to use Apache Flink for fast batch processing. I have a table with a value column and an irrelevant index column.

Basically, I want to calculate the mean and range of every 5 rows of value, and then calculate the mean and standard deviation of those group means. So I guess the best way is to use a Tumble window.

It looks like this

DataSet<Tuple2<Double, Integer>> rawData = {get the source data};
Table table = tableEnvironment.fromDataSet(rawData);
Table groupedTable = table
            .window(Tumble.over("5.rows").on({what should I write?}).as("w"))
            .groupBy("w")
            .select("f0.avg, f0.max-f0.min");

{The next step is to use groupedTable to calculate overall mean and stdDev} 

But I don't know what to write in .on(). I tried "proctime", but it said there is no such input. I just want it to group rows in the order they are read from the source, but the argument has to be a time attribute, so I cannot use "f2" (the index column) for ordering either.

Do I have to add a timestamp to do this? Is it necessary in batch processing and will it slow down the calculation? What is the best way to solve this?

Update: I tried to use a sliding window in the Table API and it gives me an exception.

// Calculate mean value in each group
    Table groupedTable = table
            .groupBy("f0")
            .select("f0.cast(LONG) as groupNum, f1.avg as avg")
            .orderBy("groupNum");

//Calculate moving range of group Mean using sliding window
    Table movingRangeTable = groupedTable
            .window(Slide.over("2.rows").every("1.rows").on("groupNum").as("w"))
            .groupBy("w")
            .select("groupNum.max as groupNumB, (avg.max - avg.min) as MR");

The Exception is:

Exception in thread "main" java.lang.UnsupportedOperationException: Count sliding group windows on event-time are currently not supported.
    at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.createEventTimeSlidingWindowDataSet(DataSetWindowAggregate.scala:456)
    at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.translateToPlan(DataSetWindowAggregate.scala:139)
    ...

Does that mean that sliding windows are not supported in the Table API? If I recall correctly, there is no window function in the DataSet API. Then how do I calculate a moving range in batch processing?

Jin.J

1 Answer


The window clause is used to define a grouping based on a window function, such as Tumble or Session. Grouping every 5 rows is not well defined in the Table API (or SQL) unless you specify the order of the rows. This is done in the on clause of the Tumble function. Since this feature originates from stream processing, the on clause expects a timestamp attribute.

You can fetch the current time as a timestamp using the currentTimestamp() function. However, I should point out that Flink will sort the data because it is not aware of the monotonic property of the function. Moreover, all of that will run with a parallelism of 1 because there is no clause that would allow for partitioning.
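For example, a minimal sketch of that approach could look like the following (the `ts` alias and the renamed output columns are only illustrative; `rawData` and `tableEnvironment` are the ones from the question):

import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.Tumble;

// Attach a processing-time timestamp to every row, then window on it.
Table withTs = tableEnvironment
        .fromDataSet(rawData)
        .select("f0, f1, currentTimestamp() as ts");

Table groupedTable = withTs
        .window(Tumble.over("5.rows").on("ts").as("w"))
        .groupBy("w")
        .select("f0.avg as groupAvg, (f0.max - f0.min) as groupRange");

As noted above, the optimizer does not know that `ts` is monotonically increasing, so expect a full sort before the window operator.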

Alternatively, you can also implement a user-defined scalar function that converts the index attribute into a timestamp (effectively a Long value). But again, Flink will do a full sort of the data.
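A rough sketch of the scalar-function variant; the class name `IndexToTimestamp`, the registered name `toTs`, and the assumption that `f1` is the index column are only illustrative:

import org.apache.flink.table.functions.ScalarFunction;

// Interprets the integer index column as milliseconds since the epoch.
public static class IndexToTimestamp extends ScalarFunction {
    public java.sql.Timestamp eval(Integer index) {
        return new java.sql.Timestamp(index.longValue());
    }
}

// Registration and use:
tableEnvironment.registerFunction("toTs", new IndexToTimestamp());

Table groupedTable = table
        .select("f0, toTs(f1) as ts")
        .window(Tumble.over("5.rows").on("ts").as("w"))
        .groupBy("w")
        .select("f0.avg, f0.max - f0.min");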

Fabian Hueske
  • Thank you for answering! So in this case do you suggest that I should add an index column and convert it to Long? Right now I am using zipWithIndex() on the DataSet to apply the index, and I assign a group number to each row using a self-defined flatMap function. – Jin.J Jun 28 '18 at 04:09
  • If you are coming from the DataSet API, you might want to have a look at the `MapPartitionFunction`, which can be used for this computation as well and would avoid the full sort. In that case, you would not need the Table API. – Fabian Hueske Jun 28 '18 at 07:40
  • Hi, I have tried to convert my groupNumber from INTEGER to LONG and to use it as the time attribute in the window. However, I get this exception: Exception in thread "main" java.lang.UnsupportedOperationException: Count sliding group windows on event-time are currently not supported. Could you please suggest whether I have written something wrong? I will post my code at the end of the question. Thank you – Jin.J Jul 17 '18 at 04:51
  • Oh, yes, it seems that event-time count windows are not yet supported for batch Table API / SQL queries. I'd recommend implementing this with a `MapPartitionFunction` (sketched below), which should be much more efficient if the data is already sorted. – Fabian Hueske Jul 17 '18 at 11:33
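A minimal sketch of the `MapPartitionFunction` idea from the comments above. It assumes the per-group means have already been computed and are produced in ascending group order as a DataSet<Tuple2<Long, Double>> of (groupNum, avg); the variable names are illustrative:

import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Moving range over consecutive group means: |avg(i) - avg(i-1)|.
DataSet<Tuple2<Long, Double>> movingRange = sortedGroupMeans
        .mapPartition(new MapPartitionFunction<Tuple2<Long, Double>, Tuple2<Long, Double>>() {
            @Override
            public void mapPartition(Iterable<Tuple2<Long, Double>> values,
                                     Collector<Tuple2<Long, Double>> out) {
                Tuple2<Long, Double> previous = null;
                for (Tuple2<Long, Double> current : values) {
                    if (previous != null) {
                        // Emit the range under the later group number.
                        out.collect(Tuple2.of(current.f0, Math.abs(current.f1 - previous.f1)));
                    }
                    previous = current;
                }
            }
        })
        .setParallelism(1);   // one partition, so consecutive pairs are never split

Running the operator with parallelism 1 keeps all group means in a single partition, which is what makes the pairwise comparison well defined; the overall mean and standard deviation of the group means can then be computed with ordinary aggregations.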