I'm using the Apache Flink Streaming API to process a data file, and I only want the results from the last window. Is there a way to do this? If not, I thought I could filter on the maximum of the first field of the resulting tuple (the Long value).

SingleOutputStreamOperator<Tuple12<Long, String, String, Integer, String, Integer, String, Integer, String, Integer, String, Integer>> top5SlidingEventTimeWindowsFiltered =
        top5SlidingEventTimeWindows.filter(
                new FilterFunction<Tuple12<Long, String, String, Integer, String, Integer, String, Integer, String, Integer, String, Integer>>() {
                    @Override
                    public boolean filter(
                            Tuple12<Long, String, String, Integer, String, Integer, String, Integer, String, Integer, String, Integer> value)
                            throws Exception {
                        …
                    }
                });

In the filter above, I would like to keep only the tuples whose first field equals the maximum value of that field. Is that possible?

ekth0r

1 Answer

With the DataStream API, when a job consumes a finite source (such as a file), the source emits a final watermark with the value MAX_WATERMARK (a timestamp of Long.MAX_VALUE) once it reaches the end of its input. You can use this to detect that all of the input has been processed.

So in a case like yours, you can put a ProcessFunction after the windows and have it keep in state the latest results it has received so far. Set an event-time timer for MAX_WATERMARK, and when it fires, use what is then in state to produce the desired result.

This will have to be a KeyedProcessFunction, because otherwise you can't use timers. If the stream isn't keyed, you'll have to key it anyway; you can simply key by a constant, assuming you don't mind having a parallelism of one.
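The control flow can be tried out without a Flink cluster. What follows is a minimal plain-Java model of it, not Flink code: `LastWindowTracker`, `advanceWatermark`, `emitted`, `pendingTimers` and `END_OF_INPUT` are hypothetical names, the `String` results stand in for the real tuples, and the first field is assumed to be the window timestamp. In an actual job the bodies of `processElement` and `onTimer` would live in a KeyedProcessFunction, the fields would be keyed state, and Flink itself would manage timers and watermarks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Plain-Java stand-in (no Flink dependency) for "remember the latest window, emit it at end of input". */
final class LastWindowTracker {
    // Assumption: same value as the timestamp carried by Flink's MAX_WATERMARK.
    static final long END_OF_INPUT = Long.MAX_VALUE;

    // A sorted set, so registering the same timer repeatedly keeps a single entry.
    private final TreeSet<Long> timers = new TreeSet<>();
    // Stand-ins for keyed state: the newest window timestamp seen and its results.
    private long latestWindowTs = Long.MIN_VALUE;
    private final List<String> latest = new ArrayList<>();
    // Stand-in for the downstream collector.
    private final List<String> emitted = new ArrayList<>();

    /** Called once per window result; keeps only results of the newest window. */
    void processElement(long windowTs, String result) {
        if (windowTs > latestWindowTs) {
            latestWindowTs = windowTs;
            latest.clear();
        }
        if (windowTs == latestWindowTs) {
            latest.add(result);
        }
        timers.add(END_OF_INPUT); // safe to repeat for every element
    }

    /** Models the runtime advancing event time: fires every timer at or below the watermark. */
    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.first() <= watermark) {
            onTimer(timers.pollFirst());
        }
    }

    /** Fires at end of input: whatever is buffered now belongs to the last window. */
    void onTimer(long timestamp) {
        emitted.addAll(latest);
    }

    List<String> emitted() { return emitted; }

    int pendingTimers() { return timers.size(); }

    public static void main(String[] args) {
        LastWindowTracker tracker = new LastWindowTracker();
        tracker.processElement(1000L, "a");
        tracker.processElement(2000L, "b");
        tracker.processElement(2000L, "c");
        tracker.processElement(1000L, "late");  // older window, ignored
        tracker.advanceWatermark(5000L);        // ordinary watermark, timer does not fire
        System.out.println(tracker.emitted()); // prints []
        tracker.advanceWatermark(END_OF_INPUT); // end of the finite input
        System.out.println(tracker.emitted()); // prints [b, c]
    }
}
```

Nothing is emitted until the final watermark arrives, which is why a plain filter cannot do this on its own: while the stream is still running, no operator can know which window timestamp will turn out to be the maximum.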

David Anderson
  • And how can I set a timer for MAX_WATERMARK? – ekth0r Jun 18 '20 at 19:57
  • Register the timer in the `processElement` method of a process function via `ctx.timerService().registerEventTimeTimer(Watermark.MAX_WATERMARK.getTimestamp())`. Timers are de-duplicated, so you can call this for every event if you like. `onTimer` will be called when the timer fires. – David Anderson Jun 19 '20 at 08:18