
Source: Kinesis data stream

Sink: Elasticsearch

Both are AWS services, and I'm running my Flink job as an AWS Kinesis Data Analytics application.

I am facing an issue with Flink's windowing. My job looks like this:

DataStream<TrackingData> input = ...; // input from the Kinesis stream
input.keyBy(e -> e.getArea())
     .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
     .reduce(new MyReduceFunction(), new MyProcessWindowFunction())
     .addSink(<elasticsearch sink>);
private static class MyReduceFunction implements ReduceFunction<TrackingData> {
    @Override
    public TrackingData reduce(TrackingData trackingData, TrackingData t1) throws Exception {
        // Accumulate the incoming event's videoDuration into the running total
        trackingData.setVideoDuration(trackingData.getVideoDuration() + t1.getVideoDuration());
        return trackingData;
    }
}
private static class MyProcessWindowFunction extends ProcessWindowFunction<TrackingData, TrackingData, String, TimeWindow> {
    @Override
    public void process(String key,
                        Context context,
                        Iterable<TrackingData> in,
                        Collector<TrackingData> out) {
        // Emit one event per window carrying the summed videoDuration
        TrackingData trackingIn = in.iterator().next();

        Long videoDuration = 0L;
        for (TrackingData t : in) {
            videoDuration += t.getVideoDuration();
        }
        trackingIn.setVideoDuration(videoDuration);
        out.collect(trackingIn);
    }
}

Sample event:

{"area":"sessions","userId":4450,"date":"2021-12-03T11:00:00","videoDuration":5} 

What I do here: the Kinesis stream delivers these events in large volume, I want to sum videoDuration over every 10-second window, and then I want to store that single aggregated event in Elasticsearch.

In Kinesis there can be 10,000 events per second. I don't want to store all 10,000 events in Elasticsearch; I just want to store one event for every 10 seconds.

The issue is that when I send an event to this job, it processes the event immediately and sinks it straight into Elasticsearch. What I want instead: for each 10-second window the videoDuration should keep accumulating, and after the 10 seconds only one event should be stored in Elasticsearch.

How can I achieve this?

Rohit

1 Answer


I think you've misdiagnosed the problem.

The code you've written will produce one event from each 10-second-long window for each distinct key that has events during the window. MyProcessWindowFunction isn't having any effect: since the window results have been pre-aggregated, each Iterable will contain exactly one event.

I believe you want to do this instead:

input.keyBy(e -> e.getArea())
     .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
     .reduce(new MyReduceFunction())
     .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
     .reduce(new MyReduceFunction())
     .addSink(<elasticsearch sink>);

You could also just do

input.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
     .reduce(new MyReduceFunction())
     .addSink(<elasticsearch sink>);

but the first version will be faster, since it can compute the per-key window results in parallel before computing the global sum in the windowAll (which always runs with a parallelism of one).

FWIW, the Table/SQL API is usually a better fit for this type of application, and should produce a more optimized pipeline than either of these.
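
For example, here's a rough sketch of the same 10-second aggregation with Flink SQL (assuming Flink 1.13+; env and input are the StreamExecutionEnvironment and DataStream<TrackingData> from your job, and the tracking view and proctime column are names I'm introducing for the example):

import org.apache.flink.table.api.Schema;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// Register the stream as a view with a processing-time attribute
tEnv.createTemporaryView(
        "tracking",
        tEnv.fromDataStream(
                input,
                Schema.newBuilder()
                        .columnByExpression("proctime", "PROCTIME()")
                        .build()));

// One row per 10-second tumbling window, summed across all keys
Table result = tEnv.sqlQuery(
        "SELECT TUMBLE_START(proctime, INTERVAL '10' SECOND) AS windowStart, "
      + "       SUM(videoDuration) AS videoDuration "
      + "FROM tracking "
      + "GROUP BY TUMBLE(proctime, INTERVAL '10' SECOND)");

The planner can then choose an efficient physical plan on its own, instead of you wiring the two-stage aggregation together by hand.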

David Anderson
  • Thank you, David! The first solution worked, but I don't understand why we need two reduces and two windows. Can you please elaborate? – Rohit Dec 14 '21 at 06:21
  • The first layer of windowing is separately computing the sum for each key (and doing so in parallel), while the second reducing window is computing the overall sum across all keys (which cannot be done in parallel). – David Anderson Dec 14 '21 at 08:54
  • Thank you, David. I have one more related question: https://stackoverflow.com/questions/70411797/flink-processing-event-too-slow. Would you mind answering that as well? – Rohit Dec 19 '21 at 13:29