I am using a Kinesis data stream as a source and Elasticsearch as a sink. The Flink job runs in an AWS Kinesis Data Analytics application.
Sample event:
{"area":"sessions","userId":4450,"date":"2021-12-03T11:00:00","videoDuration":5}
I collect these video-watching events from the front end: while a video is playing, one event is sent every 5 seconds per user. The events are used to calculate each user's watch time.
If one user is watching a video, the front end generates this event every 5 seconds and ingests it into the Kinesis data stream. With 10,000 users watching, that is 12 events per user per minute, so 120,000 events are generated in one minute.
My Flink job takes about 4 minutes to process these 120,000 events, which is far too long.
How can I improve the performance of the job? I need the processing to finish within 1 minute.
My job looks like this:
stream
    .keyBy(e -> e.getUserId())
    .timeWindow(Time.seconds(60))
    .reduce(new MyReduceFunction()) // sum of video duration per user
    .map(<enrich event using some data from redis>)
    .addSink(<elasticsearch sink>);
// Reduce function
import org.apache.flink.api.common.functions.ReduceFunction;

private static class MyReduceFunction implements ReduceFunction<TrackingData> {
    @Override
    public TrackingData reduce(TrackingData trackingData, TrackingData t1) throws Exception {
        // Accumulate the second event's duration into the first and reuse it as the running total.
        trackingData.setVideoDuration(trackingData.getVideoDuration() + t1.getVideoDuration());
        return trackingData;
    }
}
So the job first receives events from the Kinesis data stream and keys the stream by userId. It then sums videoDuration per user over a 1-minute window. The result goes to an enrichment function, which reads some data from Redis and enriches the event, and finally the event is written to Elasticsearch.
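To make the intended result of the keyBy + reduce step concrete, here is a minimal plain-Java sketch of what one 1-minute window should produce. It does not use Flink at all; the WatchTimeSketch class, its Event type, and the sumPerUser method are hypothetical names introduced only for this illustration, and Event stands in for the two TrackingData fields that matter here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; no Flink classes are involved.
final class WatchTimeSketch {

    // Hypothetical stand-in for the TrackingData fields used by the job.
    static final class Event {
        final int userId;
        final int videoDuration; // seconds reported by one front-end event

        Event(int userId, int videoDuration) {
            this.userId = userId;
            this.videoDuration = videoDuration;
        }
    }

    // Sums videoDuration per userId over the events of a single window,
    // i.e. the per-key total that the windowed reduce is meant to emit.
    static Map<Integer, Integer> sumPerUser(List<Event> windowEvents) {
        Map<Integer, Integer> totals = new LinkedHashMap<>();
        for (Event e : windowEvents) {
            totals.merge(e.userId, e.videoDuration, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two events for user 4450 and one for user 7 within the same window.
        List<Event> window = Arrays.asList(
                new Event(4450, 5), new Event(4450, 5), new Event(7, 5));
        System.out.println(sumPerUser(window));
    }
}
```

So for each window the job should emit one record per user, and only those aggregated records (not the raw 120,000 events) go through the Redis enrichment and the Elasticsearch sink.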
I have tried increasing the parallelism of the job, but it performs best at a parallelism of 1, which gives the ~4 minutes mentioned above. With higher parallelism (I tried 2, 4, 8, 16, etc.) it takes even longer, which seems strange. Shouldn't increasing the parallelism make processing faster?
Can anyone point out what I am missing or doing wrong in this Flink job? What do I need to do to process these events within 1 minute?