I am developing a DataStream-based Flink application for a high-volume streaming use case (tens of millions of events per second). The data is consumed from a Kafka topic and is already sharded by a certain key. My intention is to maintain key-specific state on the Flink side to run custom analytics. The problem I can't wrap my head around is how to create that keyed state without the full network reshuffle of the incoming data that `keyBy()` imposes.
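For reference, here is roughly what the job looks like today, simplified down to a sketch. The broker address, topic name, the "key is the text before the first comma" convention, and the `map()` body are placeholders, not my real code:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CurrentJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")   // placeholder
                .setTopics("events")                   // placeholder
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        // keyBy() gives me access to keyed state, but it repartitions every
        // event over the network, even though the topic is already sharded
        // by exactly this key.
        events.keyBy(line -> line.substring(0, line.indexOf(',')))
              .map(event -> event) // stand-in for the custom analytics
              .print();

        env.execute("current-job-with-shuffle");
    }
}
```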
I can guarantee that the maximum parallelism of the Flink job will be less than or equal to the number of partitions in the source Kafka topic, so logically no shuffle is necessary. The answer to this StackOverflow question suggests that it may be possible to write the data to Kafka in a way that matches Flink's internal key-group assignment and then use `DataStreamUtils.reinterpretAsKeyedStream()`. I would be happy to do that for this application. Could someone share the necessary steps?
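To show where I currently stand: based on my reading of the Flink sources, I believe the producer would need a custom Kafka partitioner that reproduces Flink's key-to-key-group-to-operator-index math. The sketch below is only my unverified understanding; the `MAX_PARALLELISM` constant is an assumption that would have to match the Flink job's configured max parallelism, and the topic's partition count would have to equal the job's parallelism:

```java
import java.util.Map;

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Sketch of a producer-side partitioner that mimics Flink's
// key -> key group -> operator index assignment.
public class FlinkCompatiblePartitioner implements Partitioner {

    // Assumption: must equal the Flink job's configured max parallelism.
    private static final int MAX_PARALLELISM = 128;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Reuse Flink's own assignment (murmur hash of key.hashCode(),
        // modulo max parallelism, mapped to an operator index) so producer
        // and job agree. The key object here must hash identically to the
        // key the job's KeySelector extracts (e.g. the same String).
        return KeyGroupRangeAssignment.assignKeyToParallelOperator(
                key, MAX_PARALLELISM, numPartitions);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

On the Flink side, I presume the shuffle-free keyed stream would then be obtained like this (same key selector as above):

```java
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// Experimental API: tells Flink the stream is already partitioned exactly
// as keyBy() with this selector would have partitioned it.
KeyedStream<String, String> keyed = DataStreamUtils.reinterpretAsKeyedStream(
        events, line -> line.substring(0, line.indexOf(',')));
```

What I am unsure about is whether the Kafka source guarantees that subtask i reads partition i (without that, the operator index the partitioner computes would not match the subtask that actually consumes the partition), and whether any of this survives rescaling. Pointers on exactly these steps are what I'm after.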
Thank you in advance.