Can I avoid network shuffle when creating a KeyedStream from a pre-partitioned Kinesis Data Stream in Apache Flink?

Question

Is it possible to create a KeyedStream from a pre-sharded/pre-partitioned Kinesis Data Stream without the need for a network shuffle (i.e. using reinterpretAsKeyedStream or something similar)?

If that is not possible (i.e. the only reliable is to consume from Kinesis and then use keyBy), then is network shuffling at least minimized by doing a keyBy on a the field that the source is sharded by (e.g. env.addSource(source).keyBy(pojo -> pojo.getTransactionId()), where the source is a kinesis data stream that is sharded by transactionId)
If the above is possible, what are the limitations?

What I've Learned so Far

The functionality I am describing is already implemented by reinterpretAsKeyedStream, but this feature is experimental and seems to have significant drawbacks (as per discussions in the stackoverflow posts below)
In addition to the above, all the discussions related to reinterpretAsKeyedStream that I've found are in the context of Kafka, so I'm not sure how the outcomes differ for a Kinesis Data Stream

Context of my Application

Re. configurations: both the Kinesis Data Stream and Flink will be hosted serverlessly, and automatically scale up/down depending on load (which as I understand it, means that reinterpretAsKeyedStream cannot be used)

Any help/insight is much appreciated, thanks!

I'm curious why you're worried about this. Are you encountering performance issues caused by the `.keyBy()` shuffle? — kkrugler, Feb 09 '23 at 17:49
I am not encountering performance issues, but rather looking into steps to make my app as performant as possible. Logically, there should be no need for the shuffle to happen, so I was hoping this could be a way to boost performance — r_g_s_, Feb 09 '23 at 19:54

score 2 · Accepted Answer · answered Feb 09 '23 at 17:49

I don't believe there's any way to easily do what you want, at least not in a way that's resilient to changes in the parallelism of your source and your cluster. I have used helicopter stunts to achieve something similar to this, but it involved fragile code (depends on exactly how Flink handles partitioning).

Can I avoid network shuffle when creating a KeyedStream from a pre-partitioned Kinesis Data Stream in Apache Flink?

1 Answers1