1

Is it possible to create a KeyedStream from a pre-sharded/pre-partitioned Kinesis Data Stream without the need for a network shuffle (i.e. using reinterpretAsKeyedStream or something similar)?

  • If that is not possible (i.e. the only reliable is to consume from Kinesis and then use keyBy), then is network shuffling at least minimized by doing a keyBy on a the field that the source is sharded by (e.g. env.addSource(source).keyBy(pojo -> pojo.getTransactionId()), where the source is a kinesis data stream that is sharded by transactionId)
  • If the above is possible, what are the limitations?

What I've Learned so Far

Context of my Application

  • Re. configurations: both the Kinesis Data Stream and Flink will be hosted serverlessly, and automatically scale up/down depending on load (which as I understand it, means that reinterpretAsKeyedStream cannot be used)

Any help/insight is much appreciated, thanks!

r_g_s_
  • 224
  • 1
  • 8
  • I'm curious why you're worried about this. Are you encountering performance issues caused by the `.keyBy()` shuffle? – kkrugler Feb 09 '23 at 17:49
  • I am not encountering performance issues, but rather looking into steps to make my app as performant as possible. Logically, there should be no need for the shuffle to happen, so I was hoping this could be a way to boost performance – r_g_s_ Feb 09 '23 at 19:54

1 Answers1

2

I don't believe there's any way to easily do what you want, at least not in a way that's resilient to changes in the parallelism of your source and your cluster. I have used helicopter stunts to achieve something similar to this, but it involved fragile code (depends on exactly how Flink handles partitioning).

kkrugler
  • 8,145
  • 6
  • 24
  • 18