I'm benchmarking a Flink application that reads data from Kafka, transforms it, and writes it to another Kafka topic. I need to keep context so that messages with the same order ID are not treated as brand-new orders. To do that, I'm extending RichFlatMapFunction and using ValueState. As I understand it, I need a KeyedStream before I can call flatMap:

env.addSource(source()).keyBy(Order::getId).flatMap(new OrderMapper()).addSink(sink());

The problem is that keyBy seems to take a very long time from my perspective (80 to 200 ms). I attribute the delay to keyBy because if I remove it and replace flatMap with a map function, the 90th percentile latency is about 1 ms. Is there a way to use state/context without keyBy, or to make keyBy faster?

Abidi
  • I don't think using orderID as a key is a good idea. How many records do you need to process per minute, and for how long should an orderId be kept in storage for filtering? Can we say it is safe to remove an orderId from the lookup table after 1 day, because we are sure it will not appear again? – Kenank Dec 01 '22 at 08:19
  • We need to be prepared to receive 50K in a minute. As for whether an order disappears within 24 hours: we have two flows. In the first, an orderId won't last more than 10 seconds; in the second, an orderId could be in use for 1-5 days. – Abidi Dec 01 '22 at 10:03
  • @Kenank you raised a good point here: how to decide which ID to use in keyBy. My understanding is that the whole point is to be able to retrieve state by that ID? – Abidi Dec 01 '22 at 10:04

1 Answer

The keyBy is expensive because it requires a network shuffle -- every record is serialized, sent to the downstream instance responsible for that key, and then deserialized.
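
To make the routing step concrete, here is a minimal plain-Java sketch of how a key can be mapped to a parallel subtask. It is a simplification rather than Flink's actual code: Flink also applies a murmur hash to the key's hashCode before picking a key group, and the class and method names here (KeyRoutingSketch, keyGroupFor, subtaskFor) are made up for illustration.

```java
import java.util.List;

class KeyRoutingSketch {

    // Simplified stand-in for key-group assignment. Assumption: plain floorMod
    // is used here for readability; Flink additionally hashes key.hashCode().
    public static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Maps the key group onto one of the parallel subtasks (0 .. parallelism - 1).
    public static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        return keyGroupFor(key, maxParallelism) * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // illustrative value
        int parallelism = 4;      // illustrative value
        // The same order ID always lands on the same subtask, which is why
        // every record must be shipped to that subtask over the network.
        for (String orderId : List.of("order-1", "order-2", "order-1", "order-3")) {
            System.out.println(orderId + " -> subtask "
                    + subtaskFor(orderId, maxParallelism, parallelism));
        }
    }
}
```

Because the target subtask depends only on the key, the upstream operator cannot keep a record local unless it happens to hash to its own downstream neighbor, so serialization and a network hop are the normal case.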

For the pipeline you've described, this is unavoidable. But your choice of serializer can make a big difference.
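
As one illustration of the serializer point, the sketch below shows an Order class shaped so that Flink can treat it as a POJO type instead of falling back to the slower generic Kryo serializer. The fields (id, amountCents) are hypothetical, since the original Order class is not shown; only getId appears in the question.

```java
// Sketch of an Order type that follows Flink's POJO rules:
// public class, public no-argument constructor, and every field either
// public or reachable through a conventionally named getter and setter.
public class Order {

    private String id;        // hypothetical field, used as the keyBy key
    private long amountCents; // hypothetical field

    // A public no-arg constructor is required for POJO serialization.
    public Order() {
    }

    public Order(String id, long amountCents) {
        this.id = id;
        this.amountCents = amountCents;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public long getAmountCents() {
        return amountCents;
    }

    public void setAmountCents(long amountCents) {
        this.amountCents = amountCents;
    }
}
```

To check whether a job silently relies on Kryo, one option is to call disableGenericTypes() on the job's ExecutionConfig, which makes Flink fail when it would otherwise fall back to generic serialization. Whether this helps in your case is something to measure rather than assume.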

For more ideas about how to reduce latency, see Flink optimal configuration for minimum Latency.

As for the choice of key, if you need to deduplicate by orderId, then you'll have to key by the orderId.

David Anderson
  • I tried setting BufferTimeout to 1ms and my 99th percentile dropped to 1.2ms. I'm quite surprised by such a significant change, though I need to understand what it costs. As for how to decide which ID to use in keyBy: I'm choosing orderId simply because it's unique. – Abidi Dec 01 '22 at 10:15
  • In one benchmark, reducing the network buffer timeout to 1ms reduced throughput by about 25%. But results can vary quite a bit, so it's a good idea to make your own measurements. – David Anderson Dec 01 '22 at 20:17
  • https://stackoverflow.com/questions/63954164/flink-optimal-configuration-for-minimum-latency/63956701#63956701 has more ideas you may find helpful. – David Anderson Dec 01 '22 at 20:18