Kafka Streams: event-time skew when processing messages from different partitions

Question

Let's consider a topic with multiple partitions and messages written in event-time order without any particular partitioning scheme. Kafka Streams application does some transformations on these messages, then groups by some key, and then aggregates messages by an event-time window with the given grace period.

Each task could process incoming messages at a different speed (e.g., because running on servers with different performance characteristics). This means that after groupBy shuffle, event-time ordering will not be preserved between messages in the same partition of the internal topic when they originate from different tasks. After a while, this event-time skew could become larger than the grace period, which would lead to dropping messages originating from the lagging task.

Increasing the grace period doesn't seem like a valid option because it would delay emitting the final aggregation result. Apache Flink handles this by emitting the lowest watermark on partitions merge.

Should it be a real concern, especially when processing large amounts of historical data, or do I miss something? Does Kafka Streams offer a solution to deal with this scenario?

UPDATE My question is not about KStream-KStream joins but about single KStream event-time aggregation preceded by a stream shuffle.

Consider this code snippet:

stream
  .mapValues(...)
  .groupBy(...)
  .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).grace(Duration.ofSeconds(10)))
  .aggregate(...)

I assume mapValues() operation could be slow for some tasks for whatever reason, and because of that tasks do process messages at a different pace. When a shuffle happens at the aggregate() operator, task 0 could have processed messages up to time t while task 1 is still at t-skew, but messages from both tasks end up interleaved in a single partition of the internal topic (corresponding to the grouping key).

My concern is that when skew is large enough (more than 10 seconds in my example), messages from the lagging task 1 will be dropped.

fhussonnois · Accepted Answer · 2020-11-16T13:56:01.873

Basically, a task/processor maintains a stream-time which is defined as the highest timestamp of any record already polled. This stream-time is then used for different purpose in Kafka Streams (e.g: Punctator, Windowded Aggregation, etc).

[Windowed Aggregation]

As you mentioned, the stream-time is used to determine if a record should be accepted, i.e record_accepted = end_window_time(current record) + grace_period > observed stream_time.

As you described it, if several tasks run in parallel to shuffle messages based on a grouping key, and some tasks are slower than others (or some partitions are offline) this will create out-of-order messages. Unfortunately, I'm afraid that the only way to deal with that is to increase the grace_period.

This is actually the eternal trade-off between Availability and Consistency.

[Behaviour for KafkaStream and KafkaStream/KTable Join

When you are perfoming a join operation with Kafka Streams, an internal Task is assigned to the "same" partition over multiple co-partitioned Topics. For example the Task 0 will be assigned to Topic1-Partition0 and TopicB-Partition0.

The fetched records are buffered per partition into internal queues that are managed by Tasks. So, each queue contains all records for a single partition waiting for processing.

Then, records are polled one by one from queues and processed by the topology instance. But, this is the record from the non-empty queue having the lowest timestamp which is returned from the polled.

In addition, if a queue is empty, the task may become idle during a period of time so that no more records are polled from queue. You can actually configure the maximum amount of time a Task will stay idle can be defined with the stream config :max.task.idle.ms

This mecanism allows synchronizing co-localized partitions. Bu, default the max.task.idle.ms is set to 0. This means a Task will never wait for more data from a partition which may lead to records being filtered because the stream-time will potentially increase more quickly.

Thank you for the explaination. My question is not about KStream-KStream joins but about single KStream event-time aggregation preceded by a stream shuffle. I've updated the question to better describe this scenario. — Boris Sukhinin, Nov 16 '20 at 12:58
Okay, I get it. I've updated my answer but unfortunately I don't think there is any solution for that (I'm still thinking...) — fhussonnois, Nov 16 '20 at 13:58

Kafka Streams: event-time skew when processing messages from different partitions

1 Answers1