Some of my Kafka consumers (but not all) show an interesting pattern regarding their lag.
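For reference, consumer-group lag is commonly computed per partition as the log end offset minus the group's committed offset, then summed over all partitions. The following is a minimal sketch of that arithmetic, assuming the diagram uses this definition (the tool that produced it is not stated here). The function name `total_lag` and the sample offsets are hypothetical and for illustration only.

```python
from typing import Dict


def total_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum over partitions of (log end offset - committed offset)."""
    lag = 0
    for partition, end in end_offsets.items():
        # Assumption: a partition without a committed offset counts from 0.
        # Negative differences (stale end offsets) are clamped to 0.
        lag += max(end - committed.get(partition, 0), 0)
    return lag


if __name__ == "__main__":
    # Made-up offsets for three partitions, not taken from the real topics.
    end_offsets = {0: 1200, 1: 1180, 2: 1210}
    committed = {0: 1000, 1: 1000, 2: 1000}
    print(total_lag(end_offsets, committed))  # prints 590
```

Under this definition the reported value depends on when offsets are committed, not only on when messages are processed.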
The following image shows two good examples:
dark-blue:
- about 200 messages per second in topic
- 32 partitions
- 1 consumer in group (Python client, running on Kubernetes)
light-blue (same topic as dark-blue):
- also about 200 messages per second in topic
- also 32 partitions
- 1 consumer in group (also a Python client, running on Kubernetes)
brown:
- about 1500 messages per second in topic
- 40 partitions
- 2 consumers in group (Java/Spring client, running on Kubernetes)
Both clients with the sawtooth pattern can handle much higher throughput than that (tested by pausing them, resuming, and letting them catch up), so they are not running at their limit.
Rebalancing does happen occasionally (according to the logs), but much less often than the jumps in the diagram, and the few rebalance events don't correlate in time with the jumps.
The messages also do not come in batches. Here is the additional information for one of the affected topics:
Where could this pattern come from?