We have a topic with 6 partitions. Traffic is not heavy by Kafka standards. We are running a 3-broker cluster on AWS MSK and using the Confluent Kafka package for .NET.
When we consume the topic, I take the difference between the produce timestamp (the Kafka message Timestamp metadata) and the consume time (taken in our code immediately after consumer.Consume returns). Allowing for clock drift between the servers, that difference should be roughly constant. But we periodically see an unexpectedly large difference, sometimes 60 seconds or more.
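To make the measurement concrete, here is a minimal sketch of the calculation in Python rather than our actual .NET code. The function names and the 5-second threshold are hypothetical, chosen only for illustration. It treats the median difference as the baseline (which absorbs a constant clock offset between servers) and flags anything far above it.

```python
from statistics import median

def latency_seconds(produce_ts_ms, consume_ts_ms):
    """Difference between consume time and produce timestamp, in seconds.

    Both inputs are assumed to be Unix epoch milliseconds.
    """
    return (consume_ts_ms - produce_ts_ms) / 1000.0

def flag_spikes(latencies, threshold_s=5.0):
    """Return (index, excess_seconds) for samples far above the median.

    The median stands in for the "normal" difference, so a constant
    clock offset between producer and consumer hosts cancels out.
    threshold_s is an arbitrary illustrative cutoff.
    """
    baseline = median(latencies)
    return [
        (i, round(lat - baseline, 3))
        for i, lat in enumerate(latencies)
        if lat - baseline > threshold_s
    ]
```

Feeding it a run of differences that are mostly around 0.02 seconds with one 60-second outlier would flag only the outlier, which matches the pattern we are seeing.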
We have set up a variety of test scenarios. I created a test consumer and eliminated all processing of the message: I simply consume the message, then spawn a thread to compute the time difference and write it to stdout. I have tested a single consumer, as well as 6 consumers on separate threads in the same consumer group. We still periodically see the delay. During the delay, a Kafka UI tool shows only minimal lag, usually 20-200.
I also wrote a test producer that produces messages to a different topic, also with 6 partitions, at a rate much faster than our live system. The test consumer is able to consume from that topic without ever seeing the delay. Furthermore, the average time difference there is about 0.02 seconds, whereas on the live topic it varies between 0.02 and 1 second.
Dumping some data, one thing we noticed is that the produce timestamps do not always increase in step with the offset within a partition. With that in mind, I also want to consider producer-side issues. I'm definitely not well-versed in Kafka configuration.
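For reference, this is roughly how the out-of-order timestamps can be pulled out of a dump. It is a hedged sketch in Python, not our production code: the record layout (partition, offset, timestamp in milliseconds) and the function name are assumptions made for illustration. Offsets are only ordered within a partition, so the comparison is done per partition.

```python
from collections import defaultdict

def find_timestamp_inversions(records):
    """Find records whose timestamp is earlier than the previous offset's.

    records: iterable of (partition, offset, timestamp_ms) tuples
             (a hypothetical layout for a dumped topic).
    Returns a list of (partition, offset, backwards_ms) tuples, where
    backwards_ms is how far the timestamp stepped back.
    """
    by_partition = defaultdict(list)
    for partition, offset, ts_ms in records:
        by_partition[partition].append((offset, ts_ms))

    inversions = []
    for partition in sorted(by_partition):
        rows = sorted(by_partition[partition])  # order by offset
        # Compare each record with its predecessor in the same partition.
        for (_, prev_ts), (offset, ts_ms) in zip(rows, rows[1:]):
            if ts_ms < prev_ts:
                inversions.append((partition, offset, prev_ts - ts_ms))
    return inversions
```

If the inversions line up with the large differences on the consumer side, that would point toward the timestamp being assigned well before the message actually reaches the broker, rather than toward a slow consumer.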
Any other suggestions for things to look for?