We have a topic with 6 partitions. Traffic is not excessive by Kafka standards. We are using a 3-broker cluster in AWS MSK and the Confluent Kafka package for .NET.

When we consume the topic, I take the difference between the produce timestamp (the Kafka message Timestamp metadata) and the consume time (taken in our code immediately after consumer.Consume returns). Accounting for time drift between the servers, the relative time difference should be roughly constant. But we periodically see an unexpectedly large difference, sometimes 60 seconds or more.

We have set up a variety of test scenarios. I created a test consumer and eliminated all processing of the message: I simply consume the message, then spawn off a thread to compute the time diff and write it to stdout. I have tested a single consumer, as well as 6 consumers on separate threads in the same consumer group. We still periodically see the delay. During the delay, a Kafka UI tool shows only minimal lag, usually 20-200 messages.
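For reference, here is a minimal sketch of the kind of measurement loop described above, using the Confluent.Kafka .NET client; the bootstrap servers, group id, and topic name are placeholders, and the real test spawns the diff calculation on a separate thread:

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "b-1.example.kafka.amazonaws.com:9092", // placeholder MSK brokers
    GroupId = "latency-test",                                  // placeholder group id
    AutoOffsetReset = AutoOffsetReset.Latest
};

using var consumer = new ConsumerBuilder<Ignore, byte[]>(config).Build();
consumer.Subscribe("live-topic"); // placeholder topic name

while (true)
{
    var result = consumer.Consume(CancellationToken.None);

    // Difference between the produce timestamp carried in the message metadata
    // and the time observed immediately after Consume() returns.
    var produceTime = result.Message.Timestamp.UtcDateTime;
    var diff = DateTime.UtcNow - produceTime;

    Console.WriteLine($"{result.TopicPartitionOffset}: {diff.TotalSeconds:F3}s");
}
```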

I also wrote a test producer that produces messages to a different topic, also with 6 partitions. This is done at a rate much faster than our live system. The test consumer is able to consume from that topic without ever seeing the delay. Furthermore, the average diff time is about 0.02 seconds there, whereas on the live topic the average diff time varies between 0.02 and 1 second.

Dumping some data, one thing we noticed is that the produce timestamps do not always increase in step with the offsets. With that in mind, I also want to consider producer-side issues. I'm definitely not well-versed in Kafka configuration.

Any other suggestions for things to look for?

Matthew Allen
  • Have you looked at how long it takes the producer to get a delivery callback? It is very possible that the producer is batching messages before sending data to Kafka (which is typically a good thing). This would be configured via `linger.ms` in the producer's configuration, which may be set differently in your test producer. – Chris Beard Aug 04 '23 at 00:54
  • I'll check the time to produce and read up on that setting--thx, Chris. – Matthew Allen Aug 04 '23 at 12:52
  • We're still researching both the Produce and Consume sides of this issue. For ConsumerConfig, I see a fetch.max.wait.ms setting and fetch.min.bytes. I wonder if those can be used to tell the consumer to quit waiting and hop to the next partition if something is taking too long. – Matthew Allen Aug 04 '23 at 12:55
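A minimal sketch (not from the original post) of measuring the producer-side delivery latency suggested in the comments above; `linger.ms` maps to `LingerMs` on `ProducerConfig` in the .NET client, and the broker address and topic name are placeholders:

```csharp
using System;
using System.Diagnostics;
using Confluent.Kafka;

var config = new ProducerConfig
{
    BootstrapServers = "b-1.example.kafka.amazonaws.com:9092", // placeholder
    LingerMs = 5 // linger.ms: how long the client batches messages before sending
};

using var producer = new ProducerBuilder<Null, string>(config).Build();

var stopwatch = Stopwatch.StartNew();
var produceCallTime = stopwatch.Elapsed;

producer.Produce("test-topic", new Message<Null, string> { Value = "payload" }, report =>
{
    // Time from the Produce() call until the broker acknowledged the message.
    var deliveryLatency = stopwatch.Elapsed - produceCallTime;
    Console.WriteLine($"delivered at offset {report.Offset} after " +
                      $"{deliveryLatency.TotalMilliseconds:F1} ms (error: {report.Error.Code})");
});

producer.Flush(TimeSpan.FromSeconds(10));
```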

1 Answer


After more testing and simulation, we discovered the problem was actually Producer lag.

We were awaiting ProduceAsync() for each message. Re-reading the documentation, the Produce() call is more performant and better suited to our system. After changing this, our tests showed each producer going from about 100 msg/sec to many thousands of msg/sec.

We were also calling .Flush() after producing each message. We removed that and now only call .Flush() on Dispose.
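A rough sketch of the change described above, assuming a Confluent.Kafka producer with placeholder broker, topic, and payload names: Produce() queues each message and reports the result via a callback instead of awaiting every delivery, and Flush() runs once at shutdown rather than after every message:

```csharp
using System;
using Confluent.Kafka;

var config = new ProducerConfig
{
    BootstrapServers = "b-1.example.kafka.amazonaws.com:9092" // placeholder
};

var payloads = new[] { "message-1", "message-2", "message-3" }; // placeholder data

using (var producer = new ProducerBuilder<Null, string>(config).Build())
{
    foreach (var payload in payloads)
    {
        // Before: await producer.ProduceAsync(...) plus a Flush() per message,
        // which serializes delivery and caps throughput.
        // After: fire-and-forget Produce(); the client batches and delivers asynchronously.
        producer.Produce("live-topic", new Message<Null, string> { Value = payload },
            report =>
            {
                if (report.Error.IsError)
                    Console.WriteLine($"Delivery failed: {report.Error.Reason}");
            });
    }

    // Flush once at shutdown so anything still queued is delivered before Dispose.
    producer.Flush(TimeSpan.FromSeconds(10));
}
```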

The reason the message timestamp metadata did not always increase in step with the offset is that the timestamp type is set to CreateTime, so the value is set when we initially call ProduceAsync(). The offset isn't assigned until the message is actually appended to the Kafka log.
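A small sketch showing how the timestamp type can be checked on the consumer side with the .NET client (broker, group, and topic names are placeholders); with CreateTime the timestamp comes from the client at produce time, while the offset is assigned by the broker on append:

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "b-1.example.kafka.amazonaws.com:9092", // placeholder
    GroupId = "timestamp-check"                                // placeholder
};

using var consumer = new ConsumerBuilder<Ignore, byte[]>(config).Build();
consumer.Subscribe("live-topic"); // placeholder

var result = consumer.Consume(CancellationToken.None);

// Timestamp.Type is CreateTime or LogAppendTime depending on the topic/broker config;
// with CreateTime, timestamps need not increase in step with offsets.
Console.WriteLine($"type: {result.Message.Timestamp.Type}, " +
                  $"created: {result.Message.Timestamp.UtcDateTime:O}, " +
                  $"offset: {result.Offset}");
```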

Matthew Allen