I'm trying to benchmark Apache Kafka, and at the moment I am measuring latency. To do this, I am producing 10 MB of tweet data (~14,500 tweets) so that I don't lose track of what's going on; I can send arbitrarily more data later. Those 10 MB are only for testing purposes for now.

I have configured the broker to use LogAppendTime timestamps, which I compare to the current time on the consumer side. While doing this, I always get latency values that increase over time (see details below), which is not what I expected.

Setup

  • VM with Ubuntu 20
  • 2 vCPUs
  • 9 GB RAM
  • confluent-kafka and Apache Kafka, both in the most recent version
  • 1 partition, 1 Kafka broker, all default settings (except for the LogAppendTime setting; see the sketch below)

Producer and consumer are both running on the same machine (at least for now). The VM is running in a data center, so bandwidth should not be a problem.
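
For reference, this is not from my original setup but a minimal sketch of how the LogAppendTime behaviour can be configured. Broker-wide it corresponds to log.message.timestamp.type=LogAppendTime in server.properties; it can also be set per topic, e.g. with confluent-kafka's AdminClient:

from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Make the broker stamp each record when it is appended to the log
# (instead of keeping the producer-supplied CreateTime).
resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "twitter-stream",
    set_config={"message.timestamp.type": "LogAppendTime"},
)

# Note: alter_configs replaces the topic's existing dynamic config; fine here
# since the topic otherwise uses defaults. result() raises if the change failed.
for res, future in admin.alter_configs([resource]).items():
    future.result()
    print("Timestamp type updated for {}".format(res))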

Constants

CONSUMER_GROUP_ID = "twitter-consumers"
KAFKA_TOPIC_TWITTER = "twitter-stream"

Producer Code

from confluent_kafka import Producer

producer_config = {
    "bootstrap.servers": "localhost:9092"
}
p = Producer(producer_config)

generation_steps = 1 # 10 MBs per step
with open(TWITTER_DATA_PATH, "r") as dataset:
    for step in range(generation_steps):
        print(f"Executing data generation step {step}...")
        dataset.seek(0) # Jump back to first line  
 
        for tweet in dataset:
            try:
                print("IN QUEUE: {}".format(len(p)))
                p.produce(KAFKA_TOPIC_TWITTER, value=tweet)
                p.poll(0) # Serve delivery events/callbacks without blocking
            except BufferError:
                print('[INFO] Local producer queue is full (%d messages awaiting delivery): Trying again after flushing...\n' % len(p)) 
                p.flush() # Send the messages in the queue

                # The tweet that caused the BufferError could not be sent, so after flushing, we can now send it
                p.produce(KAFKA_TOPIC_TWITTER, value=tweet)
                p.poll(0)

print("Data generation done!")

Consumer Code

import time

from confluent_kafka import Consumer, KafkaError

# Consume the data
consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": CONSUMER_GROUP_ID,
    "client.id": "client-1",
    "enable.auto.commit": True,
    "session.timeout.ms": 6000,
    "default.topic.config": {"auto.offset.reset": "smallest"}
}
c = Consumer(consumer_config)

c.subscribe([KAFKA_TOPIC_TWITTER])

msg_count = 0
try:
    while True:
        msg = c.poll(0.1)
        
        if msg is None:
            continue
        elif not msg.error():
            # msg.timestamp() returns a (timestamp_type, timestamp) tuple; the timestamp is in milliseconds
            delta = (time.time() * 1000) - msg.timestamp()[1]
            msg_count += 1
            print("Message {} received | Timestamp: {} | Latency: {}".format(msg_count, msg.timestamp()[1], delta))
        elif msg.error().code() == KafkaError._PARTITION_EOF:
            print("End of partition reached {}/{}".format(msg.topic(), msg.partition()))
        else:
            print("Error occurred: {}".format(msg.error().str()))
except KeyboardInterrupt:
    pass
finally:
    c.close()

What I get

Consumer Side

Message 1 received | Timestamp: 1618685061270 | Latency: 12.360107421875
Message 2 received | Timestamp: 1618685061270 | Latency: 13.025390625
Message 3 received | Timestamp: 1618685061270 | Latency: 13.08544921875
Message 4 received | Timestamp: 1618685061270 | Latency: 13.156005859375
Message 5 received | Timestamp: 1618685061270 | Latency: 13.1923828125
Message 6 received | Timestamp: 1618685061270 | Latency: 13.21875
Message 7 received | Timestamp: 1618685061270 | Latency: 13.245849609375
Message 8 received | Timestamp: 1618685061270 | Latency: 13.29833984375
Message 9 received | Timestamp: 1618685061270 | Latency: 13.348388671875
Message 10 received | Timestamp: 1618685061270 | Latency: 13.389892578125
Message 11 received | Timestamp: 1618685061270 | Latency: 13.43701171875
Message 12 received | Timestamp: 1618685061270 | Latency: 13.49755859375
Message 13 received | Timestamp: 1618685061270 | Latency: 13.54345703125
Message 14 received | Timestamp: 1618685061270 | Latency: 13.581787109375
Message 15 received | Timestamp: 1618685061270 | Latency: 13.626708984375
Message 16 received | Timestamp: 1618685061270 | Latency: 13.661376953125
Message 17 received | Timestamp: 1618685061270 | Latency: 13.695556640625
Message 18 received | Timestamp: 1618685061270 | Latency: 13.720703125
Message 19 received | Timestamp: 1618685061270 | Latency: 13.75537109375
Message 20 received | Timestamp: 1618685061270 | Latency: 13.78857421875
Message 21 received | Timestamp: 1618685061270 | Latency: 13.81396484375
[....]
Message 14485 received | Timestamp: 1618685061337 | Latency: 1061.004150390625

Producer Side

Executing data generation step 0...
IN QUEUE: 0
IN QUEUE: 1
IN QUEUE: 2
IN QUEUE: 3
IN QUEUE: 4
IN QUEUE: 5
IN QUEUE: 6
IN QUEUE: 7
IN QUEUE: 8
IN QUEUE: 9
IN QUEUE: 10
IN QUEUE: 11
IN QUEUE: 12
IN QUEUE: 13
IN QUEUE: 14
IN QUEUE: 15
IN QUEUE: 16
[...]
IN QUEUE: 14484
Data generation done!

What I would expect

As you can see, I am computing the latency as delta = (time.time() * 1000) - msg.timestamp()[1]. I would expect the latency to vary between roughly 10 and 100 ms at most, but I don't understand why it increases "linearly" over time.

Also, I don't understand why the producer's buffer queue grows so large; as far as I understood, Kafka should clear the queue after each .poll() call.

Does anybody know what could be causing this and what I could do about it?


1 Answer

I found the solution, which is pretty obvious: Printing in Python takes a lot of time, especially if you do it thousands and thousands of times...

After moving the printing out of the consumer loop and evaluating the latencies afterwards, the latency varies between ~2 and ~30 ms.
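
For reference, a minimal sketch of what that looks like (the variable name latencies and the summary statistics are my own illustration, reusing consumer_config and KAFKA_TOPIC_TWITTER from the question): the per-message print is replaced by appending each delta to a list, and everything is evaluated after the loop, where printing no longer skews the measurement.

import time

from confluent_kafka import Consumer, KafkaError

c = Consumer(consumer_config)
c.subscribe([KAFKA_TOPIC_TWITTER])

latencies = []  # collect per-message deltas instead of printing them immediately
try:
    while True:
        msg = c.poll(0.1)
        if msg is None:
            continue
        if not msg.error():
            # msg.timestamp() returns (timestamp_type, timestamp in ms)
            latencies.append((time.time() * 1000) - msg.timestamp()[1])
        elif msg.error().code() != KafkaError._PARTITION_EOF:
            print("Error occurred: {}".format(msg.error().str()))
except KeyboardInterrupt:
    pass
finally:
    c.close()

# Evaluate outside the hot loop
if latencies:
    print("Messages consumed: {}".format(len(latencies)))
    print("Min latency: {:.2f} ms".format(min(latencies)))
    print("Avg latency: {:.2f} ms".format(sum(latencies) / len(latencies)))
    print("Max latency: {:.2f} ms".format(max(latencies)))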
