This is probably a heisenbug, so I'm not expecting hard answers, just hints on what to look for so I can replicate the bug.
I have an event-driven, Kafka-based system composed of several services. For now they are organized into linear pipelines: one topic, one event type. Every service can be thought of as a transformation from one event type to one or more event types.
Each transformation runs as a Python process with its own consumer and its own producer. They all share the same code and configuration, because all of this is abstracted away from the service implementation.
Now, the problem: on our staging environment, sometimes (say one in every fifty messages), a message sits on Kafka but the consumer never processes it. Even if you wait for hours, it just hangs. This doesn't happen in local environments, and we haven't been able to reproduce it anywhere else.
Some more relevant information:
- These services get restarted often for debugging purposes, but the hanging doesn't seem related to the restarts.
- When a message is stuck and you restart the service, the service processes that message.
- The services are completely stateless, so there's no caching or other weird stuff going on (I hope).
- When this happens, I have evidence that they are not still processing older messages (I log when they produce an output, and that happens right before the end of the consumer loop).
- With the current deployment there's just a single consumer in the consumer group, so no parallel processing inside the same service and no horizontal scaling.
How I consume:
I use pykafka and this is the consumer loop:
```python
def create_consumer(self):
    consumer = self.client.topics[bytes(self.input_topic, "UTF-8")].get_simple_consumer(
        consumer_group=bytes(self.consumer_group, "UTF-8"),
        auto_commit_enable=True,
        offsets_commit_max_retries=self.service_config.kafka_offsets_commit_max_retries,
    )
    return consumer

def run(self):
    consumer = self.create_consumer()
    while not self.stop_event.wait(1):
        message = consumer.consume()
        results = self._process_message(message)
        self.output_results(results)
```
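One way I'm thinking of narrowing this down (a sketch, not my production code; the `idle_warn_after` threshold and the watchdog logging are my own additions): build the consumer with `consumer_timeout_ms` set, so that `consume()` returns `None` after the timeout instead of blocking forever, and log each idle stretch. That would at least tell me whether the hang is "no message is being delivered to the client" or "the client is stuck inside `consume()`". The loop logic itself is generic, so it takes the consume/process callables as parameters:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("consumer-watchdog")


def run_with_watchdog(consume, process, stop, idle_warn_after=30.0):
    """Drive the consume/process loop, logging when no message arrives.

    `consume` is expected to return None on timeout (with pykafka, that
    is the behavior when the consumer is created with consumer_timeout_ms
    set; by default consume() blocks indefinitely).
    """
    last_message_at = time.monotonic()
    while not stop():
        message = consume()
        if message is None:
            idle = time.monotonic() - last_message_at
            if idle >= idle_warn_after:
                log.warning("no message for %.0fs, but broker shows a pending one?", idle)
            continue
        last_message_at = time.monotonic()
        process(message)


# With pykafka, the consumer would be created roughly like:
#   topic.get_simple_consumer(consumer_group=..., auto_commit_enable=True,
#                             consumer_timeout_ms=5000)
# so consume() returns None after 5s instead of blocking forever.
```

This way the "hang" becomes an observable, timestamped event in the logs instead of silence.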
My assumption is that there's either a problem with the way I consume the messages or some inconsistent state in the consumer group offsets, but I can't really wrap my head around it.
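To test the offsets hypothesis, I'd compare the group's committed offset with the log-end offset per partition: if the committed offset already equals the log-end while an unprocessed message sits on the topic, the committed state is wrong; if it lags behind, the consumer simply isn't fetching. A small helper for the comparison (the pykafka calls in the comment, `fetch_offsets()` and `latest_available_offsets()`, reflect my understanding of their API; the helper itself just diffs two plain dicts):

```python
def partition_lag(committed, log_end):
    """Per-partition lag: log-end offset minus committed offset.

    Both arguments are {partition_id: offset}. A committed offset of -1
    (pykafka's "nothing committed yet") is treated as lag from offset 0.
    """
    lag = {}
    for partition, end in log_end.items():
        done = committed.get(partition, -1)
        lag[partition] = end - (done if done >= 0 else 0)
    return lag


# With pykafka, the inputs would come from something like:
#   committed = {pid: resp.offset for pid, resp in consumer.fetch_offsets()}
#   log_end   = {pid: resp.offset[0] for pid, resp
#                in topic.latest_available_offsets().items()}
```

Non-zero lag on the partition holding the stuck message would point at the fetch path rather than the offset state.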
I'm also considering moving to Faust to solve the problem. Given my codebase and my architectural decisions, the transition shouldn't be too hard, but before starting that work I'd like to be sure I'm going in the right direction. Right now it would just be a blind shot, hoping that whatever detail is causing the problem goes away.