This is probably a heisenbug, so I'm not expecting hard answers, just hints on what to look for so I can replicate the bug.
I have an event-driven, Kafka-based system composed of several services. For now they are organized into linear pipelines: one topic, one event type. Every service can be thought of as a transformation from one event type to one or more event types.
Each transformation runs as a Python process with its own consumer and its own producer. They all share the same code and configuration, because all of this is abstracted away from the service implementation.
Now, the problem: on our staging environment, sometimes (say one in every fifty messages), a message sits on Kafka but the consumer never processes it. Even if you wait for hours, it just hangs. This doesn't happen in local environments, and we haven't been able to reproduce it anywhere else.
Some more relevant information:
- These services get restarted often for debugging purposes, but the hanging doesn't seem related to the restarts.
- When a message is stuck and you restart the service, the service processes that message.
- The services are completely stateless, so there's no caching or other weird stuff going on (I hope).
- When this happens, I have evidence that they are not still processing older messages (I log when they produce an output, and that happens right before the end of the consumer loop).
- With the current deployment there's just a single consumer in the consumer group, so no parallel processing inside the same service and no horizontal scaling.
How I consume:
I use pykafka and this is the consumer loop:
```python
def create_consumer(self):
    consumer = self.client.topics[bytes(self.input_topic, "UTF-8")].get_simple_consumer(
        consumer_group=bytes(self.consumer_group, "UTF-8"),
        auto_commit_enable=True,
        offsets_commit_max_retries=self.service_config.kafka_offsets_commit_max_retries,
    )
    return consumer

def run(self):
    consumer = self.create_consumer()
    while not self.stop_event.wait(1):
        message = consumer.consume()
        results = self._process_message(message)
        self.output_results(results)
```
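One way I'm thinking of narrowing this down (a sketch, not my production code; the `idle_warn_after` threshold and the watchdog logging are my own additions): build the consumer with `consumer_timeout_ms` set, so that `consume()` returns `None` after the timeout instead of blocking forever, and log each idle stretch. That would at least tell me whether the hang is "no message is being delivered to the client" or "the client is stuck inside `consume()`". The loop logic itself is generic, so it takes the consume/process callables as parameters:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("consumer-watchdog")


def run_with_watchdog(consume, process, stop, idle_warn_after=30.0):
    """Drive the consume/process loop, logging when no message arrives.

    `consume` is expected to return None on timeout (with pykafka, that
    is the behavior when the consumer is created with consumer_timeout_ms
    set; by default consume() blocks indefinitely).
    """
    last_message_at = time.monotonic()
    while not stop():
        message = consume()
        if message is None:
            idle = time.monotonic() - last_message_at
            if idle >= idle_warn_after:
                log.warning("no message for %.0fs, but broker shows a pending one?", idle)
            continue
        last_message_at = time.monotonic()
        process(message)


# With pykafka, the consumer would be created roughly like:
#   topic.get_simple_consumer(consumer_group=..., auto_commit_enable=True,
#                             consumer_timeout_ms=5000)
# so consume() returns None after 5s instead of blocking forever.
```

This way the "hang" becomes an observable, timestamped event in the logs instead of silence.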
My assumption is that there's either a problem with the way I consume the messages or some inconsistent state in the consumer group offsets, but I can't really wrap my head around it.
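To test the offsets hypothesis, I'd compare the group's committed offset with the log-end offset per partition: if the committed offset already equals the log-end while an unprocessed message sits on the topic, the committed state is wrong; if it lags behind, the consumer simply isn't fetching. A small helper for the comparison (the pykafka calls in the comment, `fetch_offsets()` and `latest_available_offsets()`, reflect my understanding of their API; the helper itself just diffs two plain dicts):

```python
def partition_lag(committed, log_end):
    """Per-partition lag: log-end offset minus committed offset.

    Both arguments are {partition_id: offset}. A committed offset of -1
    (pykafka's "nothing committed yet") is treated as lag from offset 0.
    """
    lag = {}
    for partition, end in log_end.items():
        done = committed.get(partition, -1)
        lag[partition] = end - (done if done >= 0 else 0)
    return lag


# With pykafka, the inputs would come from something like:
#   committed = {pid: resp.offset for pid, resp in consumer.fetch_offsets()}
#   log_end   = {pid: resp.offset[0] for pid, resp
#                in topic.latest_available_offsets().items()}
```

Non-zero lag on the partition holding the stuck message would point at the fetch path rather than the offset state.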
I'm also considering moving to Faust to solve the problem. Given my codebase and my architectural decisions, the transition shouldn't be too hard, but before starting that work I'd like to be sure I'm going in the right direction. Right now it would just be a blind shot, hoping that whatever detail is causing the problem goes away.