Restarting a Kafka (Python) consumer consumes all the messages in the queue again
I'm new to Kafka and am also trying to get a handle on offset management.
I'm using the latest release of Apache Kafka (0.8.1.1) with kafka-python 0.9.2 installed from PyPI (last upload on 2014-08-27), which is different from the current master branch on GitHub.
When testing with "SimpleConsumer" => crashing and restarting the script consumes messages from the last known offset.
When testing with "MultiProcessConsumer" => crashing and restarting the script restarts consuming from offset "0"
My little script (MultiProcessConsumer):
from kafka import KafkaClient, MultiProcessConsumer
KFK = KafkaClient("localhost:9092")
consumer = MultiProcessConsumer(KFK, "my-group1", "my-topic", num_procs=2)
I can check the offsets with:
consumer.offsets
{0: 0, 1: 0}
Then, I run:
A = consumer.get_messages(count=1235)
consumer.offsets
{0: 1235, 1: 0}
After crashing and restarting the script, the first call to consumer.offsets returns {0: 1235, 1: 0}, which is good. But running:
A = consumer.get_messages(count=388)
consumer.offsets
{0: 388, 1: 0}
Any idea how to deal with this problem? Moreover, is there any way to properly alter the MultiProcessConsumer offsets so that consumption starts from a defined position?
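If I read the kafka-python 0.9.2 source correctly, SimpleConsumer exposes a seek(offset, whence) method for this kind of repositioning, but I don't see an equivalent on MultiProcessConsumer. A minimal sketch of what I mean, assuming whence=0 is interpreted as relative to the earliest available offset:
from kafka import KafkaClient, SimpleConsumer
KFK = KafkaClient("localhost:9092")
simple = SimpleConsumer(KFK, "my-group1", "my-topic")
# Position each partition 500 messages past the earliest available
# offset (whence=0); 500 is just an arbitrary example value.
simple.seek(500, 0)
msgs = simple.get_messages(count=10)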
Thanks for your help.
Edit: After diving into the kafka-python library source and checking the issues on GitHub, see https://github.com/mumrah/kafka-python/issues/173
The problem is that when the master MultiProcessConsumer starts its sub-processes, it initializes their offsets to 0 on each partition of the topic (because auto_commit is set to False for the sub-processes) instead of passing them the correct values.
See mahall's comment on GitHub.