REF : Restarting a Kafka (python) consumer consumes all the messages in the queue again

I'm new to Kafka and am also trying to deal with offset management.

I'm using the latest release of Apache Kafka (0.8.1.1) with kafka-python 0.9.2 installed from PyPI (last upload on 2014-08-27), which differs from the current master branch on GitHub.

When testing with "SimpleConsumer": after a crash, restarting the script resumes consuming from the last known offset.

When testing with "MultiProcessConsumer": after a crash, restarting the script starts consuming again from offset "0".

My little script (MultiProcessConsumer):

from kafka import KafkaClient, MultiProcessConsumer
KFK = KafkaClient("localhost:9092")
consumer = MultiProcessConsumer(KFK, "my-group1", "my-topic", num_procs=2)

I can check offsets by:

consumer.offsets
{0: 0, 1: 0}

Then, I run:

A = consumer.get_messages(count=1235)
consumer.offsets
{0: 1235, 1: 0}

After crashing and restarting the script, the first call to "consumer.offsets" returns "{0: 1235, 1: 0}", which is good. But fetching more messages resets the offset on partition 0 instead of advancing it from 1235:

A = consumer.get_messages(count=388)
consumer.offsets
{0: 388, 1: 0}

Any idea how to deal with this problem? Also, is there a proper way to set the MultiProcessConsumer offsets so that it starts from a defined position?

Thanks for your help.

Edit: After diving into the kafka-python source and checking the issues on GitHub, I found this: https://github.com/mumrah/kafka-python/issues/173

So the problem is that when the master MultiProcessConsumer starts its sub-processes, it initializes their offsets to "0" on each partition of the topic (because auto-commit is set to false for the sub-processes) instead of passing them the correct values.

See the comment by "mahall" on GitHub.
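To make the effect concrete, here is a toy model in plain Python. It does not call kafka-python at all; the function names `start_workers` and `consume` are hypothetical and only mimic the behaviour described in the issue, using the numbers from the session above.

```python
# Toy model only: plain Python, no kafka-python calls.
# start_workers() and consume() are hypothetical helpers for illustration.

def start_workers(committed, partitions, pass_offsets):
    """Return the offset each worker would start from, per partition."""
    if pass_offsets:
        # Desired behaviour: workers inherit the group's known offsets.
        return {p: committed.get(p, 0) for p in partitions}
    # Behaviour described in the issue: workers start from 0 everywhere.
    return {p: 0 for p in partitions}


def consume(offsets, partition, count):
    """Advance one partition by `count` messages and return new offsets."""
    updated = dict(offsets)
    updated[partition] += count
    return updated


# Offsets known for "my-group1" before the crash.
committed = {0: 1235, 1: 0}

buggy = consume(start_workers(committed, [0, 1], pass_offsets=False), 0, 388)
fixed = consume(start_workers(committed, [0, 1], pass_offsets=True), 0, 388)

print(buggy)  # {0: 388, 1: 0}
print(fixed)  # {0: 1623, 1: 0}
```

The first result matches what I observe after the restart; the second is what I would expect if the sub-processes were handed the stored offsets.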

Community
PyName

1 Answer

It depends on how the consumer asks the Kafka broker for its starting offset. Most likely you are doing something equivalent to this in Java, which requests the earliest available offset:

readOffset = getLastOffset(consumer,topic, partition, kafka.api.OffsetRequest.EarliestTime(), clientName);

Try something like this instead, which requests the latest offset:

readOffset = getLastOffset(consumer,topic, partition, kafka.api.OffsetRequest.LatestTime(), clientName);
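As a rough illustration of the difference, the following plain-Python sketch models how a start position might be chosen. It is not Kafka or kafka-python API; `resolve_start_offset` and its policy names are hypothetical, and the offsets are made-up example values. Note that starting from the latest offset skips any messages that were produced but not yet consumed, so resuming from a stored offset is usually what a restarted consumer wants.

```python
# Toy model only: illustrates earliest vs. latest vs. stored start positions.
# resolve_start_offset() is a hypothetical helper, not a Kafka API.

def resolve_start_offset(policy, earliest, latest, committed=None):
    """Pick the offset a consumer would begin reading from."""
    if policy == "earliest":
        # Re-read everything still retained in the partition.
        return earliest
    if policy == "latest":
        # Only read messages produced from now on.
        return latest
    if policy == "committed":
        # Resume where the group left off; fall back to earliest if unknown.
        return committed if committed is not None else earliest
    raise ValueError("unknown policy: %r" % policy)


# Example values: partition holds offsets 0..1999, group stopped at 1235.
print(resolve_start_offset("earliest", 0, 2000, committed=1235))   # 0
print(resolve_start_offset("latest", 0, 2000, committed=1235))     # 2000
print(resolve_start_offset("committed", 0, 2000, committed=1235))  # 1235
```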
zero