
I have an application for downloading specific web content from a stream of URLs generated by one Kafka producer. I've created a topic with 5 partitions and there are 5 Kafka consumers. However, the timeout for the webpage download is 60 seconds. While one of the URLs is being downloaded, the server assumes that the message is lost and resends the data to different consumers.

I've tried everything mentioned in

Kafka consumer configuration / performance issues

and

https://github.com/spring-projects/spring-kafka/issues/202

But I keep getting different errors every time.

Is it possible to tie a specific consumer to a partition in Kafka? I am using kafka-python for my application.

ashdnik

3 Answers


I missed this in the kafka-python documentation. We can use the TopicPartition class to assign a specific consumer to one partition.

http://kafka-python.readthedocs.io/en/master/

>>> # manually assign the partition list for the consumer
>>> from kafka import KafkaConsumer, TopicPartition
>>> consumer = KafkaConsumer(bootstrap_servers='localhost:1234')
>>> consumer.assign([TopicPartition('foobar', 2)])
>>> msg = next(consumer)
ashdnik

I have never used the Python client, but the Java one supports the assign method, which you can use instead of subscribe to ask for specific partitions of the topic. Of course, you lose the auto-rebalancing feature and have to handle those cases manually.
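The Python client exposes the same idea. A minimal sketch of one worker process per partition, assuming kafka-python, a local broker, a topic named 'urls', and a PARTITION_ID environment variable telling each of the 5 processes which partition it owns (the topic name, broker address, and variable are placeholders, not from the question):

import os

from kafka import KafkaConsumer, TopicPartition

# Each worker process is started with PARTITION_ID set to 0..4 and owns
# exactly that partition, so a slow download cannot trigger a rebalance
# that hands the partition to another consumer.
partition_id = int(os.environ["PARTITION_ID"])  # hypothetical env var

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
consumer.assign([TopicPartition("urls", partition_id)])  # "urls" is an assumed topic name

for message in consumer:
    url = message.value.decode("utf-8")
    # download the page here; with manual assignment, tracking offsets is your job

Because there is no group coordination here, nothing ever revokes the partition; the trade-off, as noted above, is that you have to handle failed workers and offset tracking yourself.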

ppatierno

I can guess what really happens in your case. Your consumer fetches the URL from Kafka and then goes off to download the content, which you said takes about 60 seconds. The download blocks the consumer, so it cannot send heartbeats to the Kafka server. The server therefore thinks this consumer is down, performs a group rebalance, and resends the uncommitted message to other consumers.

So there are two solutions you could try:

  1. Set the session_timeout_ms config to 60000 or higher. The default is 30s, which is not enough for you.

  2. A better solution is to use multithreading: when your consumer fetches a message from Kafka, start a new thread to download the content. The download then does not block consumer.poll(), so the consumer keeps working (see the sketch after this list).
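A minimal sketch combining both ideas, assuming kafka-python, a local broker, and a consumer group; the topic name, group id, broker address, and download function are placeholders, not from the question:

from concurrent.futures import ThreadPoolExecutor

from kafka import KafkaConsumer


def download(url):
    """Placeholder for the real page download (the ~60 s operation)."""


consumer = KafkaConsumer(
    "urls",                              # assumed topic name
    bootstrap_servers="localhost:9092",
    group_id="downloaders",              # assumed group id
    session_timeout_ms=60000,            # solution 1: more headroom before a rebalance
)

# Solution 2: hand each download to a worker thread so the poll loop
# (and its heartbeats) never blocks for the full download.
pool = ThreadPoolExecutor(max_workers=5)

for message in consumer:
    url = message.value.decode("utf-8")
    pool.submit(download, url)

One caveat with this sketch: with auto-commit left on, an offset can be committed before its page has finished downloading, so a crash loses that URL; committing manually after the download completes restores at-least-once behaviour. Also, depending on your kafka-python version, max_poll_interval_ms rather than session_timeout_ms may be the setting that governs slow poll loops.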

GuangshengZuo
  • I tried the first solution you mentioned. As you said, I was getting this error: _CommitFailedError: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records._ I will try the second solution and update. – ashdnik Aug 30 '17 at 05:49