
I've seen this in production once (I don't remember how we solved it), and now I can reproduce it in the integration tests, which always start with a brand-new Kafka installation. Here's how it goes:

Step 1: A consumer of a group that doesn't exist yet subscribes to a topic that does not exist yet and starts polling.

import confluent_kafka

self.kafka_consumer = confluent_kafka.Consumer({
    'group.id': 'mygroup',
    'bootstrap.servers': 'kafka:9092',
    'enable.auto.commit': False,
    'auto.offset.reset': 'earliest',
})
# subscribe() expects a list of topic names, not a bare string
self.kafka_consumer.subscribe(['mytopic'])

Step 2: A producer writes a message to the topic.

Result:

  • About half the time it works fine; the consumer reads the message.
  • The other half of the time the consumer seems stuck. I've waited up to 10 minutes to see if it would get unstuck, but it doesn't.
  • Even if the two steps are reversed, i.e. the consumer tries to subscribe to an already existing topic that already has a message, the behavior is the same (however the group is always new).

More details

The consumer polls with a timeout of 2 seconds; if there's no result, it loops and polls again.

While the topic doesn't exist, poll() returns None. Once the topic exists, poll() returns a msg whose error().code() is _PARTITION_EOF.
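For reference, the polling loop described above looks roughly like the sketch below. This is a simplified reconstruction, not the exact test code: the function name `poll_until_message` and the `max_polls` parameter are made up for illustration, and the EOF code is passed in by the caller (with confluent_kafka that would be `confluent_kafka.KafkaError._PARTITION_EOF`).

```python
def poll_until_message(consumer, eof_code, timeout=2.0, max_polls=None):
    """Poll until a real message arrives.

    Returns the message, or None if max_polls polls pass without one.
    `consumer` is anything with a confluent_kafka-style poll(timeout).
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        msg = consumer.poll(timeout)
        if msg is None:
            # Nothing arrived within the timeout (e.g. the topic
            # does not exist yet); loop and poll again.
            continue
        err = msg.error()
        if err is None:
            # A real message with a payload.
            return msg
        if err.code() == eof_code:
            # End of partition reached; not a failure, keep polling.
            continue
        raise RuntimeError("Kafka error: {}".format(err))
    return None
```

In the stuck case, this loop keeps alternating between the `None` branch and the `_PARTITION_EOF` branch and never reaches the `return msg` line.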

While the consumer seems stuck, I ask kafka what's going on with mygroup, and here's what it tells me:

root@e7b124b4039c:/# /usr/local/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group mygroup --describe
Note: This will not show information about old Zookeeper-based consumers.


TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG        CONSUMER-ID                                       HOST                           CLIENT-ID
root@e7b124b4039c:/#

I try to make it unstuck by trying to read another nonexistent topic as mygroup:

root@e7b124b4039c:/# /usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --group mygroup --topic nonexistent --from-beginning
[2018-03-15 16:36:59,369] WARN [Consumer clientId=consumer-1, groupId=pixelprocessor] Error while fetching metadata with correlation id 2 : {nonexistent=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
^CProcessed a total of 0 messages
root@e7b124b4039c:/#

After I do that, here's what Kafka has to say about mygroup:

root@e7b124b4039c:/# /usr/local/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group mygroup --describe
Note: This will not show information about old Zookeeper-based consumers.


TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG        CONSUMER-ID                                       HOST                           CLIENT-ID
mytopic                        0          -               1               -          rdkafka-a172d013-08e6-4ee2-92f3-fdb07d163d57      /172.20.0.6                    rdkafka
(another topic)                0          -               0               -          rdkafka-a172d013-08e6-4ee2-92f3-fdb07d163d57      /172.20.0.6                    rdkafka
(a third topic)                0          -               0               -          rdkafka-a172d013-08e6-4ee2-92f3-fdb07d163d57      /172.20.0.6                    rdkafka
nonexistent                    0          0               0               0          -                                                 -                              -

This is Kafka 1.0.1, librdkafka 0.11.3, and confluent_kafka 0.11.0, in Ubuntu 16.04 Docker containers (with the OS-packaged ZooKeeper 3.4.8) running on a Debian stretch (9.4) host with Linux 4.9.0-6-amd64.

OneCricketeer
Antonis Christofides
  • Re consumer lag: librdkafka will (currently) only commit offsets for messages it has seen, not when it reaches the end of a partition (PARTITION_EOF) without having consumed at least one message. That's why nothing is showing up in the consumer group describe when no messages were consumed. – Edenhill Mar 20 '18 at 20:17
  • If you restart the consumer after its first run, does it pick up the message on second run? – Edenhill Mar 20 '18 at 20:17
  • @Edenhill AFAICS no. I've restarted it five times and waited (from a few seconds to a couple of minutes); nothing happens. – Antonis Christofides Mar 21 '18 at 08:08
  • Meanwhile a workaround that seems to work: Before starting the real consumer process, the integration test creates (using `kafka-python`) a consumer belonging to the (yet nonexisting) real consumer's group, briefly polls the (yet nonexisting) topic, and is then closed. After that the real consumer process is started and it always works. – Antonis Christofides Mar 21 '18 at 11:52
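
The workaround from the last comment could be sketched like this. It is only an approximation of what the integration test does: the function name `prime_consumer_group` and the `consumer_factory` parameter are hypothetical (the latter exists so the sketch can be exercised without a broker), and the 1-second poll is an assumed value for "briefly". By default it uses kafka-python's `KafkaConsumer`.

```python
def prime_consumer_group(topic, group_id, servers,
                         consumer_factory=None, poll_ms=1000):
    """Join the group with a throwaway consumer, poll once, then close it."""
    if consumer_factory is None:
        # Imported lazily so that kafka-python is only required
        # when no factory is injected.
        from kafka import KafkaConsumer
        consumer_factory = KafkaConsumer
    consumer = consumer_factory(
        topic,
        group_id=group_id,
        bootstrap_servers=servers,
    )
    try:
        # One brief poll; the returned records are deliberately ignored.
        consumer.poll(timeout_ms=poll_ms)
    finally:
        # Always leave the group cleanly before the real consumer starts.
        consumer.close()
```

The test would call something like `prime_consumer_group('mytopic', 'mygroup', 'kafka:9092')` before launching the real consumer process.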

1 Answer


The problem seems to have been in the Consumer() arguments. This doesn't work properly:

self.kafka_consumer = confluent_kafka.Consumer({
    'group.id': 'mygroup',
    'bootstrap.servers': 'kafka:9092',
    'auto.offset.reset': 'earliest',
})

But this does:

self.kafka_consumer = confluent_kafka.Consumer({
    'group.id': 'mygroup',
    'bootstrap.servers': 'kafka:9092',
    'default.topic.config': {
        'auto.offset.reset': 'earliest',
    },
})
Antonis Christofides