I am trying to read a Kafka topic from the earliest offset and then tombstone certain records with a Python script. Since there are a huge number of messages (a million or more), I want to use multiprocessing to speed up consumption. Here's a snippet from the script:

    from kafka import KafkaConsumer

    def cleanup_kafka_topic(self, env):
        # Declarations
        consumer = KafkaConsumer(<topic_name>, group_id=<some_group>,
                                 bootstrap_servers=[<kafka_host:kafka_port>],
                                 auto_offset_reset='earliest', enable_auto_commit=True)
        # Clean-up logic
        for msg in consumer:
            # Do something with the msg
            pass

I am using kafka-python.

Rajat Bhardwaj

1 Answer

The Kafka consumer is not thread-safe (see the "Thread safety" section at https://pypi.org/project/kafka-python/). The way to speed things up is to have multiple partitions on your topic and scale up the number of consumers, all sharing the same consumer group ID. If you have N partitions, you can have up to N active consumers, because within a group each partition is assigned to at most one consumer. Kafka takes care of assigning and reassigning partitions as consumers join or leave, so you can scale on demand (e.g., by observing the lag on a partition). Note that, per the docs, this requires newer (0.9+) Kafka brokers.
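A minimal sketch of that approach with `multiprocessing`, where every worker process builds its own `KafkaConsumer` in the same group (the same "Thread safety" section recommends multiprocessing over sharing a consumer between threads). The topic name, group ID, broker address, the `should_tombstone` rule and the `worker_count` helper are all placeholders I made up for illustration; swap in your own. It also assumes the records have keys and the topic is compacted, since a tombstone is just a record with the same key and a `None` value.

```python
import multiprocessing as mp

# Hypothetical settings, not taken from the question.
TOPIC = "my-topic"
GROUP_ID = "cleanup-group"
BOOTSTRAP = ["localhost:9092"]


def should_tombstone(msg_value):
    """Hypothetical clean-up rule: tombstone values starting with b'stale:'."""
    return msg_value is not None and msg_value.startswith(b"stale:")


def worker_count(num_partitions, max_workers=None):
    """More consumers than partitions would just sit idle, so cap at N."""
    if max_workers is None:
        max_workers = mp.cpu_count()
    return max(1, min(num_partitions, max_workers))


def consume(worker_id):
    # Import and construct the clients inside the child process so that
    # no consumer or producer object is ever shared between processes.
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        TOPIC,
        group_id=GROUP_ID,              # same group => partitions are split
        bootstrap_servers=BOOTSTRAP,
        auto_offset_reset="earliest",
        enable_auto_commit=True,
        consumer_timeout_ms=10000,      # stop iterating after 10 s of silence
    )
    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)

    tombstoned = 0
    for msg in consumer:
        # Records without a key cannot be tombstoned, so skip them.
        if msg.key is not None and should_tombstone(msg.value):
            producer.send(TOPIC, key=msg.key, value=None)
            tombstoned += 1

    producer.flush()
    consumer.close()
    print(f"worker {worker_id}: tombstoned {tombstoned} records")


def run_workers(num_partitions):
    procs = [mp.Process(target=consume, args=(i,))
             for i in range(worker_count(num_partitions))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

To try it, call `run_workers(num_partitions=4)` under an `if __name__ == "__main__":` guard, with the number set to your topic's actual partition count. If the topic has a single partition, extra processes will not help; you would need to add partitions first.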

Roman Kutlak