Is there a way I can read large number of messages from a kafka topic using multiprocessing in python?

Question

I am trying to read a Kafka topic from the earliest offset and then tombstone certain records through a Python script. Since there are a lot of messages (more than a million), I want to use multiprocessing to make the script consume them faster. Here's a snippet from the script:

    from kafka import KafkaConsumer

    def cleanup_kafka_topic(self, env):
        # Declarations
        consumer = KafkaConsumer(<topic_name>, group_id=<some_group>,
                                 bootstrap_servers=[<kafka_host:kafka_port>],
                                 auto_offset_reset='earliest', enable_auto_commit=True)
        # Clean-up logic
        for msg in consumer:
            # Do something with the msg
            pass

I am using Kafka-python.

Because this is a kind of I/O operation, I think multithreading suits this case better than multiprocessing. You can test it using the `concurrent.futures` module — mirhossein, Feb 18 '21 at 16:56
You could use aiokafka if you want multiple asynchronous consumers — OneCricketeer, Feb 20 '21 at 15:09

score 1 · Answer 1 · answered Feb 19 '21 at 09:27

The Kafka consumer is not thread safe (see the Thread safety section here: https://pypi.org/project/kafka-python/). The way to speed things up is to have multiple partitions on your topic and scale up the number of consumers, all sharing the same consumer group identifier. If you have N partitions, you can have up to N active consumers in the group, because each partition is assigned to at most one consumer of that group at a time. Kafka takes care of assigning and re-assigning partitions as your consumers come and go, so you can scale on demand (e.g., by observing the lag on a partition). Note that, as per the docs, this requires newer (0.9+) Kafka brokers.
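A minimal sketch of that setup with `multiprocessing`, assuming kafka-python is installed and a broker is reachable. The topic name, group id, broker address, the partition count of 4 and the helper names (`worker_count`, `consume`, `run`) are all placeholders made up for illustration, not taken from the question. Each process builds its own `KafkaConsumer`, so no consumer is shared, and the number of processes is capped at the number of partitions because extra consumers in the group would just sit idle.

```python
import multiprocessing as mp

# Hypothetical values for illustration only; replace with your own.
TOPIC = "my-topic"
GROUP_ID = "cleanup-group"
BOOTSTRAP_SERVERS = ["localhost:9092"]


def worker_count(num_partitions, requested):
    """Cap the number of consumers at the partition count (at least 1)."""
    return max(1, min(num_partitions, requested))


def consume(worker_id):
    # Imported and constructed inside the worker so that every process
    # owns its own consumer; a KafkaConsumer must not be shared.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        TOPIC,
        group_id=GROUP_ID,  # same group id in every worker
        bootstrap_servers=BOOTSTRAP_SERVERS,
        auto_offset_reset="earliest",
        enable_auto_commit=True,
        consumer_timeout_ms=10000,  # stop iterating after 10 s with no messages
    )
    try:
        for msg in consumer:
            # Clean-up logic for msg goes here.
            pass
    finally:
        consumer.close()


def run(num_partitions, requested):
    n = worker_count(num_partitions, requested)
    procs = [mp.Process(target=consume, args=(i,)) for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


if __name__ == "__main__":
    # Assumes the topic has 4 partitions; adjust to your topic.
    run(num_partitions=4, requested=mp.cpu_count())
```

The group coordinator decides which partitions each process gets, so the workers do not need to know about each other. If the topic has a single partition, this will not be faster than one consumer; you would need to add partitions first.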
