
I have a list of topics (currently 10), and its size can grow in the future. I know we can spawn multiple threads (one per topic) to consume from each topic, but in my case, if the number of topics increases, then the number of threads consuming from the topics increases, which I do not want, since the topics will not receive data very frequently and the threads would sit idle.

Is there any way to have a single consumer consume from all topics? If yes, how can we achieve it? Also, how will Kafka maintain the offsets? And how can this be written in Python?

OneCricketeer

erك

1 Answer


Programming language isn't relevant.

Simply set the number of executors to 1 when you submit the Spark job.

However, that'll be slower than submitting as many as you truly need, so I'm not sure why you'd do that.
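As a sketch, capping the job at a single executor looks like this at submission time; the application file and the dynamic-allocation setting shown are illustrative assumptions:

```shell
# Submit with exactly one executor; every subscribed topic-partition is
# then consumed by tasks running on that single executor.
spark-submit \
  --num-executors 1 \
  --executor-cores 1 \
  --conf spark.dynamicAllocation.enabled=false \
  my_streaming_app.py  # hypothetical application
```

Disabling dynamic allocation matters here, since otherwise Spark may scale the executor count back up on its own.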

"if the number of topics increases, then the number of threads consuming from the topics increases"

This isn't true. Your upper limit is number of executors * cores-per-executor.

Also, threads would be used per partition of each topic, not one per topic.
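To make that cap concrete, here is a back-of-the-envelope calculation; the executor, core, and partition counts are assumed illustrative values (only the topic count comes from the question):

```python
# Illustrative sizing only: Spark caps consumption parallelism at
# executors * cores-per-executor, regardless of how many topics exist.
num_executors = 4          # assumed cluster setting
cores_per_executor = 2     # assumed cluster setting
max_parallel_tasks = num_executors * cores_per_executor

# Tasks map to Kafka topic-partitions, not topics: 10 topics with
# 3 partitions each yield 30 topic-partitions, processed at most
# max_parallel_tasks at a time with the settings above.
num_topics = 10            # from the question
partitions_per_topic = 3   # assumed
total_partitions = num_topics * partitions_per_topic

print(max_parallel_tasks, total_partitions)  # → 8 30
```

So adding topics does not by itself add threads; it adds topic-partitions that queue behind the same fixed pool of executor cores.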
