I'm running a fairly large Spark cluster (96 cores) and polling messages from Kafka. There are around 200 partitions in total across different topics, all processed by a single continuously running streaming job. The Kafka-related options I'm using are:
    spark.streaming.kafka.maxRatePerPartition: 8k
    maxOffsetsPerTrigger: 10 million
But I can see performance degrading in recent runs. The Spark cluster is not starved for resources: it has plenty of free memory and CPU utilisation is only around 30%.
Looking at the Spark UI, only 1-5% of the tasks take noticeably longer to execute; the rest finish quickly. Data skew is also visible.
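For reference, a rough way to confirm the skew is a one-off batch read over the same topics with a count per Kafka partition; the sketch below reuses the same hosts and topics variables as the streaming code further down:

    from pyspark.sql import functions as F

    # One-off batch read over the same topics, counting messages per Kafka partition
    per_partition = (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", hosts)
        .option("subscribe", topics)
        .option("startingOffsets", "earliest")
        .load()
        .groupBy("topic", "partition")
        .count()
        .orderBy(F.desc("count"))
    )
    per_partition.show(200, truncate=False)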
On the executor side, the logs show that Spark is trying to read and reset the same partition offsets multiple times. Is that normal? What can be done to optimise the performance of the Spark-Kafka integration? Could there also be a problem on the Kafka cluster side, and if so, what should be verified there?
    df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", hosts)
        .option("subscribe", topics)
        .option("spark.streaming.kafka.maxRatePerPartition", 9000)
        .option("startingOffsets", "latest")
        .option("maxOffsetsPerTrigger", 10000000)
        .option("minPartitions", 400)
        .option("failOnDataLoss", False)
        .load()
    )
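For completeness, the query is started roughly like this; the sink, checkpoint path and trigger interval below are placeholders rather than the real job's values:

    # Simplified write side; the real job uses a different sink and checkpoint (placeholders here)
    query = (
        df.writeStream
        .format("console")                                # placeholder sink
        .option("checkpointLocation", "/tmp/checkpoint")  # placeholder checkpoint path
        .trigger(processingTime="1 minute")               # micro-batch trigger; maxOffsetsPerTrigger caps each batch
        .start()
    )
    query.awaitTermination()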