I'm running a fairly large Spark cluster (96 cores) and polling messages from Kafka. There are around 200 partitions in total across different topics, all processed by a single continuously running streaming job. The Kafka-related options I'm using are:
    spark.streaming.kafka.maxRatePerPartition: 8k
    maxOffsetsPerTrigger: 10 million
But I can see performance degrading in recent runs. The Spark cluster is not starved for resources: it has plenty of free memory and CPU utilisation is only around 30%.
Looking at the Spark UI, only 1-5% of the tasks take noticeably longer to execute; the rest finish quickly. Data skew is also visible.
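For reference, a rough way to confirm the skew is a one-off batch read over the same topics with a count per Kafka partition; the sketch below reuses the same hosts and topics variables as the streaming code further down:

    from pyspark.sql import functions as F

    # One-off batch read over the same topics, counting messages per Kafka partition
    per_partition = (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", hosts)
        .option("subscribe", topics)
        .option("startingOffsets", "earliest")
        .load()
        .groupBy("topic", "partition")
        .count()
        .orderBy(F.desc("count"))
    )
    per_partition.show(200, truncate=False)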
On the executor side, the logs show that Spark is trying to read and reset the same partition offsets multiple times. Is that normal? What can be done to optimise the performance of the Spark-Kafka integration? Could there also be a problem on the Kafka cluster side, and if so, what should be verified there?
    df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", hosts)
        .option("subscribe", topics)
        .option("spark.streaming.kafka.maxRatePerPartition", 9000)
        .option("startingOffsets", "latest")
        .option("maxOffsetsPerTrigger", 10000000)
        .option("minPartitions", 400)
        .option("failOnDataLoss", False)
        .load()
    )
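For completeness, the query is started roughly like this; the sink, checkpoint path and trigger interval below are placeholders rather than the real job's values:

    # Simplified write side; the real job uses a different sink and checkpoint (placeholders here)
    query = (
        df.writeStream
        .format("console")                                # placeholder sink
        .option("checkpointLocation", "/tmp/checkpoint")  # placeholder checkpoint path
        .trigger(processingTime="1 minute")               # micro-batch trigger; maxOffsetsPerTrigger caps each batch
        .start()
    )
    query.awaitTermination()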