I am trying to run a Spark Streaming application with Kafka on YARN, and I am getting the following stack trace error:
Caused by: org.apache.kafka.common.config.ConfigException: Missing required configuration "partition.assignment.strategy" which has no default value.
    at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:124)
    at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:48)
    at org.apache.kafka.clients.consumer.ConsumerConfig.<init>(ConsumerConfig.java:194)
    at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:380)
    at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:363)
    at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:350)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.<init>(CachedKafkaConsumer.scala:45)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:194)
    at org.apache.spark.streaming.kafka010.KafkaRDDIterator.<init>(KafkaRDD.scala:252)
    at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:212)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
Here is a snippet of my code showing how I create the Kafka stream with Spark Streaming:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(sc, Seconds(60))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "*bootstrap_url:port*",
  "security.protocol" -> "SASL_PLAINTEXT",
  "sasl.kerberos.service.name" -> "kafka",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "annotation-test",
  // Tried commenting and uncommenting this property
  //"partition.assignment.strategy" -> "org.apache.kafka.clients.consumer.RangeAssignor",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val topics = Array("*topic-name*")

val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams))

val valueKafka = kafkaStream.map(record => record.value())
I have gone through the following posts:
- https://issues.apache.org/jira/browse/KAFKA-4547
- Pyspark Structured Streaming Kafka configuration error
Based on these, I have updated the kafka-clients jar in my fat jar to version 0.10.2.0, replacing the 0.10.1.0 that the spark-streaming-kafka jar pulls in by default as a transitive dependency. The job works fine when I run it on a single node with master set to local. I am running Spark 2.3.1.
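For context, here is a minimal sketch of how that version override can be expressed, assuming an sbt build (my actual build definition may differ slightly; the exclude/add pattern below is illustrative):

// build.sbt (sketch): drop the transitive kafka-clients 0.10.1.0 and pin 0.10.2.0 explicitly
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided",
  ("org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.1")
    .exclude("org.apache.kafka", "kafka-clients"),
  "org.apache.kafka" % "kafka-clients" % "0.10.2.0"
)

With an override like this, only kafka-clients 0.10.2.0 should end up inside the fat jar.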