
I am running a Spark Streaming application. The application reads messages from a Kafka topic (with 200 partitions) using a direct stream. Occasionally the application throws a ConcurrentModificationException:

java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1361)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$$anon$1.removeEldestEntry(CachedKafkaConsumer.scala:128)
at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:299)
at java.util.HashMap.putVal(HashMap.java:663)
at java.util.HashMap.put(HashMap.java:611)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:158)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:211)
at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:186)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

My Spark cluster has two nodes. The Spark version is 2.1. The application runs two executors. From what I can make out from the exception and the Kafka consumer code, it seems that the same Kafka consumer is being used by two threads. I have no clue how two threads end up accessing the same receiver. Ideally, each executor should have an exclusive Kafka consumer, serviced by a single thread, which reads messages for all of its assigned partitions. The code snippet which reads from Kafka:

JavaInputDStream<ConsumerRecord<String, String>> consumerRecords = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
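
For context, here is a minimal sketch of the surrounding setup that snippet assumes. The broker list, group id, topic name, and batch interval are placeholders, not the actual application values:

    import java.util.Collection;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class KafkaDirectStreamExample {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("kafka-direct-stream");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // Kafka consumer configuration; broker list and group id are placeholders
            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "broker1:9092,broker2:9092");
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "my-streaming-group");
            kafkaParams.put("auto.offset.reset", "latest");
            kafkaParams.put("enable.auto.commit", false);

            // Topic with 200 partitions; name is a placeholder
            Collection<String> topics = Collections.singletonList("my-topic");

            JavaInputDStream<ConsumerRecord<String, String>> consumerRecords =
                    KafkaUtils.createDirectStream(
                            jssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

            // Trivial action so the stream is materialized each batch
            consumerRecords.foreachRDD(rdd ->
                    System.out.println("records in batch: " + rdd.count()));

            jssc.start();
            jssc.awaitTermination();
        }
    }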

1 Answer


In my case, the issue was linked to the Kafka consumer cache size. I changed it from the default of 64 per executor to 200 per executor (200 parallel consumers, matching the 200 partitions). I had to upgrade to Spark 2.2, as the option to change the cache size is not available in Spark 2.1.

spark.streaming.kafka.consumer.cache.maxCapacity=200
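
As a rough sketch, the setting can be applied on the SparkConf before the streaming context is created (the application name and batch interval below are placeholders), or passed to spark-submit with --conf spark.streaming.kafka.consumer.cache.maxCapacity=200:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Raise the per-executor consumer cache so it can hold one cached
    // consumer for every partition the executor may be assigned.
    SparkConf conf = new SparkConf()
            .setAppName("kafka-direct-stream")  // placeholder name
            .set("spark.streaming.kafka.consumer.cache.maxCapacity", "200");

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));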

  • If I have 200 partitions and 2 executors, should I increase spark.streaming.kafka.consumer.cache.maxCapacity to 100, because 2 executors * 100 == 200? – Funzo Jul 05 '19 at 10:32
  • That setting is per executor JVM. Since you have 200 partitions, the partitions per JVM will be 100, so you only need to configure maxCapacity to 100. – scorpio Jul 05 '19 at 11:30