
I am running a Spark Streaming application. The application reads messages from a Kafka topic (with 200 partitions) using a direct stream. Occasionally the application throws a ConcurrentModificationException:

java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1361)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$$anon$1.removeEldestEntry(CachedKafkaConsumer.scala:128)
at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:299)
at java.util.HashMap.putVal(HashMap.java:663)
at java.util.HashMap.put(HashMap.java:611)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:158)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:211)
at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:186)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

My Spark cluster has two nodes. The Spark version is 2.1. The application runs two executors. From what I can make out from the exception and the Kafka consumer code, it seems that the same Kafka consumer is being used by two threads. I have no clue how two threads end up accessing the same receiver. Ideally, each executor should have an exclusive Kafka consumer, serviced by a single thread, which reads messages for all of its assigned partitions. The code snippet which reads from Kafka:

JavaInputDStream<ConsumerRecord<String, String>> consumerRecords = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
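
For context, here is a minimal sketch of the surrounding setup that snippet assumes. The broker list, group id, topic name, and batch interval are placeholders, not the actual application values:

    import java.util.Collection;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class KafkaDirectStreamExample {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("kafka-direct-stream");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // Kafka consumer configuration; broker list and group id are placeholders
            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "broker1:9092,broker2:9092");
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "my-streaming-group");
            kafkaParams.put("auto.offset.reset", "latest");
            kafkaParams.put("enable.auto.commit", false);

            // Topic with 200 partitions; name is a placeholder
            Collection<String> topics = Collections.singletonList("my-topic");

            JavaInputDStream<ConsumerRecord<String, String>> consumerRecords =
                    KafkaUtils.createDirectStream(
                            jssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

            // Trivial action so the stream is materialized each batch
            consumerRecords.foreachRDD(rdd ->
                    System.out.println("records in batch: " + rdd.count()));

            jssc.start();
            jssc.awaitTermination();
        }
    }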

1 Answer


In my case, the issue was linked to the Kafka consumer cache size. I changed it from the default of 64 per executor to 200 per executor (200 parallel consumers, matching the 200 partitions). I had to upgrade to Spark 2.2, as the option to change the cache size is not available in Spark 2.1.

spark.streaming.kafka.consumer.cache.maxCapacity=200
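
As a rough sketch, the setting can be applied on the SparkConf before the streaming context is created (the application name and batch interval below are placeholders), or passed to spark-submit with --conf spark.streaming.kafka.consumer.cache.maxCapacity=200:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Raise the per-executor consumer cache so it can hold one cached
    // consumer for every partition the executor may be assigned.
    SparkConf conf = new SparkConf()
            .setAppName("kafka-direct-stream")  // placeholder name
            .set("spark.streaming.kafka.consumer.cache.maxCapacity", "200");

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));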

  • If I have 200 partitions and 2 executors, should I increase spark.streaming.kafka.consumer.cache.maxCapacity to 100, because 2 executors * 100 == 200? – Funzo Jul 05 '19 at 10:32
  • That setting is per executor JVM. Since you have 200 partitions, the partitions per JVM will be 100, so you only need to configure maxCapacity to 100. – scorpio Jul 05 '19 at 11:30