
I'm using Spark 2.1.0 on EMR 5.5.0 (Java 1.8.0_121) with YARN as the resource manager.

I've encountered an issue where, after a few hours, the executor gets killed by YARN once it has used up all of the container's configured physical memory. I've traced the source of the issue to the BZip2Codec, specifically: rdd.saveAsTextFile(path, classOf[BZip2Codec])

If I remove the compression, or change it to any other compression codec, the issue does not occur.
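For illustration, this is the only change I make when testing with a different codec (GzipCodec here, chosen arbitrarily); any codec other than bzip2 behaves fine for me:

import org.apache.hadoop.io.compress.GzipCodec

// Identical save call to the one in the job below, only with Gzip instead of
// Bzip2; with this codec the executor stays within its physical memory limit.
rdd.saveAsTextFile(path, classOf[GzipCodec])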

My spark-submit:

spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --executor-memory 1g \
  --num-executors 1 \
  --conf spark.yarn.executor.memoryOverhead=384 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.kafka.maxRatePerPartition=300 \
  --conf spark.streaming.dynamicAllocation.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --class com.kafkaStreaming.main.Main \
  /path/to/jar.jar

The code itself is very simple: it receives the stream from Kafka and runs foreachRDD to save it to S3:

import kafka.serializer.StringDecoder
import org.apache.hadoop.io.compress.BZip2Codec
import org.joda.time.format.DateTimeFormat
import org.joda.time.{DateTime, DateTimeZone}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object Main {
  def main(args: Array[String]) {

    val outputDirectory = "s3://bucketName/sparkOutput/"
    val topic = "TopicName"
    val topicsSet = topic.split(",").toSet
    val kafkaBroker = "10.0.0.10:9092"
    val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBroker)
    val batchInterval = 5  // seconds

    val sparkConf = new SparkConf().setAppName("sparkStreaming_tests")
    val ssc = new StreamingContext(sparkConf, Seconds(batchInterval))

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    stream.map(_._2).foreachRDD { (rdd, time: Time) =>
      val batchTime = new DateTime(time.milliseconds)
      rdd.saveAsTextFile(
        outputDirectory +
          "/" + "part_date=" + batchTime.toString(DateTimeFormat.forPattern("yyyy-MM-dd")) +
          "/" + "part_hour=" + batchTime.toString(DateTimeFormat.forPattern("HH")) +
          "/" + batchTime.toString(DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss"))
        , classOf[BZip2Codec])
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Searching Google, I came across this Jira: https://issues.apache.org/jira/browse/HADOOP-12619
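If the leak is in the native bzip2 bindings, one workaround I'm considering (untested, so treat it as an assumption) is forcing Hadoop's pure-Java bzip2 implementation. In the driver that would look something like:

// Untested sketch: Bzip2Factory reads io.compression.codec.bzip2.library, and
// "java-builtin" makes BZip2Codec use the pure-Java streams instead of the
// native libbz2 compressor that I suspect is leaking.
ssc.sparkContext.hadoopConfiguration.set("io.compression.codec.bzip2.library", "java-builtin")

The same setting should also be possible from spark-submit via --conf spark.hadoop.io.compression.codec.bzip2.library=java-builtin, but I haven't verified whether it actually avoids the leak or how much slower the Java implementation is.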

Has anyone come across this issue? Are there any known workarounds?
