
Problem

The goal is a Spark Structured Streaming application that reads data from Kafka and stores it in a Delta Lake table. The table is partitioned at a fairly fine granularity: the first partition column is organization_id (there are more than 5,000 organizations) and the second is the date.
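
As a rough sanity check on that layout, one can count how many data files each (organization_id, date) partition ends up with; with 5,000+ organizations this scheme can easily produce a very large number of small files. This is only a sketch: the table path below is a placeholder (not from the original post) and it assumes an active SparkSession named spark.

    import org.apache.spark.sql.functions.{col, countDistinct, input_file_name}

    // Placeholder path; substitute the real Delta table location.
    val tablePath = "s3://my-bucket/usage_fact"

    spark.read.format("delta").load(tablePath)
      .withColumn("file", input_file_name())
      .groupBy("organization_id", "date")
      .agg(countDistinct("file").as("num_files"))
      .orderBy(col("num_files").desc)
      .show(20, truncate = false)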

The application meets the expected latency, but it never stays up for more than one day. The error is always memory-related, as shown below.

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f8000000, 671088640, 0) failed; error='Cannot allocate memory' (errno=12)

Nothing is explicitly persisted or cached, yet memory usage is already high for the whole application.

What I've tried

Increasing the memory and the number of workers was the first thing I tried, but I also changed the number of partitions from 4 to 16.

Execution script

spark-submit \
  --verbose \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2G \
  --executor-memory 4G \
  --executor-cores 2 \
  --num-executors 4 \
  --files s3://my-bucket/log4j-driver.properties,s3://my-bucket/log4j-executor.properties \
  --jars /home/hadoop/delta-core_2.12-0.8.0.jar,/usr/lib/spark/external/lib/spark-sql-kafka-0-10.jar \
  --class my.package.app \
  --conf spark.driver.memoryOverhead=512 \
  --conf spark.executor.memoryOverhead=1024 \
  --conf spark.memory.fraction=0.8 \
  --conf spark.memory.storageFraction=0.3 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.rdd.compress=true \
  --conf spark.yarn.max.executor.failures=100 \
  --conf spark.yarn.maxAppAttempts=100 \
  --conf spark.task.maxFailures=100 \
  --conf spark.executor.heartbeatInterval=20s \
  --conf spark.network.timeout=300s \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.driver.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-driver.hprof -Dlog4j.configuration=log4j-driver.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
  --conf spark.executor.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-executor.hprof -Dlog4j.configuration=log4j-executor.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
  --conf spark.sql.session.timeZone=UTC \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --conf spark.databricks.delta.retentionDurationCheck.enabled=false \
  --conf spark.databricks.delta.vacuum.parallelDelete.enabled=true \
  --conf spark.sql.shuffle.partitions=16 \
  --name "UsageFactProcessor" \
  application.jar
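
For reference, the total memory those settings request from YARN works out roughly as follows (a sketch using the values above; actual container sizes get rounded up to YARN's minimum allocation):

    // Rough cluster memory footprint implied by the submit script above.
    val driverMb   = 2048 + 512   // --driver-memory 2G + spark.driver.memoryOverhead
    val executorMb = 4096 + 1024  // --executor-memory 4G + spark.executor.memoryOverhead
    val totalMb    = driverMb + 4 * executorMb
    // totalMb = 2560 + 4 * 5120 = 23040 MB (~22.5 GB) requested in total.
    // The failed os::commit_memory call in the log asked for 671088640 bytes (= 640 MB)
    // of native memory, i.e. the OS refused the allocation; this is not a Java-heap
    // OutOfMemoryError, so overhead/off-heap usage matters here, not just heap size.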

Code

    import org.apache.spark.sql.streaming.{OutputMode, Trigger}

    // Read from Kafka; each micro-batch is capped at 50,000 offsets.
    val source = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", broker)
      .option("subscribe", topic)
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", value = false)
      .option("fetchOffset.numRetries", 10)
      .option("fetchOffset.retryIntervalMs", 1000)
      .option("maxOffsetsPerTrigger", 50000L)
      .option("kafkaConsumer.pollTimeoutMs", 300000L)
      .load()

    val transformed = source
      .transform(applySchema)

    // Write to the Delta table once per minute, partitioned by organization_id and date.
    val query = transformed
      .coalesce(16)
      .writeStream
      .trigger(Trigger.ProcessingTime("1 minute"))
      .outputMode(OutputMode.Append)
      .format("delta")
      .partitionBy("organization_id", "date")
      .option("path", table)
      .option("checkpointLocation", checkpoint)
      .option("mergeSchema", "true")
      .start()

    spark.catalog.clearCache()
    query.awaitTermination()

Versions

Spark: 3.0.1

Delta: 0.8.0

Question

What do you think may be causing this problem?

1 Answer

I just upgraded Delta Lake to version 1.0.0 and the problem stopped happening.
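
A minimal sketch of that upgrade, assuming the job is built with sbt (the build file below is illustrative, not from the original post); note that Delta Lake 1.0.0 is built against Spark 3.1.x, so the Spark dependency and cluster runtime move as well:

    // build.sbt (illustrative)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"            % "3.1.2" % "provided",
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.1.2" % "provided",
      "io.delta"         %% "delta-core"           % "1.0.0"
    )

On the submit side, the --jars entry pointing at delta-core_2.12-0.8.0.jar would be replaced with the 1.0.0 jar, or with --packages io.delta:delta-core_2.12:1.0.0.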