Problem
The goal is to have a Spark Structured Streaming application that reads data from Kafka and writes the data to a Delta Lake table. The table is partitioned quite finely: the first partition column is organization_id (there are more than 5000 organizations) and the second is the date.
The application meets the expected latency, but it never stays up for more than one day. The failure is always memory-related, as shown below.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f8000000, 671088640, 0) failed; error='Cannot allocate memory' (errno=12)
There is no explicit caching or persistence in the code, yet memory usage is already high across the whole application.
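For scale, here is a rough, purely illustrative sketch of how the number of (organization_id, date) partitions already written could be counted; the table path below is a hypothetical placeholder, not the real location used by the job:

import org.apache.spark.sql.SparkSession

// Illustrative only: count the distinct (organization_id, date) combinations,
// which corresponds to the number of partition directories in the Delta table.
val spark = SparkSession.builder().appName("PartitionCount").getOrCreate()
val tablePath = "s3://my-bucket/usage_fact"  // hypothetical path; replace with the real table location
val partitionCount = spark.read.format("delta").load(tablePath)
  .select("organization_id", "date")
  .distinct()
  .count()
println(s"Distinct (organization_id, date) partitions: $partitionCount")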
What I've tried
Increasing memory and the number of workers were the first things I tried; I also changed the number of partitions from 4 to 16.
Script of Execution
spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--driver-memory 2G \
--executor-memory 4G \
--executor-cores 2 \
--num-executors 4 \
--files s3://my-bucket/log4j-driver.properties,s3://my-bucket/log4j-executor.properties \
--jars /home/hadoop/delta-core_2.12-0.8.0.jar,/usr/lib/spark/external/lib/spark-sql-kafka-0-10.jar \
--class my.package.app \
--conf spark.driver.memoryOverhead=512 \
--conf spark.executor.memoryOverhead=1024 \
--conf spark.memory.fraction=0.8 \
--conf spark.memory.storageFraction=0.3 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.rdd.compress=true \
--conf spark.yarn.max.executor.failures=100 \
--conf spark.yarn.maxAppAttempts=100 \
--conf spark.task.maxFailures=100 \
--conf spark.executor.heartbeatInterval=20s \
--conf spark.network.timeout=300s \
--conf spark.driver.maxResultSize=0 \
--conf spark.driver.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-driver.hprof -Dlog4j.configuration=log4j-driver.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
--conf spark.executor.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-executor.hprof -Dlog4j.configuration=log4j-executor.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
--conf spark.sql.session.timeZone=UTC \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
--conf spark.databricks.delta.retentionDurationCheck.enabled=false \
--conf spark.databricks.delta.vacuum.parallelDelete.enabled=true \
--conf spark.sql.shuffle.partitions=16 \
--name "UsageFactProcessor" \
application.jar
Code
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val source = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", broker)
.option("subscribe", topic)
.option("startingOffsets", "latest")
.option("failOnDataLoss", value = false)
.option("fetchOffset.numRetries", 10)
.option("fetchOffset.retryIntervalMs", 1000)
.option("maxOffsetsPerTrigger", 50000L)
.option("kafkaConsumer.pollTimeoutMs", 300000L)
.load()
val transformed = source
.transform(applySchema)
val query = transformed
.coalesce(16)
.writeStream
.trigger(Trigger.ProcessingTime("1 minute"))
.outputMode(OutputMode.Append)
.format("delta")
.partitionBy("organization_id", "date")
.option("path", table)
.option("checkpointLocation", checkpoint)
.option("mergeSchema", "true")
.start()
spark.catalog.clearCache()
query.awaitTermination()
Versions
Spark: 3.0.1
Delta: 0.8.0
Question
What do you think may be causing this problem?