
Background: I have written a simple Spark Structured Streaming app to move data from Kafka to S3. I found that in order to support the exactly-once guarantee, Spark creates a _spark_metadata folder, which ends up growing too large; when the streaming app runs for a long time, the metadata folder grows so big that we start getting OOM errors. I want to get rid of the metadata and checkpoint folders of Spark Structured Streaming and manage offsets myself.

How we managed offsets in Spark Streaming: I used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get offsets in Spark Streaming (the DStream API). But I want to know how to get the offsets and other metadata to manage checkpointing ourselves in Spark Structured Streaming. Do you have any sample program that implements checkpointing?
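For context, a Java version of that DStream pattern (a minimal sketch following the spark-streaming-kafka-0-10 integration guide; the servers, topic, group id, batch interval and class name are placeholders, and where the offsets get persisted is up to you):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.CanCommitOffsets;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.HasOffsetRanges;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;
    import org.apache.spark.streaming.kafka010.OffsetRange;

    public class DStreamOffsetsSketch {
      public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext jssc = new JavaStreamingContext(
            new SparkConf().setAppName("dstream-offsets-sketch"), Durations.seconds(30));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "host1:port1,host2:port2");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "example-group");   // placeholder
        kafkaParams.put("enable.auto.commit", false);   // we manage offsets ourselves

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Collections.singletonList("topic1"), kafkaParams));

        stream.foreachRDD(rdd -> {
          // The RDD backing a direct Kafka stream carries its own offset ranges.
          OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
          for (OffsetRange o : offsetRanges) {
            System.out.println(o.topic() + " " + o.partition() + " "
                + o.fromOffset() + " " + o.untilOffset());
          }
          // After the batch output succeeds, store the offsets externally
          // (e.g. in a database) or commit them back to Kafka:
          ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
        });

        jssc.start();
        jssc.awaitTermination();
      }
    }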

How do we manage offsets in Spark Structured Streaming? Looking at this JIRA https://issues-test.apache.org/jira/browse/SPARK-18258, it looks like offsets are not provided. How should we go about this?
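One way we can at least see the offsets Structured Streaming is working with, without reading its internal checkpoint files, is a StreamingQueryListener: the progress of every micro-batch exposes each source's start and end offsets as JSON. A minimal sketch (the println is illustrative; how the offsets are persisted is up to you):

    import org.apache.spark.sql.streaming.SourceProgress;
    import org.apache.spark.sql.streaming.StreamingQueryListener;
    import org.apache.spark.sql.streaming.StreamingQueryListener.QueryProgressEvent;
    import org.apache.spark.sql.streaming.StreamingQueryListener.QueryStartedEvent;
    import org.apache.spark.sql.streaming.StreamingQueryListener.QueryTerminatedEvent;

    // Register on the same session that runs the query; onQueryProgress
    // fires once per completed micro-batch.
    sparkSession.streams().addListener(new StreamingQueryListener() {
        @Override
        public void onQueryStarted(QueryStartedEvent event) { }

        @Override
        public void onQueryProgress(QueryProgressEvent event) {
            for (SourceProgress source : event.progress().sources()) {
                // startOffset/endOffset are JSON strings such as {"topic1":{"0":1234}}
                System.out.println("batch " + event.progress().batchId()
                    + " source " + source.description()
                    + " startOffset=" + source.startOffset()
                    + " endOffset=" + source.endOffset());
            }
        }

        @Override
        public void onQueryTerminated(QueryTerminatedEvent event) { }
    });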

The issue is that within 6 hours the size of the metadata grew to 45 MB, and it keeps growing until it reaches nearly 13 GB. Driver memory allocated is 5 GB. At that point the system crashes with OOM. How can we keep this metadata from growing so large? How can we make the metadata log less information?
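For reference, Spark has a few retention-related settings we have been experimenting with. spark.sql.streaming.minBatchesToRetain is documented; the fileSink.log.* options are internal to the file sink's metadata log and their names and defaults can vary between Spark versions, and even with them the periodic compact files under _spark_metadata keep an entry for every file ever written to the sink, so treat this as a hedged sketch rather than a guaranteed fix (the values are just examples):

    import org.apache.spark.sql.SparkSession;

    SparkSession sparkSession = SparkSession.builder()
        .appName("kafka-to-s3")   // placeholder app name
        // Documented: minimum number of batches whose metadata is kept in the checkpoint (default 100).
        .config("spark.sql.streaming.minBatchesToRetain", 20)
        // Internal file-sink log settings -- verify against the SQLConf of your Spark release:
        // allow superseded _spark_metadata log files to be deleted after compaction ...
        .config("spark.sql.streaming.fileSink.log.deletion", true)
        // ... compact the log every N batches ...
        .config("spark.sql.streaming.fileSink.log.compactInterval", 10)
        // ... and wait this long before physically removing old log files.
        .config("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
        .getOrCreate();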

Code:

1. Read records from the Kafka topic
   Dataset<Row> inputDf = spark
       .readStream()
       .format("kafka")
       .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
       .option("subscribe", "topic1")
       .option("startingOffsets", "earliest")
       .load();
2. Use the from_json API from Spark to extract your data for further transformation in a dataset.
   Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
       ....withColumn("oem_id", col("metadata.oem_id"));
3. Construct a temp table of the above dataset using SQLContext
   SQLContext sqlContext = new SQLContext(sparkSession);
   dataDf.createOrReplaceTempView("event");
4. Flatten events since Parquet does not support hierarchical data.
5. Store output in parquet format on S3
   StreamingQuery query = flatDf.writeStream().format("parquet")... // full query shown below


    Dataset<Row> dataDf = inputDf
        .select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
        .select("event.metadata", "event.data", "event.connection",
                "event.registration_event", "event.version_event");

    SQLContext sqlContext = new SQLContext(sparkSession);
    dataDf.createOrReplaceTempView("event");

    Dataset<Row> flatDf = sqlContext
        .sql("select " + " date, time, id, " + flattenSchema(EVENT_SCHEMA, "event") + " from event");

    StreamingQuery query = flatDf
        .writeStream()
        .outputMode("append")
        .option("compression", "snappy")
        .format("parquet")
        .option("checkpointLocation", checkpointLocation)
        .option("path", outputPath)
        .partitionBy("date", "time", "id")
        .trigger(Trigger.ProcessingTime(triggerProcessingTime))
        .start();

    query.awaitTermination();
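flattenSchema builds the comma-separated list of flattened column expressions that goes into the SELECT above. A simplified sketch of such a helper (not necessarily the exact version used here; it just walks the schema and emits one aliased expression per leaf field):

    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Hypothetical reconstruction: produces e.g. "event.data.temp as event_data_temp, ..."
    public static String flattenSchema(StructType schema, String prefix) {
        StringBuilder cols = new StringBuilder();
        for (StructField field : schema.fields()) {
            String name = prefix + "." + field.name();
            String expr = field.dataType() instanceof StructType
                ? flattenSchema((StructType) field.dataType(), name)  // recurse into nested structs
                : name + " as " + name.replace(".", "_");             // alias leaf columns
            if (expr.isEmpty()) {
                continue;
            }
            if (cols.length() > 0) {
                cols.append(", ");
            }
            cols.append(expr);
        }
        return cols.toString();
    }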

Alchemist

1 Answer


For non-batch Spark Structured Streaming KAFKA integration:

Quote:

Structured Streaming ignores the offsets commits in Apache Kafka.

Instead, it relies on its own offsets management on the driver side which is responsible for distributing offsets to executors and for checkpointing them at the end of the processing round (epoch or micro-batch).

You need not worry if you follow the Spark KAFKA integration guides.

Excellent reference: https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read

For batch, the situation is different: you need to manage offsets yourself and store them.
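For example, a minimal sketch of a batch Kafka read with explicitly managed offsets (the JSON offset values, topic and servers are placeholders; in practice you would load the starting offsets from wherever you persisted them and record the ending offsets after a successful write):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Batch read over an explicit, self-managed offset range.
    Dataset<Row> batchDf = sparkSession.read()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribe", "topic1")
        // placeholders: offsets you stored yourself after the previous run
        .option("startingOffsets", "{\"topic1\":{\"0\":100,\"1\":100}}")
        .option("endingOffsets", "{\"topic1\":{\"0\":200,\"1\":200}}")
        .load();
    // After the write succeeds, persist the ending offsets as the next run's starting offsets.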

UPDATE: Based on the comments I suggest the question is slightly different, and advise you look at Spark Structured Streaming Checkpoint Cleanup. In addition, given your updated comments and the fact that there is no error, I suggest you consult this piece on metadata for Spark Structured Streaming: https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read. Looking at the code, it is different to my style, but I cannot see any obvious error.

thebluephantom
  • Thanks for your response. I have written a simple Spark Structured Streaming app to move data from Kafka to S3. I found that in order to support the exactly-once guarantee Spark creates the _spark_metadata folder, which ends up growing too large, as the streaming app is SUPPOSED TO run FOREVER. When the streaming app runs for a long time the metadata folder grows so big that we start getting OOM errors. The only way to resolve the OOM is to delete the checkpoint and metadata folders and lose VALUABLE customer data. (Open Spark JIRAs: SPARK-24295, SPARK-29995, SPARK-30462.) – Alchemist Jun 17 '20 at 19:08
  • That's the down-side of the situation now. There are various posts of people trying to alleviate this, but with check-pointing there is no issue. It will all re-start in non-batch. Puzzled, as I just re-read the Databricks Guide on SSS for certification and they mention nothing on that. With the write-ahead log etc. and check-pointing there should be no issue. I have noted various errors in their courses though. The main issue is to stop with schema changes etc. – thebluephantom Jun 17 '20 at 19:13
  • You might want to amend your question and title. – thebluephantom Jun 17 '20 at 19:13
  • Thanks so much @thebluephantom. Thinking of moving BACK to SPARK STREAMING (the batch version). Looks like specifying a checkpoint is mandatory. I have tried to remove the checkpoint... I cannot start the application now. 2020-06-17 20:00:04,222 ERROR [Driver] org.apache.spark.deploy.yarn.ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: checkpointLocation must be specified either through option("checkpointLocation", ...) or – Alchemist Jun 17 '20 at 20:15
  • Then how to get rid of this OOM caused by _spark_metadata size growing so large... – Alchemist Jun 17 '20 at 20:21
  • look at that SO link – thebluephantom Jun 17 '20 at 20:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216164/discussion-between-alchemist-and-thebluephantom). – Alchemist Jun 17 '20 at 20:27
  • Should I move to Flink or Spark Streaming? Really lost. Eventually I need to create a data pipeline using some kind of ML, so I cannot use Kafka Connect. – Alchemist Jun 18 '20 at 05:19
  • What I do not get is the folders vs OOM. They are, in my mind, 2 different things. Looking at the question and comments it is now unclear what is being asked. Maybe a more focused question with the error message shown. In any event I can say Spark Streaming is legacy and Flink all the rage - but how to call on this? That link I added to the question, does that not state what you need to do? Seems to me it helps. Also, withWatermark, do you need that? Please re-edit the question with the actual errors shown. That should help; people like OneCricketeer tend to know all on this topic. – thebluephantom Jun 18 '20 at 07:26
  • There is no error, but the system just hung after 3-5 days... I assumed it may be because of the metadata... But I guess I reduced the size of the metadata by setting sink configurations... I will get back to you when I get any error. Thanks again so much!! – Alchemist Jun 19 '20 at 10:41