I am running a Spark Structured Streaming job that creates an empty dataframe and then updates it with each micro-batch, as shown below. With every micro-batch execution, the number of stages increases by 4. To avoid recomputation, I persist the updated staticDF into memory after each update inside the loop. This helps in skipping the additional stages that get created with every new micro-batch.

My questions:

1) Even though the total number of completed stages stays the same (the additional stages are always skipped), can this cause a performance issue once there are millions of skipped stages at some point in time?
2) What happens when some or all of the cached RDD is lost (node/executor failure)? The Spark documentation says that it doesn't materialise all the data received from the micro-batches so far, so does that mean it will need to re-read all events from Kafka to regenerate staticDF?

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, max, row_number}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{LongType, StructType}

// one-time creation of an empty static (not streaming) dataframe
val staticDF_schema = new StructType()
  .add("product_id", LongType)
  .add("created_at", LongType)
var staticDF = sparkSession
  .createDataFrame(sparkSession.sparkContext.emptyRDD[Row], staticDF_schema)

// Note: streamingDF was created from a Kafka source
streamingDF.writeStream
  .trigger(Trigger.ProcessingTime(10000L))
  .foreachBatch { (micro_batch_DF: DataFrame, batchId: Long) =>

    // fetch the max created_at for each product_id in the current micro-batch
    val staging_df = micro_batch_DF.groupBy("product_id")
      .agg(max("created_at").alias("created_at"))

    // update staticDF using the current micro-batch, keeping only the
    // latest row per product_id
    staticDF = staticDF.unionByName(staging_df)
    staticDF = staticDF
      .withColumn("rnk",
        row_number().over(Window.partitionBy("product_id").orderBy(desc("created_at")))
      ).filter("rnk = 1")
      .drop("rnk")
      .cache()
  }
  .start()


  • Isn't that what checkpoints are for? https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing – mazaneicha Apr 14 '20 at 11:36
  • Does this answer your question? [What does "Stage Skipped" mean in Apache Spark web UI?](https://stackoverflow.com/questions/34580662/what-does-stage-skipped-mean-in-apache-spark-web-ui) – user10938362 Apr 14 '20 at 12:26
  • @mazaneicha Thanks, I am looking at the checkpointing documentation here, which says that it stores the offsets and intermediate aggregates to the checkpoint location. It is still not clear whether it will store staticDF on HDFS or only use the offsets to read everything again from Kafka (see the sketch after these comments). https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing – conetfun Apr 14 '20 at 12:27
  • @user10938362 Unfortunately not. I understand recomputation won't happen as the results are already available, but I am concerned there will be overhead during DAG creation, as the total number of stages can reach millions (even though completed stages will always remain 7 and the rest will be skipped) – conetfun Apr 14 '20 at 12:30
  • Checkpoints save both metadata and state. – mazaneicha Apr 14 '20 at 12:49
  • @mazaneicha Is there a way to test it? My goal would be to verify that, in case of a missing cached RDD, it does not re-read all the data from Kafka and instead uses the dataframe's data stored in the checkpoint – conetfun Apr 14 '20 at 12:53
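
For reference, a minimal sketch of enabling a checkpoint location, as suggested in the comments (the path is illustrative). Structured Streaming persists Kafka offsets and the state of its built-in stateful operators there; a DataFrame cached manually inside foreachBatch is not part of that state, so after a failure it is rebuilt from its lineage rather than restored from the checkpoint:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

streamingDF.writeStream
  .trigger(Trigger.ProcessingTime(10000L))
  .option("checkpointLocation", "hdfs:///checkpoints/product_dedup")  // illustrative path
  .foreachBatch { (micro_batch_DF: DataFrame, batchId: Long) =>
    // ... same batch logic as in the question ...
  }
  .start()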

1 Answer


Even though the skipped stages don't need any computation, my job started failing after a certain number of batches. This was because of the DAG growing with every batch execution, eventually making it unmanageable and throwing a stack overflow exception.

To avoid this, I had to break the Spark lineage so that the number of stages doesn't increase with every run (even though they are skipped).
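
A minimal sketch of one way to do that, using Dataset.localCheckpoint() (available since Spark 2.3) in place of cache(): it materialises the current result and returns a DataFrame whose plan no longer references the previous batches. (My actual fix used a driver-side map instead, described in the comments below.)

// inside foreachBatch, replacing the cache() call
staticDF = staticDF
  .unionByName(staging_df)
  .withColumn("rnk",
    row_number().over(Window.partitionBy("product_id").orderBy(desc("created_at"))))
  .filter("rnk = 1")
  .drop("rnk")
  .localCheckpoint()  // truncates the lineage; use sparkContext.setCheckpointDir(...)
                      // plus .checkpoint() for a fault-tolerant variant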

  • How do you break the spark lineage? – thentangler Sep 27 '20 at 04:02
  • As my data was small enough, I stored it in a Scala map and created a new RDD in every micro-batch using the Spark context (instead of building on the cached RDD). This removes the dependency on the previous RDD when computing the new one and breaks the lineage. I also update this Scala map with the values I get in each micro-batch so that it is always current; see the sketch after these comments. – conetfun Sep 28 '20 at 06:42
  • Hi! I have the same problem as you: some skipped stages in my job. May I ask how you handled this? I still don't understand what you mean by Spark lineage; does breaking it prevent the skipped stages? – MADFROST Sep 13 '21 at 03:06
  • @RudyTriSaputra Please post a separate question with details of your issue. My issue could be completely different from yours. – conetfun Sep 13 '21 at 08:13
  • Thank you for responding to my comment. I have a problem related to performance, and I also have a job with a skipped stage. This is my question: [link](https://stackoverflow.com/questions/69157141/performance-issues-write-to-synapse-in-spark) – MADFROST Sep 13 '21 at 08:29
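
A minimal sketch of the map-based approach described in the comments above (names are illustrative; it assumes the deduplicated data is small enough to collect to the driver):

import scala.collection.mutable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.streaming.Trigger

// driver-side state: product_id -> latest created_at seen so far
val latestByProduct = mutable.Map[Long, Long]()

streamingDF.writeStream
  .trigger(Trigger.ProcessingTime(10000L))
  .foreachBatch { (micro_batch_DF: DataFrame, batchId: Long) =>
    // reduce the micro-batch before collecting it to the driver
    micro_batch_DF
      .groupBy("product_id")
      .agg(max("created_at").alias("created_at"))
      .collect()
      .foreach { row =>
        val id = row.getLong(0)
        val ts = row.getLong(1)
        if (latestByProduct.getOrElse(id, Long.MinValue) < ts)
          latestByProduct(id) = ts
      }

    // rebuild staticDF from scratch: a brand-new plan with no lineage back
    // to previous batches, so the DAG stays the same size in every batch
    import sparkSession.implicits._
    staticDF = latestByProduct.toSeq.toDF("product_id", "created_at")
  }
  .start()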