
I'm quite new to Structured Streaming and would like to understand Spark's main streaming metrics in a bit more detail.

I have a Structured Streaming process in Databricks that reads events from one Eventhub, extracts values from those events, creates a new DataFrame and writes that DataFrame to a second Eventhub.

Each event coming from the first Eventhub is an Event Grid event from which I read a URL (emitted when a blob is added to a storage account). Inside a foreachBatch, I read that blob into a new DataFrame and write it to the second Eventhub.
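For context, the blob URL is read from the Event Grid payload roughly like this (a sketch against streamingInputDF from the code below; the data.url field name is assumed from the standard Microsoft.Storage.BlobCreated schema, and my actual parsing is omitted):

import org.apache.spark.sql.functions.get_json_object

// Sketch: each "body" is the Event Grid event as a JSON string and data.url
// points at the blob that was added. If Event Grid delivers an array of events,
// the JSON path would be "$[0].data.url" instead.
val withUrl = streamingInputDF
  .withColumn("blob_url", get_json_object($"body", "$.data.url"))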

The code has the following structure:


val streamingInputDF =
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()
    .select($"body".cast("string"))
    
    
def get_func(batchDF: DataFrame, batchID: Long): Unit = {

  batchDF.persist()

  // collect to the driver so each file can be read with spark.read below
  for (row <- batchDF.rdd.collect) {

    // `path` is derived from the event body in `row` (extraction omitted here)
    val file_url = "/mnt/" + path

    // create a df from the URL that was read (.xml comes from the spark-xml library)
    val df = spark
      .read
      .option("rowTag", "Transaction")
      .xml(file_url)

    if (!df.rdd.isEmpty) {

      // some filtering
      val eh_df = df.select(col(...).as(...), ...)
      val eh_jsoned = eh_df.toJSON.withColumnRenamed("value", "body")

      // write to Eventhub
      eh_jsoned.select("body")
        .write
        .format("eventhubs")
        .options(eventHubsConfWrite.toMap)
        .save()
    }
  }

  batchDF.unpersist()
}


val query_test = streamingInputDF
  .writeStream
  .queryName("query_test")
  .foreachBatch(get_func _)
  .start()

I have tried adding the maxEventsPerTrigger(100) option, but this greatly increases the time between the data arriving in the Storage Account and being consumed in Databricks.

The value of 100 for maxEventsPerTrigger was chosen arbitrarily, just to test the behaviour.
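For reference, this is roughly how the read-side configuration looks (a sketch using the Azure Event Hubs Spark connector; connectionString is a placeholder and the remaining settings are omitted):

import org.apache.spark.eventhubs.EventHubsConf

// Sketch only: connectionString stands in for the real connection string.
val eventHubsConf = EventHubsConf(connectionString)
  .setMaxEventsPerTrigger(100)   // 100 chosen arbitrarily, just for testing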

Looking at the metrics, why does the batch duration keep increasing so much while the processing rate and input rate stay roughly the same?

[screenshot of the streaming query metrics: input rate, processing rate and batch duration]
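For completeness, the same figures can also be pulled programmatically from the query, e.g.:

// Per-micro-batch metrics: numInputRows, inputRowsPerSecond,
// processedRowsPerSecond and the stage timings in durationMs.
query_test.recentProgress.foreach(p => println(p.prettyJson))

// Or just the most recent micro-batch:
println(query_test.lastProgress.prettyJson)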

What approach should I take to improve this process? I'm running it from a Databricks Runtime 7.5 notebook, with Spark 3.0.1 and Scala 2.12.

Thank you all very much in advance.

NOTE:

  • All XML files have the same size
  • The first Eventhub has 20 partitions
  • Data arrives at the first Eventhub at a rate of 2 events/sec
  • This cannot be answered without knowing how much data is available in the EventHub and at which rate data is sent into the input EventHub. Also, it is unclear to which value the maxEventsPerTrigger is set and how much data is actually being consumed within each micro-batch over time. Just as a side note: it looks like you do not need to persist and unpersist the batchDF, as you are not using it more than once. – Michael Heil Mar 19 '21 at 10:20
  • Are all XML files equally sized? Is it guaranteed that in each micro-batch you have the same amount of empty DFs and non-empty DFs? You may realise from my questions that it is very hard to provide a clear answer to your question. – Michael Heil Mar 19 '21 at 10:22
  • You also want to look [here](https://stackoverflow.com/questions/65777481/read-file-path-from-kafka-topic-and-then-read-file-and-write-to-deltalake-in-str/65809786#65809786) on how to have cleaner code within the `get_func` body. – Michael Heil Mar 19 '21 at 10:24
  • Many thanks @mike, I've updated the question to provide more info. As for the ``if (!(df.rdd.isEmpty))`` condition, I use it because I occasionally get exceptions for empty dataframes. Maybe this can be improved with a try/catch (a rough sketch is below). – basigow Mar 19 '21 at 11:19
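A rough sketch of what that try/catch could look like (hypothetical, not what the current code does; writeToEventhub stands in for the select/toJSON/write logic already shown in get_func):

import scala.util.{Failure, Success, Try}

// Hypothetical alternative to the df.rdd.isEmpty guard: attempt the read and
// skip the blob when spark-xml throws (e.g. for an empty or malformed file).
Try(
  spark.read
    .option("rowTag", "Transaction")
    .xml(file_url)
) match {
  case Success(df) if !df.isEmpty =>
    writeToEventhub(df)   // hypothetical helper wrapping the existing select/toJSON/write
  case Success(_) =>
    ()                    // readable but empty: nothing to write
  case Failure(e) =>
    println(s"Skipping $file_url: ${e.getMessage}")
}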
