
We have a repository of input files like <rootpath>\2023\01\*\*Events*.xml. This path points to the input XML files that need to be read with Spark Structured Streaming, so the events can be parsed, converted to the relevant DataFrame, and stored in Delta tables. The input folder has around 150,000 files, averaging 2 MB each.

The files are read as follows:

val interimDF = spark.readStream
      .format("text")
      .option("wholeText", "true")
      .option("maxFilesPerTrigger", maxFilesPerTrigger)
      .load(inputFolder)

And the transformation and write are done using:

interimDF.writeStream
      .queryName("EventsDataStreaming")
      .foreachBatch(writeInBatch)
      .option("checkpointLocation",checkpointFolder)
      .trigger(Trigger.AvailableNow())
      .outputMode("append")
      .start()
      .awaitTermination()
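For context, writeInBatch is not shown above; a minimal sketch of the kind of foreachBatch handler involved is below. parseEvents and deltaTablePath are placeholders for illustration, not part of the actual job:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical foreachBatch handler: parses each micro-batch of whole-file
// XML strings into an events DataFrame and appends it to a Delta table.
// parseEvents and deltaTablePath are illustrative names only.
def writeInBatch(batchDF: DataFrame, batchId: Long): Unit = {
  val eventsDF = parseEvents(batchDF)   // XML text -> structured event columns
  eventsDF.write
    .format("delta")
    .mode("append")
    .save(deltaTablePath)
}
```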

The strange problem is that it took nearly 14 hours just to read the files! That is, in the Databricks Spark UI there was no transformation activity for the first 14 hours; all the transformation and writes to Delta happened in the last 3 hours. For the job to succeed, I had to allocate a large cluster with 60+ GB of memory for the driver and worker nodes. With less memory on the driver and worker nodes, the job aborts with an OutOfMemory error!

After this, I did another experiment: I split the job into smaller chunks by specifying the path date-wise. For example,

first job : <rootpath>\2023\01\01\*Events*.xml

second job : <rootpath>\2023\01\02\*Events*.xml

These jobs ran much faster (3 to 5x) and also completed on a cluster with less memory (14 GB each).
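The date-wise splitting above can be sketched as a simple driver loop; rootPath and runStreamingJob are illustrative names standing in for the actual path prefix and the read/write pipeline shown earlier:

```scala
// Run one streaming job per day folder instead of one job over the whole month.
// rootPath and runStreamingJob are placeholders for illustration.
val days = (1 to 31).map(d => f"$d%02d")
for (day <- days) {
  val inputFolder = s"$rootPath/2023/01/$day/*Events*.xml"
  runStreamingJob(inputFolder)   // readStream ... writeStream ... awaitTermination
}
```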

I want to know:

  1. How can we avoid this initial delay in reading the files (14 hours for one month of data)?
  2. How can we run this without splitting it into smaller jobs, on a smaller-memory cluster, but at least 5x faster?

We are using Spark 3.3.0.

Alex Ott
Ganesha

0 Answers