
I am trying to use Spark Structured Streaming's Trigger Once feature to mimic a batch-like setup. However, I run into trouble on my initial batch because I have a lot of historic data, and for that reason I am also using the option .option("cloudFiles.includeExistingFiles", "true") to process existing files.

So my initial batch becomes very big, since I cannot control the number of files in the batch.

I have also tried the cloudFiles.maxBytesPerTrigger option; however, it is ignored when you use Trigger Once --> https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html

When I specify the maxFilesPerTrigger option, it is ignored as well; it just takes all the available files.

My code looks like this:

from pyspark.sql.functions import input_file_name, lit

df = (
  spark.readStream.format("cloudFiles")
    .schema(schemaAsStruct)
    .option("cloudFiles.format", sourceFormat)
    .option("delimiter", delimiter)
    .option("header", sourceFirstRowIsHeader)
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("badRecordsPath", badRecordsPath)
    .option("maxFilesPerTrigger", 1)
    .option("cloudFiles.resourceGroup", omitted)
    .option("cloudFiles.region", omitted)
    .option("cloudFiles.connectionString", omitted)
    .option("cloudFiles.subscriptionId", omitted)
    .option("cloudFiles.tenantId", omitted)
    .option("cloudFiles.clientId", omitted)
    .option("cloudFiles.clientSecret", omitted)
    .load(sourceBasePath)
)

# Traceability columns
df = (
  df.withColumn(sourceFilenameColumnName, input_file_name()) 
    .withColumn(processedTimestampColumnName, lit(processedTimestamp))
    .withColumn(batchIdColumnName, lit(batchId))
)

def process_batch(batchDF, id):
  # Cache the micro-batch, since it is written out twice below
  batchDF.persist()
  
  # Write the micro-batch to the destination path
  (batchDF
     .write
     .format(destinationFormat)
     .mode("append")
     .save(destinationBasePath + processedTimestampColumnName + "=" +  processedTimestamp)
  )
    
  # Record per-source-file row counts for traceability
  (batchDF
   .groupBy(sourceFilenameColumnName, processedTimestampColumnName)
   .count()
   .write
   .format(destinationFormat)
   .mode("append")
   .save(batchSourceFilenamesTmpDir))
  
  batchDF.unpersist()

stream = (
  df.writeStream
    .foreachBatch(process_batch)
    .trigger(once=True)
    .option("checkpointLocation", checkpointPath)
    .start()
)

As you can see, I am using the cloudFiles format, which is the format of the Databricks Auto Loader --> https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html

"Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage.

Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory"

If I have presented my problem in a confusing way, or it is lacking information, please say so.

2 Answers


Unfortunately, with Trigger Once, Spark 3.x (DBR >= 7.x) completely ignores options like maxFilesPerTrigger, etc. that limit the amount of data pulled for processing. In this case it will try to process all the data in one go, and sometimes this can lead to performance problems.

To work around that, you can use the following hack: run the stream without Trigger Once (so the rate limits are honored), periodically check the value of numInputRows in stream.lastProgress, and if it stays equal to 0 for some period of time, issue stream.stop().
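A minimal sketch of that monitoring loop, assuming stream is the StreamingQuery returned by writeStream.start() without Trigger Once; the helper name stop_when_idle and the timeout values are assumptions, not part of the original answer:

import time

# Hypothetical helper: stop the query once it has reported no new input rows
# for idle_seconds. Assumes `stream` is an active StreamingQuery.
def stop_when_idle(stream, idle_seconds=600, poll_seconds=30):
  idle_since = None
  while stream.isActive:
    progress = stream.lastProgress   # dict with the latest micro-batch metrics, or None before the first batch
    if progress is not None and progress["numInputRows"] == 0:
      idle_since = idle_since or time.time()
      if time.time() - idle_since >= idle_seconds:
        stream.stop()                # the checkpoint remembers which files were already processed
        break
    else:
      idle_since = None              # data arrived (or no progress yet), so reset the idle timer
    time.sleep(poll_seconds)

stop_when_idle(stream)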

Update, October 2021: it looks like this will be fixed in Spark 3.3 by introducing a new trigger type, Trigger.AvailableNow (see SPARK-36533).
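For reference, in PySpark the new trigger is expected to be usable roughly like this (a sketch; it requires a runtime shipping Spark 3.3+, which is not released at the time of writing):

stream = (
  df.writeStream
    .foreachBatch(process_batch)
    .trigger(availableNow=True)   # processes all available data in rate-limited micro-batches, then stops
    .option("checkpointLocation", checkpointPath)
    .start()
)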

– Alex Ott

What exactly is the trouble that you are running into on your initial batch? Can you provide more details or error messages?

Why is it a problem that your initial batch is very big? If you have a lot of historical data, this would be expected.

Something to consider is subfolders: are your files located in subfolders, or only in the root sourceBasePath? If they are within subfolders, try using this option for readStream:

option("recursiveFileLookup", "true")

I found this resolved my Auto Loader issues as I had data files landing in sub folders / partitions.
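For context, the option just sits next to the other reader options; here is a trimmed-down sketch based on the question's reader, with most cloudFiles options omitted:

df = (
  spark.readStream.format("cloudFiles")
    .schema(schemaAsStruct)
    .option("cloudFiles.format", sourceFormat)
    .option("recursiveFileLookup", "true")   # also pick up files in nested subfolders / partitions
    .load(sourceBasePath)
)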

– Zegor
  • Well, the problem is that I cannot control how many files a batch may consist of at most. Let's say the last batch was two hours ago, and since then 100,000 new files have shown up in the source directory, but I only want to process at most 50,000 files per batch - how can I control this? This can become a problem for the running cluster if it isn't big enough to handle 100,000 files in a batch. – Mathias Bigler Oct 04 '21 at 20:53
  • If that's the case, perhaps the initial load can be done manually using non-structured streaming queries - and batched as necessary - and then be merged with a structured streaming / Auto Loader query. – Zegor Oct 04 '21 at 23:52
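A rough sketch of the manual, batched initial load suggested in the comment above, assuming a Databricks environment (dbutils) and the variable names from the question; the flat directory listing and the chunk size of 50,000 files are assumptions:

# Hypothetical one-off backfill: list the existing files, load them in fixed-size
# chunks with a plain (non-streaming) read, and write each chunk to the destination.
# Afterwards, the Auto Loader stream would handle newly arriving files.
files = [f.path for f in dbutils.fs.ls(sourceBasePath)]   # flat listing; recurse yourself for subfolders
chunk_size = 50000

for i in range(0, len(files), chunk_size):
  chunk = files[i:i + chunk_size]
  (spark.read
     .format(sourceFormat)
     .schema(schemaAsStruct)
     .option("header", sourceFirstRowIsHeader)
     .load(chunk)                   # DataFrameReader.load accepts a list of paths
     .write
     .format(destinationFormat)
     .mode("append")
     .save(destinationBasePath))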