
I am trying to use Spark Structured Streaming's Trigger Once feature to mimic a batch-like setup. However, I run into trouble on my initial batch because I have a lot of historic data, and for that reason I am also using the option .option("cloudFiles.includeExistingFiles", "true") to process existing files.

So my initial batch becomes very big, since I cannot control the number of files in the batch.

I have also tried the cloudFiles.maxBytesPerTrigger option; however, it is ignored when you use Trigger Once --> https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html

When I specify the maxFilesPerTrigger option, it is ignored as well; it just takes all the available files.

My code looks like this:

from pyspark.sql.functions import input_file_name, lit

df = (
  spark.readStream.format("cloudFiles")
    .schema(schemaAsStruct)
    .option("cloudFiles.format", sourceFormat)
    .option("delimiter", delimiter)
    .option("header", sourceFirstRowIsHeader)
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("badRecordsPath", badRecordsPath)
    .option("maxFilesPerTrigger", 1)
    .option("cloudFiles.resourceGroup", omitted)
    .option("cloudFiles.region", omitted)
    .option("cloudFiles.connectionString", omitted)
    .option("cloudFiles.subscriptionId", omitted)
    .option("cloudFiles.tenantId", omitted)
    .option("cloudFiles.clientId", omitted)
    .option("cloudFiles.clientSecret", omitted)
    .load(sourceBasePath)
)

# Traceability columns
df = (
  df.withColumn(sourceFilenameColumnName, input_file_name()) 
    .withColumn(processedTimestampColumnName, lit(processedTimestamp))
    .withColumn(batchIdColumnName, lit(batchId))
)

def process_batch(batchDF, id):
  # Cache the micro-batch, since it is written out twice below
  batchDF.persist()
  
  # Write the micro-batch to the destination path
  (batchDF
     .write
     .format(destinationFormat)
     .mode("append")
     .save(destinationBasePath + processedTimestampColumnName + "=" +  processedTimestamp)
  )
    
  # Record per-source-file row counts for traceability
  (batchDF
   .groupBy(sourceFilenameColumnName, processedTimestampColumnName)
   .count()
   .write
   .format(destinationFormat)
   .mode("append")
   .save(batchSourceFilenamesTmpDir))
  
  batchDF.unpersist()

stream = (
  df.writeStream
    .foreachBatch(process_batch)
    .trigger(once=True)
    .option("checkpointLocation", checkpointPath)
    .start()
)

As you can see, I am using the cloudFiles format, which is the format of the Databricks Auto Loader --> https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html

"Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage.

Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory"

If I have presented my problem in a confusing way, or it is lacking information, please say so.

2 Answers


Unfortunately, with Trigger Once, Spark 3.x (DBR >= 7.x) completely ignores options like maxFilesPerTrigger, etc. that limit the amount of data pulled for processing. In this case it will try to process all the data in one go, and sometimes this can lead to performance problems.

To work around that, you can use the following hack: run the stream without Trigger Once (so the rate limits are honored), periodically check the value of numInputRows in stream.lastProgress, and if it stays equal to 0 for some period of time, issue stream.stop().
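A minimal sketch of that monitoring loop, assuming stream is the StreamingQuery returned by writeStream.start() without Trigger Once; the helper name stop_when_idle and the timeout values are assumptions, not part of the original answer:

import time

# Hypothetical helper: stop the query once it has reported no new input rows
# for idle_seconds. Assumes `stream` is an active StreamingQuery.
def stop_when_idle(stream, idle_seconds=600, poll_seconds=30):
  idle_since = None
  while stream.isActive:
    progress = stream.lastProgress   # dict with the latest micro-batch metrics, or None before the first batch
    if progress is not None and progress["numInputRows"] == 0:
      idle_since = idle_since or time.time()
      if time.time() - idle_since >= idle_seconds:
        stream.stop()                # the checkpoint remembers which files were already processed
        break
    else:
      idle_since = None              # data arrived (or no progress yet), so reset the idle timer
    time.sleep(poll_seconds)

stop_when_idle(stream)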

Update, October 2021: it looks like this will be fixed in Spark 3.3 by introducing a new trigger type, Trigger.AvailableNow (see SPARK-36533).
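For reference, in PySpark the new trigger is expected to be usable roughly like this (a sketch; it requires a runtime shipping Spark 3.3+, which is not released at the time of writing):

stream = (
  df.writeStream
    .foreachBatch(process_batch)
    .trigger(availableNow=True)   # processes all available data in rate-limited micro-batches, then stops
    .option("checkpointLocation", checkpointPath)
    .start()
)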

– Alex Ott

What exactly is the trouble that you are running into on your initial batch? Can you provide more details or error messages?

Why is it a problem that your initial batch is very big? If you have a lot of historical data, this would be expected.

Something to consider is subfolders: are your files located in subfolders, or only in the root sourceBasePath? If they are within subfolders, try using this option for readStream:

option("recursiveFileLookup", "true")

I found this resolved my Auto Loader issues as I had data files landing in sub folders / partitions.
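For context, the option just sits next to the other reader options; here is a trimmed-down sketch based on the question's reader, with most cloudFiles options omitted:

df = (
  spark.readStream.format("cloudFiles")
    .schema(schemaAsStruct)
    .option("cloudFiles.format", sourceFormat)
    .option("recursiveFileLookup", "true")   # also pick up files in nested subfolders / partitions
    .load(sourceBasePath)
)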

– Zegor
  • Well, the problem is that I cannot control how many files a batch may consist of at most. Let's say the last batch was two hours ago, and since then 100,000 new files have shown up in the source directory, but I only want to process at most 50,000 files per batch - how can I control this? This can become a problem for the running cluster if it isn't big enough to handle 100,000 files in a batch. – Mathias Bigler Oct 04 '21 at 20:53
  • If that's the case, perhaps the initial load can be done manually using non-structured streaming queries - and batched as necessary - and then be merged with a structured streaming / Auto Loader query. – Zegor Oct 04 '21 at 23:52
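A rough sketch of the manual, batched initial load suggested in the comment above, assuming a Databricks environment (dbutils) and the variable names from the question; the flat directory listing and the chunk size of 50,000 files are assumptions:

# Hypothetical one-off backfill: list the existing files, load them in fixed-size
# chunks with a plain (non-streaming) read, and write each chunk to the destination.
# Afterwards, the Auto Loader stream would handle newly arriving files.
files = [f.path for f in dbutils.fs.ls(sourceBasePath)]   # flat listing; recurse yourself for subfolders
chunk_size = 50000

for i in range(0, len(files), chunk_size):
  chunk = files[i:i + chunk_size]
  (spark.read
     .format(sourceFormat)
     .schema(schemaAsStruct)
     .option("header", sourceFirstRowIsHeader)
     .load(chunk)                   # DataFrameReader.load accepts a list of paths
     .write
     .format(destinationFormat)
     .mode("append")
     .save(destinationBasePath))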