I am trying to use Spark Structured Streaming's Trigger.Once feature to mimic a batch-like setup. However, I run into trouble on my initial run, because I have a lot of historic data, and for that reason I am also setting the option .option("cloudFiles.includeExistingFiles", "true") so that existing files are processed as well.
So my initial batch becomes very large, since I cannot control the number of files picked up for that batch.
I have also tried the option cloudFiles.maxBytesPerTrigger, but it is ignored when you use Trigger.Once --> https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html
When I specify the maxFilesPerTrigger option, it is ignored as well; the stream simply takes all available files.
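To make the attempt concrete, here is a stripped-down sketch of how I set those limits (the format, byte limit and path below are placeholders, not my real values); with Trigger.Once both of them are ignored:

df_limited = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")                  # placeholder format
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.maxBytesPerTrigger", "10g")      # ignored with Trigger.Once
    .option("maxFilesPerTrigger", 1)                     # ignored as well
    .load("/mnt/landing/placeholder")                    # placeholder path
)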
My code looks like this:
from pyspark.sql.functions import input_file_name, lit

df = (
    spark.readStream.format("cloudFiles")
    .schema(schemaAsStruct)
    .option("cloudFiles.format", sourceFormat)
    .option("delimiter", delimiter)
    .option("header", sourceFirstRowIsHeader)
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("badRecordsPath", badRecordsPath)
    .option("maxFilesPerTrigger", 1)  # ignored when the query runs with Trigger.Once
    .option("cloudFiles.resourceGroup", omitted)
    .option("cloudFiles.region", omitted)
    .option("cloudFiles.connectionString", omitted)
    .option("cloudFiles.subscriptionId", omitted)
    .option("cloudFiles.tenantId", omitted)
    .option("cloudFiles.clientId", omitted)
    .option("cloudFiles.clientSecret", omitted)
    .load(sourceBasePath)
)
# Traceability columns
df = (
    df.withColumn(sourceFilenameColumnName, input_file_name())
    .withColumn(processedTimestampColumnName, lit(processedTimestamp))
    .withColumn(batchIdColumnName, lit(batchId))
)
def process_batch(batchDF, id):
    # Cache the micro-batch, since it is written out twice below
    batchDF.persist()

    # Write the batch itself under the processed-timestamp path
    (batchDF
        .write
        .format(destinationFormat)
        .mode("append")
        .save(destinationBasePath + processedTimestampColumnName + "=" + processedTimestamp))

    # Write a per-source-file row count for traceability
    (batchDF
        .groupBy(sourceFilenameColumnName, processedTimestampColumnName)
        .count()
        .write
        .format(destinationFormat)
        .mode("append")
        .save(batchSourceFilenamesTmpDir))

    batchDF.unpersist()
stream = (
    df.writeStream
    .foreachBatch(process_batch)
    .trigger(once=True)  # process everything available in a single run, then stop
    .option("checkpointLocation", checkpointPath)
    .start()
)
As you can see, I am using the cloudFiles format, which is the source provided by Databricks Auto Loader --> https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html
"Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage.
Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory"
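For completeness, this is a minimal sketch of what I understand that to mean (placeholder format and path, everything else stripped away): the cloudFiles source with the switch for existing files turned on:

df_minimal = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")                  # placeholder format
    .option("cloudFiles.includeExistingFiles", "true")   # also process files already in the directory
    .load("/mnt/landing/placeholder")                    # placeholder path
)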
If I have presented my problem in a confusing way or it is lacking information, please say so.