There is a data lake of CSV files that's updated throughout the day. I'm trying to create a Spark Structured Streaming job with the Trigger.Once feature outlined in this blog post to periodically write the new data that's been added to the CSV data lake to a Parquet data lake.
Here's what I have:
val df = spark
  .readStream
  .schema(s)
  .csv("s3a://csv-data-lake-files")
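For context, s is just a StructType I define by hand before the read, and processedDf in the snippets below is essentially df with a few transformations applied (omitted here since they shouldn't matter). A minimal sketch with placeholder column names, plus the Trigger import used below:

import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

// Placeholder columns -- the real schema has more fields
val s = new StructType()
  .add("id", StringType)
  .add("payload", StringType)
  .add("created_at", TimestampType)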
The following command wrote all the data to the Parquet lake, but didn't stop after all the data was written (I had to manually cancel the job).
processedDf
  .writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint")
  .start("s3-path-to-parquet-lake")
The following job also worked, but didn't stop after all the data was written either (I had to manually cancel the job):
val query = processedDf
  .writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint")
  .start("s3-path-to-parquet-lake")

query.awaitTermination()
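While it hangs, the only thing I can really do is poke at the query handle; isActive, status, and lastProgress are all part of the StreamingQuery API:

// Inspect the hanging query from another cell/thread
println(query.isActive)       // true while the query is still running
println(query.status)         // e.g. whether it is waiting for data or processing
println(query.lastProgress)   // most recent progress report (null if none yet)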
The following command stopped the query before any data was written:
val query = processedDf
  .writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint")
  .start("s3-path-to-parquet-lake")

query.stop()
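One workaround I'm considering is blocking with processAllAvailable() before stopping; it's on the StreamingQuery API, though the docs flag it as intended for testing. Roughly:

val query = processedDf
  .writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint")
  .start("s3-path-to-parquet-lake")

// Block until everything currently available in the source has been
// processed and committed to the sink, then shut the query down.
query.processAllAvailable()
query.stop()

I'm not sure whether that's the intended pattern, though.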
How can I configure the writeStream query to wait until all the incremental data has been written to Parquet files and then stop?