I am trying to write a PySpark job that streams data from a Delta table and continuously merges it into a final Delta target, with an interval of 10-15 minutes between cycles.
I have written a simple PySpark script and submit the job with "spark-submit gs://<pyspark_script>.py". However, the script runs once and does not pick up the next cycle.
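For context, the Dataproc-equivalent submission looks roughly like the following (the cluster name, region, bucket path, and Delta package version are placeholders/assumptions, not my exact values):

gcloud dataproc jobs submit pyspark gs://<bucket>/<pyspark_script>.py \
    --cluster=<cluster_name> \
    --region=<region> \
    --properties=spark.jars.packages=io.delta:delta-core_2.12:2.1.0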
Code sample:

(SourceDF.writeStream
    .format("delta")
    .outputMode("append")  # I have also tried "update"
    .foreachBatch(mergeToDelta)
    .option("checkpointLocation", "gs://<path_for_the_checkpoint_location>")
    .trigger(processingTime="10 minutes")  # I have also tried trigger(continuous="10 minutes")
    .start())
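For reference, mergeToDelta upserts each micro-batch into the final target. A minimal sketch of that function, assuming a path-based Delta target and an "id" join key (both are placeholders, not my real values):

from delta.tables import DeltaTable

def mergeToDelta(microBatchDF, batchId):
    # Upsert the current micro-batch into the final Delta target table.
    target = DeltaTable.forPath(microBatchDF.sparkSession, "gs://<target_delta_path>")
    (target.alias("t")
        .merge(microBatchDF.alias("s"), "t.id = s.id")  # placeholder join key
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())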
How do I submit Spark jobs on Dataproc in Google Cloud so that the stream runs continuously?
Both the source and the target of the streaming job are Delta tables.