
I am trying to write PySpark code that streams data from a Delta table and performs a merge against a final Delta target continuously, with an interval of 10-15 minutes between each cycle.
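
For context, here is a minimal sketch of the kind of merge function I pass to foreachBatch, assuming a single join key named id (the target path and the column name are placeholders):

from delta.tables import DeltaTable

def mergeToDelta(batch_df, batch_id):
    # Upsert each micro-batch into the final Delta target.
    # Assumes an existing SparkSession named `spark`; the path
    # and the join key "id" are placeholders.
    target = DeltaTable.forPath(spark, "gs://<<target_delta_table_path>>")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())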

I have written a simple PySpark script and am submitting the job from the shell using the command "spark-submit gs://<pyspark_script>.py". However, the script runs once and does not pick up the next cycle.
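
On Dataproc, I assume the equivalent submission would look roughly like this (the bucket, cluster name, region, and Delta package version are placeholders):

gcloud dataproc jobs submit pyspark gs://<bucket>/<pyspark_script>.py \
    --cluster=<cluster_name> \
    --region=<region> \
    --properties=spark.jars.packages=io.delta:delta-core_2.12:<version>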

Code sample:

(SourceDF.writeStream
  .format("delta")
  .outputMode("append")  # I have also tried "update"
  .foreachBatch(mergeToDelta)
  .option("checkpointLocation", "gs://<<path_for_the_checkpoint_location>>")
  .trigger(processingTime="10 minutes")  # I have also tried continuous="10 minutes"
  .start())
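
For reference, SourceDF is a streaming read of the source Delta table, roughly like this (the source path is a placeholder):

SourceDF = (spark.readStream
    .format("delta")
    .load("gs://<<source_delta_table_path>>"))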

How do I submit a Spark job on Dataproc in Google Cloud so that the streaming runs continuously?

Both the source and the target of the streaming job are Delta tables.

