
I am trying to write PySpark code that streams data from a Delta table and performs a merge against a final Delta target continuously, with an interval of 10-15 minutes between each cycle.
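
For context, here is a minimal sketch of the kind of merge function I pass to foreachBatch, assuming a single join key named id (the target path and the column name are placeholders):

from delta.tables import DeltaTable

def mergeToDelta(batch_df, batch_id):
    # Upsert each micro-batch into the final Delta target.
    # Assumes an existing SparkSession named `spark`; the path
    # and the join key "id" are placeholders.
    target = DeltaTable.forPath(spark, "gs://<<target_delta_table_path>>")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())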

I have written a simple PySpark script and am submitting the job from the shell using the command "spark-submit gs://<pyspark_script>.py". However, the script runs once and does not pick up the next cycle.
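
On Dataproc, I assume the equivalent submission would look roughly like this (the bucket, cluster name, region, and Delta package version are placeholders):

gcloud dataproc jobs submit pyspark gs://<bucket>/<pyspark_script>.py \
    --cluster=<cluster_name> \
    --region=<region> \
    --properties=spark.jars.packages=io.delta:delta-core_2.12:<version>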

Code sample:

(SourceDF.writeStream
  .format("delta")
  .outputMode("append")  # I have also tried "update"
  .foreachBatch(mergeToDelta)
  .option("checkpointLocation", "gs://<<path_for_the_checkpoint_location>>")
  .trigger(processingTime="10 minutes")  # I have also tried continuous="10 minutes"
  .start())
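
For reference, SourceDF is a streaming read of the source Delta table, roughly like this (the source path is a placeholder):

SourceDF = (spark.readStream
    .format("delta")
    .load("gs://<<source_delta_table_path>>"))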

How do I submit a Spark job on Dataproc in Google Cloud so that the streaming runs continuously?

Both the source and the target of the streaming job are Delta tables.

