
I have created a Delta table and now I'm trying to merge data into that table using foreachBatch(). I've followed this example. I am running this code on a Dataproc image 1.5.x cluster in Google Cloud.

Spark version: 2.4.7, Delta version: 0.6.0
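
For context, Delta 0.6.x pairs with Spark 2.4 on Scala 2.11, so the matching package must be on the classpath. A minimal sketch of pulling it in through the session builder, assuming the job is not launched with --packages on the command line (the coordinates are the published ones for this version pairing, not taken from the post):

from pyspark.sql import SparkSession

# Sketch: request the Delta Lake artifact that matches Spark 2.4 /
# Scala 2.11 when creating the session, instead of passing --packages.
spark = (SparkSession.builder
    .appName("streaming_merge")
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .getOrCreate())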

My code looks as follows:

from pyspark.sql import SparkSession
from delta.tables import *

spark = SparkSession.builder \
    .appName("streaming_merge") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
  
# Function to upsert `microBatchOutputDF` into Delta table using MERGE
def mergeToDelta(microBatchOutputDF, batchId):
  (deltaTable.alias("accnt").merge(
       microBatchOutputDF.alias("updates"),
       "accnt.acct_nbr = updates.acct_nbr")
     .whenMatchedDelete(condition="updates.cdc_ind='D'")
     .whenMatchedUpdateAll(condition="updates.cdc_ind='U'")
     .whenNotMatchedInsertAll(condition="updates.cdc_ind!='D'")
     .execute())

deltaTable = DeltaTable.forPath(spark, "gs://<<path_for_the_target_delta_table>>")

# Define the source extract
SourceDF = (
  spark.readStream
    .format("delta")
    .load("gs://<<path_for_the_source_delta_location>>")
)

# Start the query to continuously upsert into target tables in update mode
SourceDF.writeStream \
  .format("delta") \
  .outputMode("update") \
  .foreachBatch(mergeToDelta) \
  .option("checkpointLocation", "gs://<<path_for_the_checkpoint_location>>") \
  .trigger(once=True) \
  .start()

This code runs without any errors, but no data is written to the Delta table. I suspect foreachBatch is not getting invoked. Does anyone know what I'm doing wrong?

  • if you already ran your code before, then the last position is stored in the checkpoint, and until the upstream changes you won't get new changes... If you want to reprocess everything, try removing the checkpoint. Also, add logging to the foreachBatch function (see the sketch after these comments) – Alex Ott Feb 14 '21 at 09:17
  • When I execute the same steps individually in the spark shell I could see foreachBatch getting invoked. However, when I run the script from the terminal it is not invoked. Just not sure if I am doing this right. I have tried 1. gcloud dataproc jobs submit pyspark --cluster=xxxxx --region=xxxx gs://>.py 2. directly SSHing to the Dataproc cluster and triggering it with "spark-submit gs://>.py" – Rak Feb 17 '21 at 15:17
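
A minimal way to confirm the invocation, following the logging suggestion in the comments (the print and count calls are illustrative additions, not part of the original code):

def mergeToDelta(microBatchOutputDF, batchId):
  # Illustrative debug logging: confirms the function runs under
  # spark-submit and shows how many rows each micro-batch carries.
  print("foreachBatch invoked: batchId=%s rows=%s"
        % (batchId, microBatchOutputDF.count()))
  (deltaTable.alias("accnt").merge(
       microBatchOutputDF.alias("updates"),
       "accnt.acct_nbr = updates.acct_nbr")
     .whenMatchedDelete(condition="updates.cdc_ind='D'")
     .whenMatchedUpdateAll(condition="updates.cdc_ind='U'")
     .whenNotMatchedInsertAll(condition="updates.cdc_ind!='D'")
     .execute())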

1 Answer


After adding awaitTermination(), the streaming query started working: it picked up the latest data from the source and performed the merge on the Delta target table. Without it, a script launched with spark-submit can reach the end and exit before the trigger-once micro-batch completes, which is why the same steps appeared to work in the interactive shell (the REPL keeps the driver alive) but not when submitted as a job.
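
For reference, a minimal sketch of the fix applied to the final statement of the question (paths remain the placeholders from the question):

query = SourceDF.writeStream \
  .format("delta") \
  .outputMode("update") \
  .foreachBatch(mergeToDelta) \
  .option("checkpointLocation", "gs://<<path_for_the_checkpoint_location>>") \
  .trigger(once=True) \
  .start()

# Block the driver until the trigger-once query finishes; without this,
# the script can exit before the micro-batch executes.
query.awaitTermination()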
