
I was working with streaming from Delta tables with foreachBatch.

spark.readStream.format("delta") \
  .option("readChangeFeed", "true").option("startingVersion","latest") \
  .load("dbfs:/mnt/cg/file_to_up/table2") \
  .writeStream.queryName("myquery") \
  .foreachBatch(merge_deltas) \
  .option("checkpointLocation", "dbfs:/mnt/cg/tmp/checkpoint") \
  .trigger(processingTime='2 seconds') \
  .start()

I would like to know if there is a way to do a merge between a Delta table (source) and a Synapse table (target). The page I found here, in the section "Write to Azure Synapse Analytics using foreachBatch() in Python", is the closest thing to what I need, but I think that using the write modes (append, overwrite) will not help me.

Also, I know that this could be simulated using insert and delete statements in the preActions and postActions options, but if there is a way to do it with a merge, that would be great.
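
For context, this is a rough, untested sketch of how I imagine my merge_deltas function would look if I simulated the merge with that pre/post-action workaround and the com.databricks.spark.sqldw connector; the table names, key column, JDBC URL, and tempDir below are placeholders, not my real values:

def merge_deltas(batch_df, batch_id):
    # Load the micro-batch into a staging table, then run delete + insert
    # against the real target as a post action once the load finishes.
    post_actions = """
        DELETE FROM dbo.target_table
        WHERE id IN (SELECT id FROM dbo.staging_table);
        INSERT INTO dbo.target_table
        SELECT * FROM dbo.staging_table;
    """
    (batch_df.write
        .format("com.databricks.spark.sqldw")
        .option("url", "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<db>")
        .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tmp")
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("dbTable", "dbo.staging_table")
        .option("postActions", post_actions)
        .mode("overwrite")
        .save())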

BryC

1 Answer


You can use a combination of merge and foreachBatch (see the foreachBatch documentation for more information) to write complex upserts from a streaming query into a Delta table. For example:

• Write streaming aggregates in Update Mode: This is much more efficient than Complete Mode.

• Write a stream of database changes into a Delta table: The merge query for writing change data can be used in foreachBatch to continuously apply a stream of changes to a Delta table (see the sketch after this list).

• Write stream data into a Delta table with deduplication: The insert-only merge query for deduplication can be used in foreachBatch to continuously write data (with duplicates) to a Delta table with automatic deduplication.
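
As an illustration of the change-data case above, here is a minimal sketch of a merge inside foreachBatch against a Delta table; the target path, key column, and function name are illustrative, not taken from the question:

from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # Merge each micro-batch into the target Delta table, keyed on `id`.
    target = DeltaTable.forPath(spark, "dbfs:/mnt/cg/target_table")
    (target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

It plugs into the streaming query exactly like merge_deltas in the question, i.e. .foreachBatch(upsert_to_delta).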

For more information, refer to the official documentation.

Note -

• Make sure that your merge statement inside foreachBatch is idempotent, as restarts of the streaming query can apply the operation on the same batch of data multiple times.

• When merge is used in foreachBatch, the input data rate of the streaming query (reported through StreamingQueryProgress and visible in the notebook rate graph) may be reported as a multiple of the actual rate at which data is generated at the source. This is because merge reads the input data multiple times causing the input metrics to be multiplied. If this is a bottleneck, you can cache the batch DataFrame before merge and then uncache it after merge.
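
A sketch of that caching suggestion, reusing the illustrative upsert function from above (path and key column are still placeholders):

from delta.tables import DeltaTable

def upsert_to_delta_cached(micro_batch_df, batch_id):
    # Cache the batch so merge's multiple reads don't hit the source again
    # and don't inflate the reported input rate.
    micro_batch_df.persist()
    try:
        (DeltaTable.forPath(spark, "dbfs:/mnt/cg/target_table")
            .alias("t")
            .merge(micro_batch_df.alias("s"), "t.id = s.id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())
    finally:
        micro_batch_df.unpersist()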

Abhishek K