
A source table in an SQL database gains new rows every second.

I want to run some Spark code (maybe with Structured Streaming?) once per day (it is okay if the copy is at most one day out of date) to append the rows added since the last run. The copy would be a Delta table on Databricks.

I'm not sure spark.readStream will work, since the source table is not Delta but JDBC (SQL).

Oliver Angelil

2 Answers


Structured Streaming doesn't support a JDBC source: link

If you have a strictly increasing column in your source table, you can read it in batch mode and store your progress in the userMetadata of your target Delta table: link
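A minimal batch sketch of this pattern, assuming the source has a strictly increasing integer column named `id`; the table name, JDBC options, and target path are placeholders, and `spark` is the session a Databricks notebook already provides:

```python
from delta.tables import DeltaTable

target_path = "/mnt/delta/source_table_copy"  # hypothetical Delta target

# Recover the watermark stored in the most recent commit's userMetadata.
last_id = 0
if DeltaTable.isDeltaTable(spark, target_path):
    row = (
        spark.sql(f"DESCRIBE HISTORY delta.`{target_path}` LIMIT 1")
        .select("userMetadata")
        .first()
    )
    if row and row.userMetadata:
        last_id = int(row.userMetadata)

# Batch-read only the rows added since the last run, pushing the
# filter down to the database via a JDBC subquery.
new_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")  # placeholder
    .option("dbtable", f"(SELECT * FROM source_table WHERE id > {last_id}) q")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Append and stamp the new high-watermark into this commit's userMetadata.
if new_rows.head(1):
    max_id = new_rows.agg({"id": "max"}).first()[0]
    (new_rows.write.format("delta")
        .mode("append")
        .option("userMetadata", str(max_id))
        .save(target_path))
```

DESCRIBE HISTORY returns commits newest-first, so the first row's userMetadata carries the watermark from the previous run. Note that a commit without userMetadata (e.g. an OPTIMIZE) would shadow it, so a production job would scan back for the last non-null value.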

boyangeor

You can use spark.readStream.format("delta"). You have to define a checkpoint location, which stores all the metadata related to the streaming pipeline. Say your first run streamed up to version 2 of the source table; when you restart the pipeline the next day, even if the source table is at version 10, the stream will resume from version 3.
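A short sketch of the run-once, checkpointed stream this describes, assuming the source were itself a Delta table (both paths are placeholders); trigger(availableNow=True) processes whatever is new and then stops, which fits a daily job:

```python
# Stream from a Delta source, tracking progress in the checkpoint.
(spark.readStream.format("delta")
    .load("/mnt/delta/source")                                      # Delta source table
    .writeStream.format("delta")
    .option("checkpointLocation", "/mnt/delta/_checkpoints/copy")   # stores stream progress
    .trigger(availableNow=True)                                     # drain the backlog, then stop
    .start("/mnt/delta/source_copy"))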

Tharun Kumar