
A source table in an SQL database gains new rows every second.

I want to run some Spark code (maybe with Structured Streaming?) once per day (it is okay if the copy is at most one day out of date) to append the rows added since the last run. The copy would be a Delta table on Databricks.

I'm not sure spark.readStream will work, since the source table is not Delta but JDBC (SQL).

Oliver Angelil

2 Answers


Structured Streaming doesn't support a JDBC source: link

If you have a strictly increasing column in your source table, you can read it in batch mode and store your progress in the userMetadata of your target Delta table: link
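A minimal batch sketch of this pattern, assuming the source has a strictly increasing integer column named `id`; the table name, JDBC options, and target path are placeholders, and `spark` is the session a Databricks notebook already provides:

```python
from delta.tables import DeltaTable

target_path = "/mnt/delta/source_table_copy"  # hypothetical Delta target

# Recover the watermark stored in the most recent commit's userMetadata.
last_id = 0
if DeltaTable.isDeltaTable(spark, target_path):
    row = (
        spark.sql(f"DESCRIBE HISTORY delta.`{target_path}` LIMIT 1")
        .select("userMetadata")
        .first()
    )
    if row and row.userMetadata:
        last_id = int(row.userMetadata)

# Batch-read only the rows added since the last run, pushing the
# filter down to the database via a JDBC subquery.
new_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")  # placeholder
    .option("dbtable", f"(SELECT * FROM source_table WHERE id > {last_id}) q")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Append and stamp the new high-watermark into this commit's userMetadata.
if new_rows.head(1):
    max_id = new_rows.agg({"id": "max"}).first()[0]
    (new_rows.write.format("delta")
        .mode("append")
        .option("userMetadata", str(max_id))
        .save(target_path))
```

DESCRIBE HISTORY returns commits newest-first, so the first row's userMetadata carries the watermark from the previous run. Note that a commit without userMetadata (e.g. an OPTIMIZE) would shadow it, so a production job would scan back for the last non-null value.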

boyangeor

You can use spark.readStream.format("delta"). You have to define a checkpoint location, which stores all the metadata related to the streaming pipeline. Say your first run streamed up to version 2 of the source table; when you restart the pipeline the next day, even if the source table is at version 10, the stream will resume from version 3.
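A short sketch of the run-once, checkpointed stream this describes, assuming the source were itself a Delta table (both paths are placeholders); trigger(availableNow=True) processes whatever is new and then stops, which fits a daily job:

```python
# Stream from a Delta source, tracking progress in the checkpoint.
(spark.readStream.format("delta")
    .load("/mnt/delta/source")                                      # Delta source table
    .writeStream.format("delta")
    .option("checkpointLocation", "/mnt/delta/_checkpoints/copy")   # stores stream progress
    .trigger(availableNow=True)                                     # drain the backlog, then stop
    .start("/mnt/delta/source_copy"))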

Tharun Kumar