
New data is pushed to Event Hubs frequently, and I want to read these updates, apply transformations (joins, selects, etc.) to them, and then update an already existing Delta table. Currently I am working with a non-streaming DataFrame, with only a starting timestamp defined for Event Hubs inside eventhubParameters to limit what I read first, before setting up some kind of checkpoint:

// batch read from Event Hubs, starting from the timestamp in eventhubParameters
val df = spark.read
  .format("eventhubs")
  .options(eventhubParameters)
  .load()
// plus extracting data from body...
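
(For reference, eventhubParameters is built roughly like this with the connector's EventHubsConf; the connection string and timestamp below are placeholders:)

import java.time.Instant
import org.apache.spark.eventhubs.{EventHubsConf, EventPosition}

// placeholder connection string for the Event Hubs instance
val connectionString = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=<hub>"

// only read events enqueued after the given timestamp
val eventhubParameters = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEnqueuedTime(Instant.parse("2023-05-01T00:00:00Z")))
  .toMap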

I then apply all the transformations I want to df, and I wish to append the result to my Delta table.
My question is: how could I ensure that I only append the new data, perhaps with some kind of checkpointing? I would like to avoid Structured Streaming and the foreachBatch function, as I see that as too complex for me in Scala and unnecessary for this task as well.

  • Why are you using a non-streaming operation on streaming data? – Alex Ott May 26 '23 at 18:49
  • Could you please expand on why it is a bad solution? If I don't use readStream and have a streaming DataFrame, what is the downside? @AlexOtt – Tamás Godányi May 28 '23 at 07:29
  • The downside is that you need to track what you have already processed and what you haven't… – Alex Ott May 28 '23 at 07:58
  • And in that case, isn't there a simple solution like checkpointing I could use? My concern is that I am new to Scala and I'm having trouble figuring out the foreachBatch function and how to use it with writeStream – Tamás Godányi May 28 '23 at 08:03
  • Streaming checkpoints work with streaming DataFrames… But really, you can do joins in streaming as well; it all depends on the processing logic you need – Alex Ott May 28 '23 at 08:07
  • I was referring to non-streaming checkpoints. How can I join a streaming df and a static Delta table? – Tamás Godányi May 28 '23 at 08:18
  • Non-streaming checkpoints are used for other functionality. You can do the join just the same way as with non-streaming Delta: `sdf = spark.readStream....; bdf = spark.read....; sdf.join(bdf, ...)` (see the sketch below) – Alex Ott May 28 '23 at 10:05
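
A minimal Scala sketch of what the last comment suggests, assuming the Event Hubs connector config from the question; the schema, paths, and join key are placeholders:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType}

// placeholder schema for the JSON payload carried in the event body
val schema = new StructType()
  .add("someKey", StringType)
  .add("someValue", StringType)

// streaming read from Event Hubs, reusing the connector config from the question
val sdf = spark.readStream
  .format("eventhubs")
  .options(eventhubParameters)
  .load()
  .select(from_json(col("body").cast("string"), schema).as("data"))
  .select("data.*")

// static read of an existing Delta table (placeholder path)
val bdf = spark.read.format("delta").load("/path/to/lookup-table")

// stream-static join on a placeholder key column
val joined = sdf.join(bdf, Seq("someKey"))

// append to the target Delta table; the checkpoint directory is what
// tracks which offsets have already been processed across runs
joined.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/path/to/checkpoint")
  .trigger(Trigger.AvailableNow()) // drain what's available, then stop (Spark 3.3+)
  .start("/path/to/delta-table")

With Trigger.AvailableNow (or Trigger.Once on older Spark versions), this behaves like a scheduled batch job, but the streaming checkpoint does the offset bookkeeping, so no foreachBatch is needed for a plain append.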

0 Answers