
New data is pushed to Event Hubs frequently, and I want to read these updates, apply transformations (joins, selects, etc.) to them, and then update an already existing Delta table. Currently I am working with a non-streaming DataFrame, with only a starting timestamp defined for Event Hubs inside eventhubParameters to limit what I read first, before setting up some kind of checkpoint:

// batch read from Event Hubs, starting from the timestamp in eventhubParameters
val df = spark.read
  .format("eventhubs")
  .options(eventhubParameters)
  .load()
// plus extracting data from body...
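
(For reference, eventhubParameters is built roughly like this with the connector's EventHubsConf; the connection string and timestamp below are placeholders:)

import java.time.Instant
import org.apache.spark.eventhubs.{EventHubsConf, EventPosition}

// placeholder connection string for the Event Hubs instance
val connectionString = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=<hub>"

// only read events enqueued after the given timestamp
val eventhubParameters = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEnqueuedTime(Instant.parse("2023-05-01T00:00:00Z")))
  .toMap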

I then apply all the transformations I want to df, and I wish to append the result to my Delta table.
My question is: how could I ensure that I only append the new data, perhaps with some kind of checkpointing? I would like to avoid Structured Streaming and the foreachBatch function, as I see that as too complex for me in Scala and unnecessary for this task as well.

  • Why are you using a non-streaming operation on streaming data? – Alex Ott May 26 '23 at 18:49
  • Could you please expand on why it is a bad solution? If I don't use readStream and have a streaming DataFrame, what is the downside? @AlexOtt – Tamás Godányi May 28 '23 at 07:29
  • The downside is that you need to track what you have already processed and what you haven't… – Alex Ott May 28 '23 at 07:58
  • And in that case, isn't there a simple solution like checkpointing I could use? My concern is that I am new to Scala and I'm having trouble figuring out the foreachBatch function and how to use it with writeStream – Tamás Godányi May 28 '23 at 08:03
  • Streaming checkpoints work with streaming DataFrames… But really, you can do joins in streaming as well; it all depends on the processing logic you need – Alex Ott May 28 '23 at 08:07
  • I was referring to non-streaming checkpoints. How can I join a streaming df and a static Delta table? – Tamás Godányi May 28 '23 at 08:18
  • Non-streaming checkpoints are used for other functionality. You can do the join just the same way as with non-streaming Delta: `sdf = spark.readStream....; bdf = spark.read....; sdf.join(bdf, ...)` (see the sketch below) – Alex Ott May 28 '23 at 10:05
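
A minimal Scala sketch of what the last comment suggests, assuming the Event Hubs connector config from the question; the schema, paths, and join key are placeholders:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType}

// placeholder schema for the JSON payload carried in the event body
val schema = new StructType()
  .add("someKey", StringType)
  .add("someValue", StringType)

// streaming read from Event Hubs, reusing the connector config from the question
val sdf = spark.readStream
  .format("eventhubs")
  .options(eventhubParameters)
  .load()
  .select(from_json(col("body").cast("string"), schema).as("data"))
  .select("data.*")

// static read of an existing Delta table (placeholder path)
val bdf = spark.read.format("delta").load("/path/to/lookup-table")

// stream-static join on a placeholder key column
val joined = sdf.join(bdf, Seq("someKey"))

// append to the target Delta table; the checkpoint directory is what
// tracks which offsets have already been processed across runs
joined.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/path/to/checkpoint")
  .trigger(Trigger.AvailableNow()) // drain what's available, then stop (Spark 3.3+)
  .start("/path/to/delta-table")

With Trigger.AvailableNow (or Trigger.Once on older Spark versions), this behaves like a scheduled batch job, but the streaming checkpoint does the offset bookkeeping, so no foreachBatch is needed for a plain append.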

0 Answers