New data is pushed to Event Hub frequently, and I want to read these updates, apply transformations (joins, selects, ...) to them, and then update an already existing Delta table.
Currently I am working only with a non-streaming DataFrame, with just a starting timestamp defined for Event Hub inside eventhubParameters to limit what I read first, before setting up some kind of checkpoint:
val df = spark.read
.format("eventhubs")
.options(eventhubParameters)
.load()
// plus extracting data from body...
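For context, eventhubParameters can be built with the azure-event-hubs-spark connector roughly as below; the connection string and timestamp are placeholders, and the last line shows how the binary body column can be pulled out (continuing from the df above):

import java.time.Instant
import org.apache.spark.eventhubs.{EventHubsConf, EventPosition}
import org.apache.spark.sql.functions.col

// Placeholder connection string; EntityPath names the hub.
val connectionString = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=<hub>"

// Start the batch read from a fixed enqueued time, as described above.
val eventhubParameters = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEnqueuedTime(Instant.parse("2024-01-01T00:00:00Z")))
  .toMap

// The connector returns a binary body column plus metadata such as enqueuedTime;
// casting body to a string is the usual first step before parsing it further.
val events = df.select(col("body").cast("string").as("body"), col("enqueuedTime"))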
I then apply all the transformations I want to df, and I want to append the result to my Delta table.
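The append itself would just be a batch write in append mode; transformedDf and the Delta table path below are placeholders:

// Hypothetical final result of the transformations, appended to an existing Delta table.
transformedDf.write
  .format("delta")
  .mode("append")
  .save("/mnt/delta/my_table")  // or .saveAsTable("my_table") for a metastore table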
My question is: how can I ensure that I only append the new data, perhaps with some kind of checkpointing? I would like to avoid Structured Streaming and the foreachBatch function, as that seems too complex for me in Scala and unnecessary for this task.
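To make the checkpointing idea a bit more concrete, what I have in mind is something like persisting the newest enqueuedTime that was appended and using it as the starting position of the next run (just a sketch; transformedDf and the enqueuedTime column are assumed from above):

import org.apache.spark.sql.functions.max

// After a successful append, remember the newest event that was processed...
val lastEnqueued = transformedDf
  .agg(max("enqueuedTime"))
  .head()
  .getTimestamp(0)
  .toInstant

// ...store it somewhere durable (a small file, a config table, ...) and, on the next run,
// build the starting position from it: EventPosition.fromEnqueuedTime(lastEnqueued)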