
I have a job scheduled to run daily in an Azure Databricks notebook, and its output is written to a Parquet file in Databricks.

I am creating an Azure Event Hub to which the daily output of the Parquet table will be uploaded.

My question: say on day 1 the data is uploaded to the Event Hub. On day 2, when I try to upload again, only day 2's data should be appended; it should not upload day 1's and day 2's data together again.

Can you help me with the sample code?

Saswat Ray

1 Answer


The simplest way to achieve this is to use Spark Structured Streaming with Trigger.Once. It tracks which files have changed since the last invocation and processes only those changes (although this may depend on how the changes are made, e.g. overwrite vs. append, etc.).

In the simplest case it could be as simple as the following (in Python, but the Scala version will be almost identical; for the Event Hubs connector parameters, etc., see its docs):

# Connection string for the target Event Hub (newer versions of the Azure
# Event Hubs Spark connector require it to be encrypted via EventHubsUtils.encrypt)
writeConnectionString = "YOUR.EVENTHUB.CONNECTION.STRING"
ehWriteConf = {
  'eventhubs.connectionString' : writeConnectionString
}

# File-based streaming sources need an explicit schema (or
# spark.sql.streaming.schemaInference enabled), and the Event Hubs sink
# expects the payload in a string/binary column named "body".
# The checkpoint is what lets Spark send only the files added since the
# last run; trigger(once=True) processes them and then stops.
spark.readStream \
  .format("parquet") \
  .load("/path/to/data") \
  .writeStream \
  .format("eventhubs") \
  .options(**ehWriteConf) \
  .option("checkpointLocation", "/path/to_checkpoint") \
  .trigger(once=True) \
  .start()

The actual solution could be more complex, as it depends on additional requirements that weren't described.

P.S. I would really recommend using Delta as the file format instead of Parquet.
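
For example, a minimal sketch of the same pipeline with a Delta table as the source (illustrative only: the table path, the checkpoint path, and the to_json/struct conversion that builds the required "body" column are assumptions, and ehWriteConf is the dictionary defined above):

from pyspark.sql.functions import to_json, struct

# Delta streaming sources carry their own schema, so no explicit schema is needed.
# The Event Hubs sink expects the payload in a string/binary column named "body".
spark.readStream \
  .format("delta") \
  .load("/path/to/delta_table") \
  .select(to_json(struct("*")).alias("body")) \
  .writeStream \
  .format("eventhubs") \
  .options(**ehWriteConf) \
  .option("checkpointLocation", "/path/to_checkpoint_delta") \
  .trigger(once=True) \
  .start()

With Delta you also get the table's transaction history, which makes it easier to reason about what each daily run appended.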

Alex Ott
  • My requirement is not continuous streaming. The Delta table that needs to be uploaded to the Event Hub generates chunks of data daily, and I need to upload it on a daily basis. My question is how I can limit the data sent each day. For example, say the table generates 15 rows each day: on day 1 I upload all 15 rows, but on day 2 I need to upload only that day's 15 fresh rows, not the whole 30 rows. I know there is a mechanism to control this on the receiving side, but how can we do it on the sending side in Python? – Saswat Ray Oct 13 '21 at 16:02
  • What I proposed isn't continuous streaming, but a stream with the Trigger.Once option, which behaves like a batch job while tracking what you have already processed. It takes the changes since the last invocation, processes them, and finishes after that. The main benefit is that Spark tracks for you what was already processed and what wasn't... – Alex Ott Oct 13 '21 at 16:06
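
To make that comment concrete, a hedged sketch of what the daily scheduled run could look like (it reuses the paths and the ehWriteConf dictionary from the answer, and the same schema/"body"-column caveats apply): each invocation sends only what was appended since the previous run, because the checkpoint records what was already processed, and then the stream stops so the job can finish.

# Hypothetical daily job: start the trigger-once stream and wait for it to finish.
query = spark.readStream \
  .format("parquet") \
  .load("/path/to/data") \
  .writeStream \
  .format("eventhubs") \
  .options(**ehWriteConf) \
  .option("checkpointLocation", "/path/to_checkpoint") \
  .trigger(once=True) \
  .start()

# Returns after the day's new files are sent; if nothing new arrived since the
# last run, nothing is sent to the Event Hub.
query.awaitTermination()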