
Hi everyone,

I have a requirement to read streaming data from Azure Event Hub and dump it to a blob location. For cost reasons I cannot use either Stream Analytics or Spark Streaming; I can only go with a Spark batch job, so I need to work out how to read data from Azure Event Hub as a batch (preferably the previous day's data) and dump it to blob storage. My Event Hub retains 4 days of data, and I need to make sure I avoid duplicates every time I read from it.

I'm planning to read the data from Azure Event Hub once a day using Spark. Is there a way I can maintain some sequence (offset) across runs so that I can avoid duplicates?
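To make it concrete, something like the sketch below is what I have in mind, using the open-source azure-eventhubs-spark connector (the connector choice and all connection details are just my placeholders, and I assume an active SparkSession named spark):

import java.time.{Duration, Instant}
import java.time.temporal.ChronoUnit
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}

// Hypothetical connection details; replace with your namespace and hub.
val connectionString = ConnectionStringBuilder("Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>")
  .setEventHubName("<event-hub-name>")
  .build

// Bound the batch to the previous day by enqueued time, so each daily run
// reads a disjoint window and duplicates are avoided by construction.
val windowEnd   = Instant.now().truncatedTo(ChronoUnit.DAYS)
val windowStart = windowEnd.minus(Duration.ofDays(1))

val ehConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEnqueuedTime(windowStart))
  .setEndingPosition(EventPosition.fromEnqueuedTime(windowEnd))

// Plain batch read (no streaming), then dump to blob storage as parquet.
spark.read
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()
  .selectExpr("CAST(body AS STRING) AS body", "enqueuedTime")
  .write
  .mode("append")
  .parquet("wasbs://<container>@<account>.blob.core.windows.net/eventhub-dump/")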

Any help would be greatly appreciated.

chaitra k

3 Answers


The Azure client libraries for Event Hubs have an EventProcessor. This processor consumes events from an Event Hub and supports a checkpoint store that persists information about which events have been processed. Currently, there is one implementation of a checkpoint store, which persists checkpoint data to Azure Storage Blobs.

There is API documentation for the languages I know it is supported in, and there are also samples in the GitHub repository and the samples browser.
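For example, a minimal Scala sketch with the Java client libraries (azure-messaging-eventhubs plus azure-messaging-eventhubs-checkpointstore-blob; the connection strings and container name below are placeholders) might look like:

import com.azure.messaging.eventhubs.EventProcessorClientBuilder
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore
import com.azure.storage.blob.BlobContainerClientBuilder

// Hypothetical connection strings and names.
val checkpointContainer = new BlobContainerClientBuilder()
  .connectionString("<storage-connection-string>")
  .containerName("<checkpoint-container>")
  .buildAsyncClient()

val processor = new EventProcessorClientBuilder()
  .connectionString("<event-hubs-connection-string>", "<event-hub-name>")
  .consumerGroup("$Default")
  .checkpointStore(new BlobCheckpointStore(checkpointContainer))
  .processEvent { ctx =>
    println(ctx.getEventData.getBodyAsString) // handle the event
    ctx.updateCheckpoint()                    // record progress in blob storage
  }
  .processError(err => println(s"error: ${err.getThrowable}"))
  .buildEventProcessorClient()

processor.start()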

If you are only looking to transfer events into "a blob location", Event Hubs supports Capture into Azure Storage Blobs.

Connie Yau

If the stream processing is all about dumping events to Azure Storage, then you should consider enabling Capture instead, where the service dumps events to a storage account of your choice as they arrive: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview
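Capture writes Avro blobs, so a daily Spark batch job can simply read the previous day's capture folders. A rough sketch (the path follows Capture's default naming convention; the angle-bracket parts and the date are placeholders):

// Requires the spark-avro package; Capture's default layout is
// {Namespace}/{EventHub}/{Partition}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.
val captured = spark.read
  .format("avro")
  .load("wasbs://<container>@<account>.blob.core.windows.net/<namespace>/<hub>/*/2021/05/01/*/*/*.avro")

// The payload is carried in the binary Body column.
captured.selectExpr("CAST(Body AS STRING) AS body", "EnqueuedTimeUtc").show()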

Serkant Karaca

In brief, I achieved this with Spark Structured Streaming + Trigger.Once.

import org.apache.spark.sql.streaming.Trigger

processedDf
  .writeStream
  .trigger(Trigger.Once)                                  // process everything available, then stop
  .format("parquet")
  .option("checkpointLocation", "s3-path-to-checkpoint")  // offsets persisted here across runs
  .start("s3-path-to-parquet-lake")
chaitra k