
I have a requirement where I need to ingest continuous/streaming data (JSON format) from Event Hub into Azure Data Lake. I want to follow the layered approach (raw, clean, prepared) and finally store the data in Delta tables. My doubt is around the raw layer. Of the two approaches below, which one do you recommend?

  1. Event Hub -> Raw layer (raw JSON format) -> Clean layer (Delta table) -> Prepared layer (Delta table)
  2. Event Hub -> Raw layer (Delta table) -> Clean layer (Delta table) -> Prepared layer (Delta table)

So should I store the raw JSON format in the raw layer, or is it suggested to create a Delta table in the raw layer as well?

Regards,

Deepak

1 Answer


I will let others debate the theoretical approaches.

From a practical standpoint, here are the most common ways to write to disk from Event Hub:

  • Event Hub Capture dumps files to a storage account directly from an event hub, but the format is Avro. This is not practical, but it is the "rawest" form your records can take. If I remember correctly, your payload is encoded in base64 and embedded in a common schema. They have guidance on how to extract your data in Spark (see the sketch after this list).
  • Azure Stream Analytics can output JSON or Parquet. In both cases, events actually go through a deserialization/serialization process that can't be bypassed. This means the output will look raw (at least in the JSON case) but won't really be. In this scenario, ASA should be seen as a streaming ETL/ELT. Don't use it (and pay for it) if you're not actively using its features (transformation, cleaning, enrichment...). Note that ASA doesn't support Delta Lake as an output yet, so you will still need some post-processing to ingest the generated files.
  • Azure Functions, using the proper bindings. But as with ASA, it will require a deserialization that doesn't really qualify as "raw", unless you take a similar approach to what's done in EH Capture, at which point you should just use Capture.
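
For reference, here is a minimal PySpark sketch of that extraction step: reading the Avro files produced by Event Hub Capture, decoding the Body payload back to JSON, and appending the result to a Delta table (the same kind of post-processing you would need for files generated by ASA). The paths, storage account name, and payload schema below are placeholders, and it assumes a Databricks notebook where `spark` is already defined and the Avro reader is available.

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema of the JSON events sent to the Event Hub.
payload_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
])

# Capture writes Avro files in a nested folder structure
# (<namespace>/<eventhub>/<partition>/<yyyy>/<MM>/<dd>/<HH>/<mm>/<ss>.avro);
# the wildcard glob below is a placeholder for your own layout.
capture_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/capture/*/*/*/*/*/*/*/*/*.avro"

raw = spark.read.format("avro").load(capture_path)

# The Body column holds the original event bytes; cast it back to a string
# and parse the JSON to get queryable columns.
parsed = (
    raw.select(col("Body").cast("string").alias("body_json"),
               col("EnqueuedTimeUtc"))
       .withColumn("payload", from_json(col("body_json"), payload_schema))
       .select("EnqueuedTimeUtc", "payload.*")
)

# Append the parsed events to a Delta table in the clean layer.
(parsed.write
       .format("delta")
       .mode("append")
       .save("abfss://clean@<storageaccount>.dfs.core.windows.net/events"))
```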
Florian Eiden