
I'm currently attempting to process telemetry data with a volume of around 4 TB per day using Delta Lake on Azure Databricks.

I have a dedicated Event Hubs cluster that the events are written to, and I am attempting to ingest this Event Hub into Delta Lake with Databricks Structured Streaming. There is a relatively simple job that takes the Event Hub output, extracts a few columns, and then writes with a stream writer to ADLS Gen2 storage mounted to DBFS, partitioned by date and hour.
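For context, the job looks roughly like this (a minimal sketch, not the exact code; the paths, column selection, and connection string are placeholders, and it assumes the azure-eventhubs-spark connector is installed on the cluster):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

    # The azure-eventhubs-spark connector expects an encrypted connection string.
    conn = "<EVENT_HUB_CONNECTION_STRING>"  # placeholder
    eh_conf = {
        "eventhubs.connectionString":
            spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn)
    }

    raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

    # Extract a few columns and derive the date/hour partition columns.
    events = (raw
        .select(F.col("enqueuedTime").alias("ts"),
                F.col("body").cast("string").alias("body"))
        .withColumn("date", F.to_date("ts"))
        .withColumn("hour", F.hour("ts")))

    # Write to the ADLS Gen2 mount as a Delta table partitioned by date and hour.
    (events.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/telemetry/_checkpoints/ingest")
        .partitionBy("date", "hour")
        .start("/mnt/telemetry/delta/events"))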

Initially, on a clean Delta table directory, the performance keeps up with the Event Hub, writing around 18k records a second, but after a few hours this drops to 10k a second and then falls further until it seems to stabilize around 3k records a second.

I tried a few things on the Databricks side with different partition schemes, and the day/hour partitioning seemed to perform best for the longest, but even so, after a pause and restart the performance dropped and started to lag behind the Event Hub.

I'm looking for suggestions as to how I might be able to maintain performance.

ZedZim
  • As a further step, switching to straight Parquet instead of Delta also appears to keep up with the event hub, but I would prefer to use Delta if possible. – ZedZim Feb 10 '21 at 16:48
  • You need to provide more information; look into the statistics for each batch, like how much time is required for processing, etc. Also, do you have auto optimize & auto compaction enabled on your Delta table? Look here: https://docs.databricks.com/delta/optimizations/auto-optimize.html – Alex Ott Feb 27 '21 at 09:16
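For reference, enabling the auto optimize and auto compaction settings mentioned in the comment above looks roughly like this (a minimal sketch; the table path is a placeholder, and the session-level settings are an alternative rather than a requirement):

    # Per-table: set the Delta table properties on the target table.
    spark.sql("""
        ALTER TABLE delta.`/mnt/telemetry/delta/events`
        SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact'   = 'true'
        )
    """)

    # Or per-session: turn both on for everything written in this session.
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")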

1 Answer


I had a similar issue once, and it was not Delta Lake but the Spark Azure Event Hubs connector, which was extremely slow and used up a lot of resources.

I solved this problem by switching to the Kafka interface of Azure EventHubs: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview

It's a little tricky to set up but it has been working very well for a couple of months now.
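Roughly, the Kafka-protocol read looks like this (a minimal sketch with placeholder namespace, hub name, and connection string; the kafkashaded JAAS class name is what Databricks clusters expect, and the Event Hub itself appears as a Kafka topic):

    namespace = "<EVENTHUBS_NAMESPACE>"          # placeholder
    event_hub = "<EVENT_HUB_NAME>"               # exposed as a Kafka topic
    conn_str  = "<EVENT_HUB_CONNECTION_STRING>"  # namespace-level connection string

    kafka_df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", f"{namespace}.servicebus.windows.net:9093")
        .option("subscribe", event_hub)
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config",
                'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
                f'username="$ConnectionString" password="{conn_str}";')
        .load())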

hbrgnr
  • Can confirm this observation – the Kafka connector works well with Event Hubs (you need the Standard tier), plus it's more flexible – Alex Ott Mar 17 '21 at 07:30