
We are running a Delta Lake on ADLS Gen2 with plenty of tables and Spark jobs. The Spark jobs run in Databricks, and we mounted the ADLS containers into DBFS (abfss://delta@<our-adls-account>.dfs.core.windows.net/silver). There is one container for each "tier": bronze, silver, gold.

This setup has been stable for some months now, but last week we saw a sudden increase in transactions within our storage account, particularly in ListFilesystemDir operations:

[Chart: storage account transactions over time, showing the spike in ListFilesystemDir operations]

We added some smaller jobs that read and write data in that time frame, but turning them off did not bring the transaction volume back down to the old level.

Two questions regarding this:

  1. Is there some sort of documentation that explains which operation on a Delta table causes which kind of ADLS transactions?
  2. Is it possible to find out which container/directory/Spark job/... causes this amount of transactions, without turning off the Spark jobs one by one?
hbrgnr
  • are you using Structured Streaming for your jobs? – Alex Ott May 12 '21 at 14:01
  • yes, structured streaming mostly, but there are also some batch jobs – hbrgnr May 12 '21 at 14:38
  • what triggers are you using on the streaming jobs? – Alex Ott May 12 '21 at 14:40
  • you mean ".trigger(Trigger.ProcessingTime("1 minute"))" for example? none, mostly, but that's because the batch duration is generally quite large (>5 minutes per batch) – hbrgnr May 12 '21 at 14:43
  • I thought about this item: https://learn.microsoft.com/en-us/azure/databricks/release-notes/runtime/8.0#new-structured-streaming-default-trigger-interval-reduces-costs – Alex Ott May 12 '21 at 14:46
  • "reduce costs for cloud storage such as listing" - nice, that's a very good point! will review all the jobs and apply triggers, thanks a lot! – hbrgnr May 12 '21 at 14:51
  • @AlexOtt thanks! The result is amazing! List operations are down to 1/3. – Steffen Mangold May 12 '21 at 17:39
  • I'm glad that it helped. It's not necessary that you upgrade to the latest version, you can set trigger in any of versions. In the latest version it's just set as default – Alex Ott May 12 '21 at 18:09
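
For reference, a minimal sketch of what the trigger suggestion in the comments above looks like, assuming a Scala notebook on Databricks where `spark` is already defined; the account, paths, and interval are placeholders for illustration only:

    import org.apache.spark.sql.streaming.Trigger

    // Hypothetical source/sink paths, following the container-per-tier layout from the question.
    val inputPath      = "abfss://bronze@<our-adls-account>.dfs.core.windows.net/events"
    val outputPath     = "abfss://silver@<our-adls-account>.dfs.core.windows.net/events"
    val checkpointPath = "abfss://silver@<our-adls-account>.dfs.core.windows.net/checkpoints/events"

    val stream = spark.readStream
      .format("delta")
      .load(inputPath)

    // An explicit processing-time trigger batches the work into fixed intervals,
    // so the query is not continuously polling (and listing) the source directory.
    stream.writeStream
      .format("delta")
      .option("checkpointLocation", checkpointPath)
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start(outputPath)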

1 Answer


If you have Log Analytics (diagnostic logging) enabled for your data lake, you can query the logs to find the exact timestamp, caller, and target of the spike. Take that information to your Databricks cluster and open the Spark UI; there you can match the timestamps against jobs and find which notebook is causing the traffic.
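
To complement the Spark UI step, here is a minimal sketch, assuming the suspect jobs are Structured Streaming queries and you can attach a notebook to the same cluster, that lists each active query with its last progress timestamp so it can be matched against the spike timestamps from the storage logs:

    // Print name, id and last progress timestamp of every active streaming query on this cluster.
    spark.streams.active.foreach { q =>
      val progress = Option(q.lastProgress) // null until the query has completed at least one batch
      println(s"name=${q.name} id=${q.id} " +
        s"lastProgress=${progress.map(_.timestamp).getOrElse("n/a")}")
    }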

54m