Databricks - Reduce delta version compute time

Question

I've got a process which is really bogged down by the version computing for the target delta table.

Little bit of context - there are other things that run, all contributing uniform structured dataframes that I want to persist in a delta table. Ultimately these are all compiled into lots_of_dataframes to be logged.

for i in lots_of_dataframes:
    i.write.insertInto("target_delta_table")
    # ... takes a while to compute the version

I've gone through the documentation but couldn't find any setting to skip the version computation. I did see vacuuming, but I'm not sure that'll do the trick, since there will still be a lot of activity in a small window of time.

I know that I can union all of the dataframes together and just do the insert once, but I'm wondering if there is a more Databricks-ian way to do it. Like a configuration to only maintain 1 version at a time and not worry about computing for a restore.

score 1 · Accepted Answer · answered Nov 03 '22 at 18:18

Most probably (it's hard to say exactly without details), the problem arises from the following facts:

Spark is lazy - the actual data processing doesn't happen until you perform an action, like writing data into a destination table. So if you have a lot of transformations, etc., they will run when you write the data, which makes the write itself look slow.
You're writing data in a loop - every iteration is a separate write to the table. You can potentially speed it up a bit by doing a union of all the dataframes into a single dataframe that is written in one go:

import functools
unioned = functools.reduce(lambda x,y: x.union(y), lots_of_dataframes)
unioned.write.insertInto("target_delta_table")
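
If the dataframes might not all have their columns in the same order, a variation on the snippet above could look like the sketch below. It is only an illustration: `union_all` is a hypothetical helper name, and it swaps `union` (which matches columns by position) for `unionByName` (which matches them by name). It also fails early on an empty list, where `functools.reduce` would otherwise raise a less obvious `TypeError`.

```python
import functools


def union_all(dataframes):
    """Combine several DataFrames into one so the table gets a single write.

    `union_all` is a hypothetical helper name, not part of any Spark API.
    """
    dataframes = list(dataframes)
    if not dataframes:
        raise ValueError("need at least one DataFrame to union")
    # unionByName matches columns by name; plain union() matches by position.
    return functools.reduce(
        lambda left, right: left.unionByName(right), dataframes
    )


# Usage, assuming `lots_of_dataframes` and the table name from the question:
# union_all(lots_of_dataframes).write.insertInto("target_delta_table")
```

Note that `insertInto` itself still resolves columns by position, so the combined dataframe (which takes the column order of the first dataframe in the list) has to match the column order of the target table.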
