We are seeing that our Delta writers, which append data to a Delta lake, are taking an increasingly long time to write their data. For relatively small sets of data (single-digit megabytes per write), write times eventually reach the range of minutes on an Azure Data Lake Storage Gen2 destination.
Here is a small repro case:
import pandas as pd
from deltalake.writer import write_deltalake
from time import time

t0 = time()
tp = t0
for i in range(0, 1000):
    foo = []
    bar = []
    car = []
    for j in range(0, 1000):
        foo.append(i)
        bar.append(j)
        car.append(f"{i} {j}")
    df = pd.DataFrame({"foo": foo, "bar": bar, "car": car})
    # Each iteration appends 1000 rows into a brand-new partition foo=i.
    write_deltalake('path/to/table', df, mode='append', partition_by=["foo"])
    tn = time()
    # Print the iteration, the time for this write, and the running average.
    print(i, tn - tp, (tn - t0) / (i + 1))
    tp = tn
After only a small amount of data has been written, per-write times are already roughly ten times higher than at the start. The partitioning is set up so that a fixed number of files should be touched regardless of how much data was written before: the partition column has a unique value for every write. Note that this repro uses local file access, yet it looks like Delta is somehow accessing an increasing number of files.
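If that suspicion is right, each append adds a commit file under the table's _delta_log directory, and write_deltalake has to read the existing log to reconstruct the table state before it can commit, so the per-write work grows with the number of earlier writes. A quick way to check after running the repro (a minimal diagnostic sketch of my own, assuming the same local 'path/to/table' as above; it is not part of the deltalake API):

import os

# Count the entries in the Delta transaction log. One JSON commit file is
# expected per append; if commits keep piling up without checkpoint files,
# every new write has more log to read.
log_dir = os.path.join("path/to/table", "_delta_log")
entries = os.listdir(log_dir)
json_commits = sum(1 for name in entries if name.endswith(".json"))
checkpoints = sum(1 for name in entries if "checkpoint" in name)
print(f"{len(entries)} log entries: {json_commits} JSON commits, {checkpoints} checkpoint files")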
Am I missing something fundamental here?