I was digging into the Delta Lake transaction log and the metrics it stores. During that analysis, I noticed that whenever I write a Spark DataFrame as a Delta table (I am writing to Azure Data Lake Storage Gen2), a _tmp_path_dir directory is created inside the _delta_log directory, alongside the usual .crc and .json files.
I don't understand the role of this _tmp_path_dir. Is it some kind of temporary directory where data or checkpoint data is kept until the write completes?
I couldn't find any resources online explaining why it is there. Any help is highly appreciated.
Sample Spark code I used for testing, executed on Databricks 9.1 LTS with Spark 3.1.2 and Delta Lake 1.0.0:
df = spark.createDataFrame(
    [(1, 'test1', 33, 'Q1'),
     (2, 'test2', 48, 'Q1'),
     (3, 'test3', 22, 'Q1'),
     (4, 'test4', 88, 'Q2'),
     (5, 'test5', None, 'Q2'),
     (6, 'test6', 42, 'Q2')],
    'id int, test_name string, score int, quarter string'
)

# Collapse to a single partition (hashed by id) so each quarter folder gets one data file.
df_to_write = df.repartition(1, ['id'])

df_to_write.write.partitionBy('quarter').format('delta') \
    .save('abfss://test_container@test_account.dfs.core.windows.net/global/test_data/delta_test/sample1')
What I can see after writing the data to the Azure Gen2 Delta table:
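In case it is useful for reproducing this, the _delta_log directory can be listed from the notebook with something like the following (a minimal sketch; dbutils is the Databricks filesystem utility, and the path is the same one used in the write above):

# List the contents of the table's _delta_log directory.
# After the write above, it contains the commit .json file, its .crc
# checksum, and the _tmp_path_dir directory this question is about.
log_path = ('abfss://test_container@test_account.dfs.core.windows.net'
            '/global/test_data/delta_test/sample1/_delta_log')
for entry in dbutils.fs.ls(log_path):
    print(entry.name, entry.size)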