
I was digging into the Spark Delta transaction log and the metrics it stores. During that analysis, I noticed that whenever I write a Spark DataFrame as a Delta table (I am writing to Azure Gen2 storage), a _tmp_path_dir directory is created inside the _delta_log directory along with the usual .crc and .json files.
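For reference, each commit in the _delta_log is a plain JSON file of actions, so it can be inspected directly with Spark. A minimal sketch, assuming the test table path used in the sample code further below:

# Read the commit files of the Delta transaction log directly; every
# .json file in _delta_log holds one commit's actions
# (commitInfo, metaData, add, remove, ...).
log_df = spark.read.json(
    'abfss://test_container@test_account.dfs.core.windows.net/global/test_data/delta_test/sample1/_delta_log/*.json'
)
log_df.printSchema()
log_df.show(truncate=False)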

I don't understand the role of this _tmp_path_dir. Is it some kind of temporary directory where data or checkpoint data is kept until the write completes?

I couldn't find any resources online explaining why it is there. Any help is highly appreciated.
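To show what I mean, the directory can be listed with the Databricks dbutils utility (a sketch; dbutils is assumed to be available since the code below runs on Databricks):

# List _delta_log: _tmp_path_dir shows up here next to the usual
# .json and .crc files.
base = 'abfss://test_container@test_account.dfs.core.windows.net/global/test_data/delta_test/sample1'
for f in dbutils.fs.ls(base + '/_delta_log'):
    print(f.path, f.size)

# Peek into _tmp_path_dir itself to check whether any data or
# checkpoint files are kept in it.
for f in dbutils.fs.ls(base + '/_delta_log/_tmp_path_dir'):
    print(f.path, f.size)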

Sample Spark code I used for testing (executed on Databricks 9.1 LTS with Spark 3.1.2 and Delta Lake 1.0.0):

# Build a small test DataFrame with an explicit schema.
df = spark.createDataFrame(
    [(1, 'test1', 33, 'Q1'),
     (2, 'test2', 48, 'Q1'),
     (3, 'test3', 22, 'Q1'),
     (4, 'test4', 88, 'Q2'),
     (5, 'test5', None, 'Q2'),
     (6, 'test6', 42, 'Q2')
    ], 'id int, test_name string, score int, quarter string'
)

# Repartition to a single partition by id, then write as a Delta table
# partitioned by quarter.
df_to_write = df.repartition(1, 'id')
df_to_write.write.partitionBy('quarter').format('delta') \
    .save('abfss://test_container@test_account.dfs.core.windows.net/global/test_data/delta_test/sample1')
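
A quick read-back of the written table, just as a sanity check (sketch):

# Sanity check: read the written Delta table back from the same path.
df_check = spark.read.format('delta') \
    .load('abfss://test_container@test_account.dfs.core.windows.net/global/test_data/delta_test/sample1')
df_check.orderBy('id').show()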

What I can see after writing the data to the Azure Gen2 Delta Lake:

[Screenshot: listing of the _delta_log directory, showing _tmp_path_dir alongside the usual .json and .crc files]

  • I've actually never seen this before. Can you please paste your code snippet, so I can see if I can reproduce? Also the Delta Lake version you're using. – Powers Oct 04 '22 at 16:57
  • @Powers: I have added the sample code and a snapshot of the _delta_log directory to my post. – akhil pathirippilly Oct 05 '22 at 07:02

0 Answers