
I was digging into the Spark Delta transaction log and the metrics it stores. During that analysis, I noticed that whenever I write a Spark DataFrame as a Delta table (I am writing to Azure Gen2 storage), a _tmp_path_dir directory is created inside the _delta_log directory along with the usual .crc and .json files.
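For reference, each commit in the _delta_log is a plain JSON file of actions, so it can be inspected directly with Spark. A minimal sketch, assuming the test table path used in the sample code further below:

# Read the commit files of the Delta transaction log directly; every
# .json file in _delta_log holds one commit's actions
# (commitInfo, metaData, add, remove, ...).
log_df = spark.read.json(
    'abfss://test_container@test_account.dfs.core.windows.net/global/test_data/delta_test/sample1/_delta_log/*.json'
)
log_df.printSchema()
log_df.show(truncate=False)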

I don't understand the role of this _tmp_path_dir. Is it some kind of temporary directory where data or checkpoint data is kept until the write completes?

I couldn't find any resources online explaining why it is there. Any help is highly appreciated.
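To show what I mean, the directory can be listed with the Databricks dbutils utility (a sketch; dbutils is assumed to be available since the code below runs on Databricks):

# List _delta_log: _tmp_path_dir shows up here next to the usual
# .json and .crc files.
base = 'abfss://test_container@test_account.dfs.core.windows.net/global/test_data/delta_test/sample1'
for f in dbutils.fs.ls(base + '/_delta_log'):
    print(f.path, f.size)

# Peek into _tmp_path_dir itself to check whether any data or
# checkpoint files are kept in it.
for f in dbutils.fs.ls(base + '/_delta_log/_tmp_path_dir'):
    print(f.path, f.size)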

Sample Spark code I used for testing (executed on Databricks 9.1 LTS with Spark 3.1.2 and Delta Lake 1.0.0):

# Build a small test DataFrame with an explicit schema.
df = spark.createDataFrame(
    [(1, 'test1', 33, 'Q1'),
     (2, 'test2', 48, 'Q1'),
     (3, 'test3', 22, 'Q1'),
     (4, 'test4', 88, 'Q2'),
     (5, 'test5', None, 'Q2'),
     (6, 'test6', 42, 'Q2')
    ], 'id int, test_name string, score int, quarter string'
)

# Repartition to a single partition by id, then write as a Delta table
# partitioned by quarter.
df_to_write = df.repartition(1, 'id')
df_to_write.write.partitionBy('quarter').format('delta') \
    .save('abfss://test_container@test_account.dfs.core.windows.net/global/test_data/delta_test/sample1')
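
A quick read-back of the written table, just as a sanity check (sketch):

# Sanity check: read the written Delta table back from the same path.
df_check = spark.read.format('delta') \
    .load('abfss://test_container@test_account.dfs.core.windows.net/global/test_data/delta_test/sample1')
df_check.orderBy('id').show()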

What I can see after writing the data to the Azure Gen2 Delta Lake:

[Screenshot: listing of the _delta_log directory, showing _tmp_path_dir alongside the usual .json and .crc files]

  • I've actually never seen this before. Can you please paste your code snippet, so I can see if I can reproduce? Also the Delta Lake version you're using. – Powers Oct 04 '22 at 16:57
  • @Powers: I have added the sample code and a snapshot of the _delta_log directory to my post. – akhil pathirippilly Oct 05 '22 at 07:02

0 Answers