
I have a function to replace parquet files in ADLS Gen2:

def replace_parquet_file(df: DataFrame, path: str):
    path_new = path + '_new'
    path_old = path + '_old'

    if not file_exists(path):
        df.write.mode('overwrite').parquet(path)
    else:
        df.write.parquet(path_new)

        dbutils.fs.mv(path, path_old, recurse=True)
        dbutils.fs.mv(path_new, path, recurse=True)
        dbutils.fs.rm(path_old, recurse=True)

This function keeps failing at the second mv() operation with the following error:

 135 
    136     dbutils.fs.mv(path, path_old, recurse = True)
--> 137     dbutils.fs.mv(path_new, path, recurse = True)
    138     dbutils.fs.rm(path_old, recurse = True)
    139 

java.io.FileNotFoundException: Failed with java.io.FileNotFoundException while processing file/directory :[/tmp/myfile.parquet/_committed_6614329046368392922] in method:[Operation failed: "The specified path does not exist.", 404, PUT, https://XXX.dfs.core.windows.net/XXX/tmp/myfile.parquet/_committed_6614329046368392922?action=append&position=0&timeout=90, PathNotFound, "The specified path does not exist. ...] 

This is obviously a commit marker file written by Spark. As this is the second mv() operation, my assumption would be that Spark has already finished writing all partitions. When I look at the ADLS myself, the file clearly exists.

I do not understand what is happening here. Is Spark's write process not yet finished at the moment the mv() operation is started, or is this more of an error in the ADLS API?

Best, Jan

  • Did you try executing these `dbutils` commands separately? – Kafels Aug 18 '21 at 12:24
  • What do you mean by separately? AFAIK they are not async, so they should be executed sequentially?!? – Hanebambel Aug 18 '21 at 13:32
  • Just asking for a debugging session: run the first line and check how your ADLS looks, run the second line, do the same thing, etc. I'm suspecting the `recurse=True` moves the parent folder – Kafels Aug 18 '21 at 13:39
  • Okay, did it step by step and checked after each step. Result: it just worked! Looks more and more like an ADLS API bug or a timing problem to me – Hanebambel Aug 19 '21 at 04:39
  • @Hanebambel If you were able to resolve the issue, would you like to post it as an answer? – CHEEKATLAPRADEEP Aug 23 '21 at 08:40
  • No, I could not resolve it. I just can't reproduce it either. It happens from time to time. So my best guess is still some kind of race condition within the ADLS Gen2 API – Hanebambel Aug 25 '21 at 08:22
  • According to a colleague who talked to Databricks, this seems to be a network issue in Azure: https://learn.microsoft.com/en-gb/azure/architecture/best-practices/transient-faults – Hanebambel Sep 08 '21 at 06:13
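If the failure really is a transient fault as the linked guidance describes, the usual mitigation is to retry the failing operation with a backoff. A minimal sketch of such a retry wrapper (the `with_retry` helper is my own, not part of the Databricks API; in practice you would narrow the caught exception to the 404/FileNotFoundException seen above rather than catching every `Exception`):

```python
import time


def with_retry(operation, max_attempts=3, base_delay=1.0):
    """Call `operation` and retry on failure with exponential backoff.

    Assumes any raised Exception may be transient; narrow this to the
    specific ADLS error in real code.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical usage inside replace_parquet_file:
# with_retry(lambda: dbutils.fs.mv(path_new, path, recurse=True))
```

This does not explain the root cause, but it makes the swap resilient to the occasional 404 without changing the function's overall logic.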

0 Answers