I have a function to replace Parquet files in ADLS Gen2:
from pyspark.sql import DataFrame

def replace_parquet_file(df: DataFrame, path: str):
    path_new = path + '_new'
    path_old = path + '_old'
    if not file_exists(path):
        # No existing data: write directly to the target path.
        df.write.mode('overwrite').parquet(path)
    else:
        # Write the new data next to the old, then swap and clean up.
        df.write.parquet(path_new)
        dbutils.fs.mv(path, path_old, recurse=True)
        dbutils.fs.mv(path_new, path, recurse=True)
        dbutils.fs.rm(path_old, recurse=True)
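(file_exists is a small helper of my own, not shown above; it is roughly this sketch, relying on dbutils.fs.ls raising an exception for a missing path:)

def file_exists(path: str) -> bool:
    # Sketch of my helper: dbutils.fs.ls raises if the path does not exist.
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False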
This function keeps failing at the second mv() operation with the following error:
135
136 dbutils.fs.mv(path, path_old, recurse = True)
--> 137 dbutils.fs.mv(path_new, path, recurse = True)
138 dbutils.fs.rm(path_old, recurse = True)
139
java.io.FileNotFoundException: Failed with java.io.FileNotFoundException while processing file/directory :[/tmp/myfile.parquet/_committed_6614329046368392922] in method:[Operation failed: "The specified path does not exist.", 404, PUT, https://XXX.dfs.core.windows.net/XXX/tmp/myfile.parquet/_committed_6614329046368392922?action=append&position=0&timeout=90, PathNotFound, "The specified path does not exist. ...]
The file in question is evidently one of the files Spark writes as part of committing the output. Since this is the second mv() operation, my assumption is that Spark has already finished writing all partitions by that point. When I inspect ADLS myself, the file clearly exists.
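For reference, this is roughly how I verify that (a sketch; the literal path is the one from the error message above):

# Sketch: list the directory named in the error to confirm the file is there.
for info in dbutils.fs.ls('/tmp/myfile.parquet'):
    print(info.path)
# The _committed_* file shows up in this listing.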
I do not understand what is happening here. Is Spark's write process not yet finished at the moment the mv() operation starts, or is this more of an error in the ADLS API?
Best, Jan