When we use Spark to write out files to AWS S3 or Azure Blob Storage, we can simply write:
df.write.parquet("/online/path/folder")
The contents are then written out as hundreds of files under the specified folder, like this:
/online/path/folder/f-1
/online/path/folder/f-2
...
/online/path/folder/f-100
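
For reference, here is a minimal runnable sketch of what I mean (the bucket name, path, app name, and DataFrame contents below are just placeholders); in practice the output files are named something like part-00000-<uuid>.snappy.parquet rather than f-1:

from pyspark.sql import SparkSession

# Placeholder session and DataFrame, only to reproduce the write pattern.
spark = SparkSession.builder.appName("parquet-write-example").getOrCreate()
df = spark.range(1_000_000).toDF("value")

# Each task writes its own part file under the target folder, e.g.
#   s3a://my-bucket/online/path/folder/part-00000-<uuid>.snappy.parquet
#   s3a://my-bucket/online/path/folder/part-00001-<uuid>.snappy.parquet
df.write.mode("overwrite").parquet("s3a://my-bucket/online/path/folder")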
My question is: since the write is executed on tens or hundreds of Spark executors simultaneously, how do they avoid writing to the same file? Another important question: what happens if an executor fails and is restarted? Will the restarted executor write to the same file it was writing to before it failed?