What am I trying to do?
We use PySpark in our project and want to store our data in Amazon S3,
but writing to S3 with PySpark using pyspark.sql.DataFrame.write with mode="overwrite"
does not overwrite the data in S3 correctly if there is already an object at the URL that PySpark writes to.
Steps to reproduce this behavior:
# 0. Import libraries and initialize spark
import pandas as pd
import awswrangler as wr
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.profile.ProfileCredentialsProvider') \
    .config('spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled', 'true') \
    .getOrCreate()
# Output url for PySpark, use the `s3a://` scheme, as it's supported by Hadoop
output_url = f's3a://{MY_BUCKET}/test.csv'
# Output url with the same path for awswrangler, use the `s3://` scheme, as it's supported by awswrangler
wr_output_url = output_url.replace('s3a:', 's3:')
# 1. Write something to the output url using pyspark
df = spark.createDataFrame([{'Key': 'OldFoo'}, {'Key': 'OldBar'}], ['Key'])
df.write.csv(output_url)
# 2. Write something to the output url using awswrangler
wr.s3.to_csv(pd.DataFrame([{'SomeKey': 'SomeValue'} ]), wr_output_url)
# 3. Write our data to the output url using pyspark with mode('overwrite')
df = spark.createDataFrame([{'Key': 'Foo'}, {'Key': 'Bar'}], ['Key'])
df.write.mode('overwrite').csv(output_url)
# 4. Read data from this output url
spark.read.csv(output_url).show()
# Expected output of this show() is the data that we have written in the last write.mode('overwrite')
# (and no other data, because 'overwrite' was used), i.e. a DataFrame of two rows: ['Foo', 'Bar']
# Actual output is:
# +------+
# | _c0|
# +------+
# |OldFoo|
# |OldBar|
# | Foo|
# | Bar|
# +------+
From the actual output one can see both the data that we have just written (using mode('overwrite'))
and some other data (the data that was written by the previous PySpark write()).
This means that we cannot be sure this way that our data has been written properly, and that is a problem.
(We use pyspark==3.3.2 with Hadoop 3.3.2 and run it via
spark-submit --packages=org.apache.hadoop:hadoop-aws:3.3.2,org.apache.spark:spark-hadoop-cloud_2.13:3.3.2,com.amazonaws:aws-java-sdk-bundle:1.11.655 ~/test.py
)
What have I tried to research?
Look at the S3 content before and after the write attempts ("files under files" do not seem to be deleted properly)
If I understand correctly, this has to do with a specific property of S3,
namely that S3 is a key-value store and not a filesystem,
which is not taken into account by PySpark's write.mode('overwrite').
Here is the content of our S3 output url after each write:
# 1. Write something to the output url using pyspark
df = spark.createDataFrame([{'Key': 'OldFoo'}, {'Key': 'OldBar'}], ['Key'])
df.write.csv(output_url)
print('\n'.join(wr.s3.list_objects(wr_output_url+'*')))
# prints:
# s3://<MY_BUCKET>/test.csv/_SUCCESS
# s3://<MY_BUCKET>/test.csv/part-00000-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00001-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00003-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# 2. Write something to the output url using awswrangler
wr.s3.to_csv(pd.DataFrame([{'SomeKey': 'SomeValue'} ]), wr_output_url)
print('\n'.join(wr.s3.list_objects(wr_output_url+'*')))
# prints:
# s3://<MY_BUCKET>/test.csv
# s3://<MY_BUCKET>/test.csv/_SUCCESS
# s3://<MY_BUCKET>/test.csv/part-00000-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00001-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00003-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# 3. Write our data to the output url using pyspark with mode('overwrite')
df = spark.createDataFrame([{'Key': 'Foo'}, {'Key': 'Bar'}], ['Key'])
df.write.mode('overwrite').csv(output_url)
print('\n'.join(wr.s3.list_objects(wr_output_url+'*')))
# prints:
# s3://<MY_BUCKET>/test.csv/_SUCCESS
# s3://<MY_BUCKET>/test.csv/part-00000-503a773b-4f7d-4089-9bce-f87bf56eb3df-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00000-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00001-503a773b-4f7d-4089-9bce-f87bf56eb3df-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00001-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00003-503a773b-4f7d-4089-9bce-f87bf56eb3df-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00003-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
After the write.mode('overwrite') there are both the new files
s3://<MY_BUCKET>/test.csv/part-*-503a773b-*
and the old files
s3://<MY_BUCKET>/test.csv/part-*-52358825-*
(which seem to contain the OldFoo data); that means the old files were not deleted properly despite the mode('overwrite').
Before the write.mode('overwrite') is called, there is both the object
s3://<MY_BUCKET>/test.csv
and some objects under this object, for example
s3://<MY_BUCKET>/test.csv/part-00000-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
(these seem to be the files written by the previous PySpark write to this URL),
and write.mode('overwrite') seems to delete only the object s3://<MY_BUCKET>/test.csv
and not the objects under it.
(If write.mode('overwrite') is called after another PySpark write, when there are some files in the "directory" s3://<MY_BUCKET>/test.csv/
but no object s3://<MY_BUCKET>/test.csv, it seems to run correctly; a minimal check of this is sketched below.)
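One way to check this, starting from the state after step 2 above, is to delete only the plain test.csv object (the one created by awswrangler) and repeat the overwrite. This is my own check, not anything from the Spark or Hadoop docs; boto3 is used here just to delete that single key, and MY_BUCKET is the same placeholder as above.
import boto3
# remove only the plain `test.csv` object, leaving the old part-* files in place
boto3.client('s3').delete_object(Bucket=MY_BUCKET, Key='test.csv')
df = spark.createDataFrame([{'Key': 'Foo'}, {'Key': 'Bar'}], ['Key'])
df.write.mode('overwrite').csv(output_url)
print('\n'.join(wr.s3.list_objects(wr_output_url + '*')))
# per the observation above, I'd expect only _SUCCESS and the new part-* files to be listed now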
In a "normal" file system such files under a file are impossible, however S3 (as a key-value storage) allows this, therefore it is to expect that libraries working with S3 can deal with it.
Look at the Hadoop sources
I'm not sure how exactly pyspark.sql.DataFrame.write uses the Hadoop libraries,
but if it uses
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/DeleteOperation.java#L166
, the handling of files/directories there looks similar to the logic I have described above (no handling of files under files):
if the object is a directory, then delete its contents;
otherwise delete it as a "simple file", i.e. without deleting the files under it.
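To make that branching concrete, here is a rough, hypothetical Python sketch of the logic as I read it. This is only my paraphrase for illustration (using boto3, not the actual S3A internals), so the real Hadoop code may well differ.
import boto3
from botocore.exceptions import ClientError

def delete_as_i_understand_it(bucket: str, key: str) -> None:
    """Illustration of the 'directory vs simple file' branching described above."""
    s3 = boto3.client('s3')
    try:
        s3.head_object(Bucket=bucket, Key=key)
        exists_as_plain_object = True
    except ClientError:
        exists_as_plain_object = False

    if exists_as_plain_object:
        # "simple file" branch: delete only this single key; objects that merely
        # share the "key/" prefix (our old part-* files) are left untouched
        s3.delete_object(Bucket=bucket, Key=key)
    else:
        # "directory" branch: delete everything under the "key/" prefix
        prefix = key.rstrip('/') + '/'
        for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
            s3.delete_object(Bucket=bucket, Key=obj['Key'])
With both the plain test.csv object and the part-* files present (the state after step 2), this sketch takes the "simple file" branch and leaves the part files in place, which matches what I observe.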
Dirty workaround
Instead of using write.mode('overwrite') I could delete the files under the output URL (if there are any) before each write (i.e. delete them with awswrangler.s3.delete_objects), but this doesn't look like best practice.
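A minimal sketch of that workaround, reusing the variables from the snippets above (awswrangler's delete_objects accepts a list of object paths, so it can be fed the output of list_objects):
# clear whatever is currently under the output prefix (including a possible
# plain `test.csv` object), then write without relying on mode('overwrite')
existing = wr.s3.list_objects(wr_output_url + '*')
if existing:
    wr.s3.delete_objects(existing)
df.write.csv(output_url)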
My questions are:
How can I write/overwrite data in S3 using PySpark while avoiding this problem, i.e. without these doubts about the correctness of the written data?
Do I understand correctly how write.mode('overwrite') in PySpark deals with files/directories in S3? (And if so, was this done intentionally, and what is the reason?)