What am I trying to do?
We use PySpark in our project and want to store our data in Amazon S3,
but writing to S3 with PySpark using pyspark.sql.DataFrame.write with mode="overwrite"
does not overwrite the data in S3 correctly if there is already an object at the URL that PySpark writes to.
Steps to reproduce this behavior:
# 0. Import libraries and initialize spark
import pandas as pd
import awswrangler as wr
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.profile.ProfileCredentialsProvider') \
    .config('spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled', 'true') \
    .getOrCreate()
# Output url for PySpark, use the `s3a://` scheme, as it's supported by Hadoop
output_url = f's3a://{MY_BUCKET}/test.csv'
# Output url with the same path for awswrangler, use the `s3://` scheme, as it's supported by awswrangler
wr_output_url = output_url.replace('s3a:', 's3:')
# 1. Write something to the output url using pyspark
df = spark.createDataFrame([{'Key': 'OldFoo'}, {'Key': 'OldBar'}], ['Key'])
df.write.csv(output_url)
# 2. Write something to the output url using awswrangler
wr.s3.to_csv(pd.DataFrame([{'SomeKey': 'SomeValue'} ]), wr_output_url)
# 3. Write our data to the output url using pyspark with mode('overwrite')
df = spark.createDataFrame([{'Key': 'Foo'}, {'Key': 'Bar'}], ['Key'])
df.write.mode('overwrite').csv(output_url)
# 4. Read data from this output url
spark.read.csv(output_url).show()
# Expected output of this show() is the data that we have written in the last write.mode('overwrite')
# (and no other data, because 'overwrite' was used), i.e. a DataFrame of two rows: ['Foo', 'Bar']
# Actual output is:
# +------+
# | _c0|
# +------+
# |OldFoo|
# |OldBar|
# | Foo|
# | Bar|
# +------+
From the actual output one can see both the data that we have just written (using mode('overwrite'))
and some other data (the data that was written by the previous PySpark write()).
This means that we cannot be sure this way that our data has been written properly, and that is a problem.
(We use pyspark==3.3.2 with Hadoop 3.3.2 and run it via
spark-submit --packages=org.apache.hadoop:hadoop-aws:3.3.2,org.apache.spark:spark-hadoop-cloud_2.13:3.3.2,com.amazonaws:aws-java-sdk-bundle:1.11.655 ~/test.py
)
What have I tried to research?
Look at the S3 content before and after the write attempts ("files under files" do not seem to be deleted properly)
If I understand correctly, this has to do with a specific property of S3,
namely that S3 is a key-value store and not a filesystem,
which is not taken into account by PySpark's write.mode('overwrite').
Here is the content of our S3 output url after each write:
# 1. Write something to the output url using pyspark
df = spark.createDataFrame([{'Key': 'OldFoo'}, {'Key': 'OldBar'}], ['Key'])
df.write.csv(output_url)
print('\n'.join(wr.s3.list_objects(wr_output_url+'*')))
# prints:
# s3://<MY_BUCKET>/test.csv/_SUCCESS
# s3://<MY_BUCKET>/test.csv/part-00000-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00001-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00003-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# 2. Write something to the output url using awswrangler
wr.s3.to_csv(pd.DataFrame([{'SomeKey': 'SomeValue'} ]), wr_output_url)
print('\n'.join(wr.s3.list_objects(wr_output_url+'*')))
# prints:
# s3://<MY_BUCKET>/test.csv
# s3://<MY_BUCKET>/test.csv/_SUCCESS
# s3://<MY_BUCKET>/test.csv/part-00000-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00001-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00003-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# 3. Write our data to the output url using pyspark with mode('overwrite')
df = spark.createDataFrame([{'Key': 'Foo'}, {'Key': 'Bar'}], ['Key'])
df.write.mode('overwrite').csv(output_url)
print('\n'.join(wr.s3.list_objects(wr_output_url+'*')))
# prints:
# s3://<MY_BUCKET>/test.csv/_SUCCESS
# s3://<MY_BUCKET>/test.csv/part-00000-503a773b-4f7d-4089-9bce-f87bf56eb3df-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00000-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00001-503a773b-4f7d-4089-9bce-f87bf56eb3df-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00001-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00003-503a773b-4f7d-4089-9bce-f87bf56eb3df-c000.csv
# s3://<MY_BUCKET>/test.csv/part-00003-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
After the write.mode('overwrite') there are both the new files
s3://<MY_BUCKET>/test.csv/part-*-503a773b-*
and the old files
s3://<MY_BUCKET>/test.csv/part-*-52358825-*
(which seem to contain the OldFoo data); that means the old files were not deleted properly despite the mode('overwrite').
Before the write.mode('overwrite') is called, there is both the object
s3://<MY_BUCKET>/test.csv
and some objects under this object, for example
s3://<MY_BUCKET>/test.csv/part-00000-52358825-0cf7-4609-81b1-2819d4205d85-c000.csv
(these seem to be the files written by the previous PySpark write to this URL),
and write.mode('overwrite') seems to delete only the object s3://<MY_BUCKET>/test.csv
and not the objects under it.
(If write.mode('overwrite') is called after another PySpark write, when there are some files in the "directory" s3://<MY_BUCKET>/test.csv/
but no object s3://<MY_BUCKET>/test.csv, it seems to run correctly; a minimal check of this is sketched below.)
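One way to check this, starting from the state after step 2 above, is to delete only the plain test.csv object (the one created by awswrangler) and repeat the overwrite. This is my own check, not anything from the Spark or Hadoop docs; boto3 is used here just to delete that single key, and MY_BUCKET is the same placeholder as above.
import boto3
# remove only the plain `test.csv` object, leaving the old part-* files in place
boto3.client('s3').delete_object(Bucket=MY_BUCKET, Key='test.csv')
df = spark.createDataFrame([{'Key': 'Foo'}, {'Key': 'Bar'}], ['Key'])
df.write.mode('overwrite').csv(output_url)
print('\n'.join(wr.s3.list_objects(wr_output_url + '*')))
# per the observation above, I'd expect only _SUCCESS and the new part-* files to be listed now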
In a "normal" file system such files under a file are impossible, however S3 (as a key-value storage) allows this, therefore it is to expect that libraries working with S3 can deal with it.
Look at the Hadoop sources
I'm not sure how exactly pyspark.sql.DataFrame.write uses the Hadoop libraries,
but if it uses
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/DeleteOperation.java#L166
, the handling of files/directories there looks similar to the logic I have described above (no handling of files under files):
if the object is a directory, then delete its contents;
otherwise delete it as a "simple file", i.e. without deleting the files under it.
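To make that branching concrete, here is a rough, hypothetical Python sketch of the logic as I read it. This is only my paraphrase for illustration (using boto3, not the actual S3A internals), so the real Hadoop code may well differ.
import boto3
from botocore.exceptions import ClientError

def delete_as_i_understand_it(bucket: str, key: str) -> None:
    """Illustration of the 'directory vs simple file' branching described above."""
    s3 = boto3.client('s3')
    try:
        s3.head_object(Bucket=bucket, Key=key)
        exists_as_plain_object = True
    except ClientError:
        exists_as_plain_object = False

    if exists_as_plain_object:
        # "simple file" branch: delete only this single key; objects that merely
        # share the "key/" prefix (our old part-* files) are left untouched
        s3.delete_object(Bucket=bucket, Key=key)
    else:
        # "directory" branch: delete everything under the "key/" prefix
        prefix = key.rstrip('/') + '/'
        for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
            s3.delete_object(Bucket=bucket, Key=obj['Key'])
With both the plain test.csv object and the part-* files present (the state after step 2), this sketch takes the "simple file" branch and leaves the part files in place, which matches what I observe.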
Dirty workaround
Instead of using write.mode('overwrite') I could delete the files under the output URL (if there are any) before each write (i.e. delete them with awswrangler.s3.delete_objects), but this doesn't look like best practice.
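A minimal sketch of that workaround, reusing the variables from the snippets above (awswrangler's delete_objects accepts a list of object paths, so it can be fed the output of list_objects):
# clear whatever is currently under the output prefix (including a possible
# plain `test.csv` object), then write without relying on mode('overwrite')
existing = wr.s3.list_objects(wr_output_url + '*')
if existing:
    wr.s3.delete_objects(existing)
df.write.csv(output_url)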
My questions are:
How can I write/overwrite data in S3 using PySpark while avoiding this problem, i.e. without these doubts about the correctness of the written data?
Do I understand correctly how write.mode('overwrite') in PySpark deals with files/directories in S3? (And if so, was this done intentionally, and what is the reason?)