We use PySpark in our project and want to store our data in Amazon S3. We need to read/write/overwrite data with PySpark and also perform other file operations in the same S3 bucket (upload/download/copy/move/...).
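For context, this is roughly the kind of read/write/overwrite we do with PySpark (a minimal sketch; the bucket name and paths are placeholders, and the hadoop-aws/S3A connector and AWS credentials are assumed to be configured already):

```python
from pyspark.sql import SparkSession

# Assumes the hadoop-aws (S3A) connector and AWS credentials are already configured
spark = SparkSession.builder.appName("s3-example").getOrCreate()

bucket = "my-bucket"  # placeholder

# read a Parquet dataset from S3
df = spark.read.parquet(f"s3a://{bucket}/input/data")

# write it back, overwriting the target path
df.write.mode("overwrite").parquet(f"s3a://{bucket}/output/data")
```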
My questions are:
Which Python libraries can and should be used to work with S3 in a way that is compatible with PySpark? What are the best practices for this?
What I have tried
I have found many answers where the AWS SDK is used (e.g. 1, 2, 3), but that does not seem to be safe: as I learned from this question of mine, there is no way to use awswrangler and PySpark safely on the same paths in S3, because awswrangler does not respect the directory markers used by PySpark/Hadoop, so data written with PySpark can be silently corrupted.
It seems to be possible to use the JVM gateway in PySpark (using something like
sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jvm.java.net.URI.create(f"s3a://{MY_BUCKET}"), sc._jsc.hadoopConfiguration())
), but it is unclear whether this is a proper way to use PySpark (a sketch of what I mean is included below). Some other tools from the Apache Software Foundation also seem to be able to work with files in S3 (such as S3FileSystem from Apache Arrow, or Apache Airflow), but I have doubts whether using them with PySpark for file operations counts as best practice (see the second sketch below).
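To make the JVM gateway option concrete, here is a minimal sketch of what I mean (bucket and key names are placeholders; this just calls the Hadoop FileSystem API through py4j, so it is only as reliable as the underlying S3A connector):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-fs-example").getOrCreate()
sc = spark.sparkContext
MY_BUCKET = "my-bucket"  # placeholder

jvm = sc._jvm
hadoop_conf = sc._jsc.hadoopConfiguration()

# Hadoop FileSystem bound to the bucket (via the S3A connector)
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    jvm.java.net.URI.create(f"s3a://{MY_BUCKET}"), hadoop_conf
)
Path = jvm.org.apache.hadoop.fs.Path

src = Path(f"s3a://{MY_BUCKET}/raw/data.parquet")
dst = Path(f"s3a://{MY_BUCKET}/archive/data.parquet")

print(fs.exists(src))   # check existence
fs.rename(src, dst)     # "move" within the bucket
fs.delete(dst, True)    # recursive delete

# upload / download between the driver's local disk and S3
fs.copyFromLocalFile(Path("/tmp/local.csv"), Path(f"s3a://{MY_BUCKET}/upload/local.csv"))
fs.copyToLocalFile(Path(f"s3a://{MY_BUCKET}/upload/local.csv"), Path("/tmp/downloaded.csv"))
```

It relies on private attributes (sc._jvm, sc._jsc), which is part of why I am unsure whether this is a proper way to use PySpark.

And this is the kind of thing I mean by the Apache Arrow option: a sketch with pyarrow.fs.S3FileSystem (bucket, region, and object keys are placeholders; credentials are assumed to come from the standard AWS mechanisms):

```python
import pyarrow.fs as pafs

# Placeholder region; credentials come from the usual AWS sources (env vars, profile, role)
s3 = pafs.S3FileSystem(region="us-east-1")
bucket = "my-bucket"  # placeholder

# list objects under a prefix
for info in s3.get_file_info(pafs.FileSelector(f"{bucket}/output/data", recursive=True)):
    print(info.path, info.size)

# copy and delete individual objects
s3.copy_file(f"{bucket}/output/data/part-00000.parquet",
             f"{bucket}/backup/part-00000.parquet")
s3.delete_file(f"{bucket}/output/data/part-00000.parquet")
```

My doubt is whether mixing a second S3 client like this with the paths PySpark writes to is considered good practice.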
I'm not sure if Hadoop WebHDFS is suitable, but it does not seem to have an official Python library.