
We use PySpark in our project and want to store our data in Amazon S3. We need to read/write/overwrite data with PySpark and perform other file operations in this S3 bucket (upload/download/copy/move/...).

My questions are:

Which Python libraries can and should be used to work with S3 in a way that is compatible with PySpark? What are the best practices for this?

What I have tried

  • I have found many answers where the AWS SDK is used (e.g. 1, 2, 3), but that does not seem to be safe: as I learned from my other question, there is no way to use awswrangler and PySpark safely on the same S3 paths (awswrangler does not respect the directory markers used by PySpark/Hadoop, so data written with PySpark can be silently corrupted).

  • It seems to be possible to use the JVM gateway in PySpark (something like sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jvm.java.net.URI.create(f"s3a://{MY_BUCKET}"), sc._jsc.hadoopConfiguration())), but it is unclear whether this is a proper way to use PySpark (see the sketch after this list).

  • Some other tools from the Apache Software Foundation seem to be able to work with files in S3 (such as S3FileSystem from Apache Arrow, or Apache Airflow), but I have doubts whether using them with PySpark for file operations counts as best practice.

  • I'm not sure whether Hadoop WebHDFS is suitable, but it does not seem to have an official Python library.
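
For reference, this is roughly what the JVM-gateway approach from the second bullet looks like in full. It is only a sketch: it relies on the underscore-prefixed (internal) attributes sc._jvm and sc._jsc, and the bucket name and paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
MY_BUCKET = "my-bucket"  # placeholder bucket name

# Obtain a Hadoop FileSystem for the bucket through the py4j gateway
# (sc._jvm and sc._jsc are internal PySpark attributes, not a public API).
jvm = sc._jvm
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    jvm.java.net.URI.create(f"s3a://{MY_BUCKET}"), hadoop_conf
)
Path = jvm.org.apache.hadoop.fs.Path

# List objects under a prefix
for status in fs.listStatus(Path(f"s3a://{MY_BUCKET}/data/")):
    print(status.getPath().toString())

# "Move" and delete; on S3A a rename is implemented as copy + delete
fs.rename(Path(f"s3a://{MY_BUCKET}/data/old"), Path(f"s3a://{MY_BUCKET}/data/new"))
fs.delete(Path(f"s3a://{MY_BUCKET}/tmp"), True)  # True = recursive
```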

1 Answer


Spark includes the hadoop-aws module, so the second bullet is "correct"; however, you should ideally put that configuration (along with the AWS credentials for the bucket) in the core-site.xml file rather than in code.

Then, you don't need any Spark JVM config hooks, and you can just use s3a:// paths directly with Spark read/save actions.
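
As a minimal sketch (bucket and paths are placeholders), assuming hadoop-aws is on the classpath and the S3A credentials/endpoint are already configured in core-site.xml (e.g. fs.s3a.access.key / fs.s3a.secret.key), plain DataFrame I/O against s3a:// paths looks like this:

```python
from pyspark.sql import SparkSession

# Credentials are picked up from core-site.xml, so nothing S3-specific is set here.
spark = SparkSession.builder.appName("s3a-example").getOrCreate()

# Read, transform, and overwrite data using ordinary s3a:// paths.
df = spark.read.parquet("s3a://my-bucket/input/")

(df.filter("value > 0")
   .write
   .mode("overwrite")
   .parquet("s3a://my-bucket/output/"))
```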
