These are the points you could consider...
1) maxRecordsPerFile setting:
With
my_table.write.parquet(s3path)
Spark writes out a single file per task. The number of saved files equals the number of partitions of the RDD/DataFrame being saved, so this can result in ridiculously large files (of course you can repartition your data before saving, but keep in mind that repartitioning shuffles the data across the network).
To limit the number of records per file:
my_table.write.option("maxRecordsPerFile", numberOfRecordsPerFile).parquet(s3path)
where numberOfRecordsPerFile is whatever limit you choose. This avoids generating huge files.
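For illustration, a minimal sketch assuming spark is your SparkSession and my_table / s3path come from your own code; the partition count, the 1,000,000-record limit and the bucket path are placeholders, not recommendations:

my_table
  .repartition(10)                              // upper bound on the number of output files
  .write
  .option("maxRecordsPerFile", 1000000L)        // each task rolls over to a new file after 1M rows
  .mode("overwrite")
  .parquet("s3://my-bucket/output/my_table/")   // hypothetical bucket/path

Repartitioning first keeps the file count predictable, while maxRecordsPerFile caps how big each of those files can get.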
2) If you are using AWS EMR (EMRFS), this is another point you can consider:
emr-spark-s3-optimized-committer
The EMRFS S3-optimized committer is not used in the following cases:
- When using the S3A file system.
- When using an output format other than Parquet, such as ORC or text.
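For example, on EMR you can turn the committer on explicitly through the EMR-specific property shown below (a sketch; the property is documented by AWS for EMR 5.19+ and is enabled by default from EMR 5.20, so verify against your release; it has no effect on plain open-source Spark):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emrfs-committer-example")    // hypothetical app name
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()

// The committer only kicks in when writing Parquet through EMRFS (s3:// paths),
// not with the s3a:// filesystem or non-Parquet formats, as listed above.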
3) Using compression, the file output committer algorithm version and other Spark configurations, e.g. as part of your SparkSession builder:
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
.config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
.config("spark.hadoop.parquet.enable.summary-metadata", false)
.config("spark.sql.parquet.mergeSchema", false)
.config("spark.sql.parquet.filterPushdown", true) // for reading purpose
.config("mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.sql.parquet.compression.codec", "snappy")
.getOrCreate()
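A quick usage sketch with the session above (my_table, the bucket path and the event_date column are placeholders):

import org.apache.spark.sql.functions.col

// Write: output files come out snappy-compressed because of the codec setting above.
my_table.write.mode("overwrite").parquet("s3://my-bucket/output/my_table/")

// Read: with filterPushdown enabled, the predicate below can be pushed into the
// Parquet reader so row groups whose statistics rule it out are skipped.
val filtered = spark.read
  .parquet("s3://my-bucket/output/my_table/")
  .filter(col("event_date") === "2020-01-01")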
4) Fast upload and other properties, in case you are using S3A:
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.connection.timeout","100000")
.config("spark.hadoop.fs.s3a.attempts.maximum","10")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
.config("spark.hadoop.fs.s3a.fast.upload.active.blocks","4")
.config("fs.s3a.connection.ssl.enabled", "true")