These are the points you could consider...
1) maxRecordsPerFile setting:
With
my_table.write.parquet(s3path)
Spark writes out a single file per task. The number of saved files equals the number of partitions of the RDD/DataFrame being saved, so this can result in ridiculously large files (of course you can repartition your data before saving, but keep in mind that repartitioning shuffles the data across the network).
To limit the number of records per file:
my_table.write.option("maxRecordsPerFile", numberOfRecordsPerFile).parquet(s3path)
where numberOfRecordsPerFile is whatever limit you choose. This avoids generating huge files.
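For illustration, a minimal sketch assuming spark is your SparkSession and my_table / s3path come from your own code; the partition count, the 1,000,000-record limit and the bucket path are placeholders, not recommendations:

my_table
  .repartition(10)                              // upper bound on the number of output files
  .write
  .option("maxRecordsPerFile", 1000000L)        // each task rolls over to a new file after 1M rows
  .mode("overwrite")
  .parquet("s3://my-bucket/output/my_table/")   // hypothetical bucket/path

Repartitioning first keeps the file count predictable, while maxRecordsPerFile caps how big each of those files can get.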
2) If you are using AWS EMR (EMRFS), this is another point you can consider:
emr-spark-s3-optimized-committer
The EMRFS S3-optimized committer is not used in the following cases:
- When using the S3A file system.
- When using an output format other than Parquet, such as ORC or text.
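For example, on EMR you can turn the committer on explicitly through the EMR-specific property shown below (a sketch; the property is documented by AWS for EMR 5.19+ and is enabled by default from EMR 5.20, so verify against your release; it has no effect on plain open-source Spark):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emrfs-committer-example")    // hypothetical app name
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()

// The committer only kicks in when writing Parquet through EMRFS (s3:// paths),
// not with the s3a:// filesystem or non-Parquet formats, as listed above.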
3) Using compression, the file output committer algorithm version and other Spark configurations, e.g. as part of your SparkSession builder:
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
.config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
.config("spark.hadoop.parquet.enable.summary-metadata", false)
.config("spark.sql.parquet.mergeSchema", false)
.config("spark.sql.parquet.filterPushdown", true) // for reading purpose
.config("mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.sql.parquet.compression.codec", "snappy")
.getOrCreate()
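A quick usage sketch with the session above (my_table, the bucket path and the event_date column are placeholders):

import org.apache.spark.sql.functions.col

// Write: output files come out snappy-compressed because of the codec setting above.
my_table.write.mode("overwrite").parquet("s3://my-bucket/output/my_table/")

// Read: with filterPushdown enabled, the predicate below can be pushed into the
// Parquet reader so row groups whose statistics rule it out are skipped.
val filtered = spark.read
  .parquet("s3://my-bucket/output/my_table/")
  .filter(col("event_date") === "2020-01-01")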
4) Fast upload and other properties, in case you are using S3A:
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.connection.timeout","100000")
.config("spark.hadoop.fs.s3a.attempts.maximum","10")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
.config("spark.hadoop.fs.s3a.fast.upload.active.blocks","4")
.config("fs.s3a.connection.ssl.enabled", "true")