
I'm using Zeppelin and Spark, and I'd like to take a 2TB file from S3, run transformations on it in Spark, and then send it back up to S3 so that I can work with the file in a Jupyter notebook. The transformations are pretty straightforward.

I'm reading the file as a Parquet file. I think it's about 2TB, but I'm not sure how to verify that.

It's about 10M rows and 5 columns, so it's pretty big.

I tried my_table.write.parquet(s3path), and I tried my_table.write.option("maxRecordsPerFile", 200000).parquet(s3path). How do I figure out the right way to write a big Parquet file?

– Cauder

  • 10 million rows isn't necessarily big, but my calculations suggest each column must hold some large JSON blob or something? Can you give some more details on the data structure? – 9bO3av5fw5 May 15 '20 at 22:45

2 Answers


These are the points you could consider...

1) maxRecordsPerFile setting:

With

my_table.write.parquet(s3path)

Spark writes a single file out per task.

The number of files saved equals the number of partitions of the RDD/DataFrame being written, so with only a few partitions this can result in ridiculously large files (of course you can repartition your data before saving, but repartitioning shuffles the data across the network).

To limit the number of records per file:

my_table.write.option("maxRecordsPerFile", numberOfRecordsPerFile).parquet(s3path)

where numberOfRecordsPerFile is whatever limit you wish. This avoids generating huge files.
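
For instance, a minimal sketch that combines repartitioning with maxRecordsPerFile, so you control both how many output files there are and how large each one gets (the partition count of 400 is a made-up number to tune for your data; 200000 is the value from the question):

// Hypothetical sizing: 400 partitions spread the write across tasks,
// and maxRecordsPerFile caps each output file so none grows huge.
my_table
  .repartition(400)
  .write
  .option("maxRecordsPerFile", 200000)
  .parquet(s3path)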

2) If you are using AWS EMR (EMRFS), this could be another point to consider (a sketch of enabling the committer follows the list below).

emr-spark-s3-optimized-committer

When the EMRFS S3-optimized committer is not used:

  • When using the S3A file system.
  • When using an output format other than Parquet, such as ORC or text.
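
A minimal sketch of turning the optimized committer on explicitly, in the same .config style as point 3 below. This assumes you are on EMR, where spark.sql.parquet.fs.optimized.committer.optimization-enabled is the documented switch (on recent EMR releases it is already enabled by default):

// Assumption: running on EMR. Per the bullets above, the optimized committer
// only applies to Parquet written through EMRFS (s3:// paths), not s3a:// or
// non-Parquet formats.
.config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")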

3) Using compression, the output committer algorithm version, and other Spark configurations:

.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
.config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
.config("spark.hadoop.parquet.enable.summary-metadata", false)
.config("spark.sql.parquet.mergeSchema", false)
.config("spark.sql.parquet.filterPushdown", true) // for reading purpose 
.config("mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.sql.parquet.compression.codec", "snappy")
.getOrCreate()
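
The codec can also be chosen per write rather than at session build time; a small sketch using my_table and s3path from the question:

// Equivalent to spark.sql.parquet.compression.codec, but scoped to this one write.
my_table.write
  .option("compression", "snappy")
  .parquet(s3path)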

4) Fast upload and other properties, in case you are using s3a:

  .config("spark.hadoop.fs.s3a.fast.upload","true")
  .config("spark.hadoop.fs.s3a.fast.upload","true")
  .config("spark.hadoop.fs.s3a.connection.timeout","100000")
  .config("spark.hadoop.fs.s3a.attempts.maximum","10")
  .config("spark.hadoop.fs.s3a.fast.upload","true")
  .config("spark.hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
  .config("spark.hadoop.fs.s3a.fast.upload.active.blocks","4")
  .config("fs.s3a.connection.ssl.enabled", "true")
– Ram Ghadiyaram

  1. The S3A connector will incrementally write blocks, but the (obsolete) version shipping with Spark builds that bundle the hadoop-2.7.x JARs doesn't handle it very well. If you can, update all hadoop-* JARs to 2.8.5 or 2.9.x.
  2. The option fs.s3a.multipart.size controls the size of the block. There's a limit of 10,000 blocks per upload, so the maximum file you can upload is that size * 10,000. For very large files, use a bigger number than the default of "64M" (see the sketch below the list).
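
For example, the arithmetic behind that limit, plus one way to raise the part size (the 512M figure is only an illustration, and the size suffix assumes Hadoop 2.8+):

// With the default 64M part size, one file tops out at 64 MB * 10,000 ≈ 640 GB.
// A single 2 TB object would need parts of at least 2 TB / 10,000 ≈ 210 MB,
// so pick a comfortably larger value.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.multipart.size", "512M")
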
– stevel