Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

Question

Recently we migrated from "EMR on HDFS" --> "EMR on S3" (EMRFS with consistent view enabled) and we realized the Spark 'SaveAsTable' (parquet format) writes to S3 were ~4x slower as compared to HDFS but we found a workaround of using the DirectParquetOutputCommitter -[1] w/ Spark 1.6.

Reason for S3 slowness - We had to pay the so called Parquet tax-[2] where the default output committer writes to a temporary table and renames it later where the rename operation in S3 is very expensive

Also we do understand the risk of using 'DirectParquetOutputCommitter' which is possibility of data corruption w/ speculative tasks enabled.

Now w/ Spark 2.0 this class has been deprecated and we're wondering what options do we have on the table so that we don't get to bear the ~4x slower writes when we upgrade to Spark 2.0. Any Thoughts/suggestions/recommendations would be highly appreciated.

One workaround that we can think of is - Save on HDFS and then copy it to S3 via s3DistCp (any thoughts on how can this be done in sane way as our Hive metadata-store points to S3?)

Also looks like NetFlix has fixed this -[3], any idea on when they're planning to open source it?

Thanks.

[1] - https://github.com/apache/spark/blob/21d5ca128bf3afd5c2d4c7fcc56240e28443474f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala

[2] - https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/

[3] - https://www.youtube.com/watch?v=85sew9OFaYc&feature=youtu.be&t=8m39s http://www.slideshare.net/AmazonWebServices/bdt303-running-spark-and-presto-on-the-netflix-big-data-platform

I've just encountered the same issue and reverted back to emr 4.8. Curios to see the answers here. Some more info can be found here: https://issues.apache.org/jira/browse/SPARK-10063 — Niros, Sep 22 '16 at 12:45

score 16 · Answer 1 · answered Oct 13 '16 at 17:37

You can use: sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

since you are on EMR just use s3 (no need for s3a)

We are using Spark 2.0 and writing Parquet to S3 pretty fast (about as fast as HDFS)

if you want to read more check out this jira ticket SPARK-10063

score 1 · Answer 2 · answered Jul 13 '17 at 09:04

1

I think the S3 committer from Netflix is already open sourced at: https://github.com/rdblue/s3committer.

answered Jul 13 '17 at 09:04

viirya

644
6
8

As of today, it does not support writing parquet files: http://apache-spark-developers-list.1001551.n3.nabble.com/Output-Committers-for-S3-td21033.html – Cristian Jul 17 '17 at 07:39

Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

2 Answers2

Linked