
I'm trying to use the magic output committer, but whatever I do I get the default output committer:

INFO FileOutputCommitter: File Output Committer Algorithm version is 10
22/03/08 01:13:06 ERROR Application: Only 1 or 2 algorithm version is supported

That log line is how I can tell which committer is in use, according to the Hadoop docs. What am I doing wrong? This is my relevant conf (using SparkConf()); I've tried many other combinations.

  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "10")
  .set("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .set("fs.s3a.committer.name", "magic")
  .set("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .set("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")

I don't have any other configuration relevant to this, in code or in conf files (Hadoop or Spark); maybe I should? The paths I'm writing to start with s3://. I'm using Hadoop 3.2.1, Spark 3.0.0 and EMR 6.1.1.
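
For reference, a hedged sketch of how this configuration might look with the visible issues fixed, assuming a non-EMR setup where the S3A committers apply (on EMR, see the accepted answer below) and that Spark's hadoop-cloud module is on the classpath. Two things stand out in the conf above: only algorithm versions 1 and 2 exist (which is what the ERROR line is complaining about), and fs.s3a.committer.name was set without the spark.hadoop. prefix, so it never reaches the Hadoop configuration. The bucket name below is hypothetical.

  import org.apache.spark.SparkConf

  // A sketch, not a verified fix: S3A magic committer conf for a
  // non-EMR cluster.
  val conf = new SparkConf()
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    // Only algorithm versions 1 and 2 exist; "10" triggers the ERROR above.
    .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .set("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    // Needs the spark.hadoop. prefix, or it never reaches the Hadoop conf.
    .set("spark.hadoop.fs.s3a.committer.name", "magic")
    .set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
      "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    .set("spark.sql.sources.commitProtocolClass",
      "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .set("spark.sql.parquet.output.committer.class",
      "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")

  // The magic committer only engages for s3a:// URLs, not s3://:
  // df.write.parquet("s3a://my-bucket/output/")   // hypothetical bucket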

idan ahal
  • EMR has its own s3-ready committer, covered somewhere in its docs – stevel Mar 08 '22 at 16:35
  • Do you mean this one: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html So I should just update my EMR version and that's enough? And the s3a committers are for when I'm not using EMR? – idan ahal Mar 08 '22 at 19:05
  • yes and yes. always read the docs before asking on stack overflow as it shows you've done the homework and saves at least 24h waiting for a reply -a reply which may be wrong – stevel Mar 09 '22 at 15:28
  • Actually I did that and the run time was reduced by 50% – idan ahal Mar 09 '22 at 20:15

1 Answer


So, after a lot of reading plus stevel's comment, I found what I needed. I'm using the optimized output committer, which is built into EMR and used by default. The reason it didn't seem to work at first is that the AWS optimized committer is only activated when it can be. Until EMR 6.4.0 it worked only under certain conditions, but from 6.4.0 it works for every write type (text, CSV, Parquet) and with RDDs, DataFrames and Datasets. So I just needed to update to EMR 6.4.0.

There was an improvement of 50-60 percent in execution time.

The optimized committer requirements are listed in the EMR docs: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html
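
For illustration, a minimal sketch of a job that, on EMR 6.4.0 or later with stock configuration, should go through the optimized committer for each of these write types (the bucket path is hypothetical). Per the comments below, the way to confirm it kicked in is to grep the logs for the committer class names.

  import org.apache.spark.sql.SparkSession

  // Minimal sketch, assuming EMR >= 6.4.0 with default settings; no
  // committer configuration is set, since the EMRFS S3-optimized
  // committer is the default on EMR.
  object OptimizedCommitterDemo {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("optimized-committer-demo")
        .getOrCreate()
      import spark.implicits._

      val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

      // From 6.4.0 the optimized committer covers Parquet, text and CSV,
      // and RDD, DataFrame and Dataset writes alike.
      df.write.mode("overwrite").parquet("s3://my-bucket/demo/parquet/")
      df.write.mode("overwrite").csv("s3://my-bucket/demo/csv/")
      df.select($"value").write.mode("overwrite").text("s3://my-bucket/demo/text/")

      spark.stop()
    }
  }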

idan ahal
  • happy to hear this. they probably targeted parquet as the lib is fussy. It only works with committers which are a subclass of ParquetOutputCommitter, so either you implement that or set up some bridging mechanism. I guess they focused on Parquet first and added text/csv etc later. – stevel Mar 11 '22 at 15:24
  • oh, one thing I'm now curious about. did they write a zero byte _SUCCESS file, or was there some JSON instead. Convention in the asf and ibm committers is we put some JSON identifying the committer and providing stats and diagnostics. makes verifying whether the new committer was used trivial -just require the file length to be > 0 – stevel Mar 11 '22 at 15:26
  • The _SUCCESS file is still 0 in size. It's also very hard to tell whether or not the optimization was activated; you have to search the logs for INFO messages mentioning "EmrOptimizedParquetOutputCommitter" and "FileSystemOptimizedCommitter" – idan ahal Mar 12 '22 at 16:33
  • that's a shame; the AWS team missed a feature there. latest iterations of the ASF cloud committers (the MAPREDUCE-7341 manifest committer for GCS and ABFS) use the same formats, add a report dir, and collect stats on how long stages took, throttling events, etc. – stevel Mar 14 '22 at 10:36