
While a job is running, Spark writes its output to a `_temporary` folder. Once the job finishes, the data is moved to its final destination. However, when there are tens of thousands of partitions, moving the files takes quite some time. Question: how can this move be sped up? The applications run in yarn-cluster mode on a bare-metal Hadoop cluster, not on AWS (no S3, EMR, etc.).

Update: my job takes around 1 hour to generate 2.3 TB of data in 25,000 partitions, and another hour to move the data out of `_temporary`.

pgrandjean
  • You can use `coalesce(nbr)` at the beginning of your pipeline to reduce the nbr of partitions. – Xavier Guihot Mar 05 '18 at 19:35
  • `coalesce` isn't a scalable solution. Old answer and things change, but you might want to check this answer https://stackoverflow.com/questions/36927918/using-spark-to-write-a-parquet-file-to-s3-over-s3a-is-very-slow/36992096#36992096 – David Mar 05 '18 at 19:42
  • Possible duplicate of [Using Spark to write a parquet file to s3 over s3a is very slow](https://stackoverflow.com/questions/36927918/using-spark-to-write-a-parquet-file-to-s3-over-s3a-is-very-slow) – stevel Mar 07 '18 at 14:00
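A minimal sketch of the `coalesce` suggestion from the first comment, assuming hypothetical HDFS input and output paths and an arbitrary target partition count:

```scala
import org.apache.spark.sql.SparkSession

object CoalesceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-example").getOrCreate()

    // Hypothetical input path; the point is only to show where coalesce fits.
    val df = spark.read.parquet("hdfs:///path/to/input")

    // Reduce the number of partitions (and therefore output files) before writing.
    // Unlike repartition, coalesce avoids a full shuffle.
    df.coalesce(200)
      .write
      .parquet("hdfs:///path/to/output")
  }
}
```

As the second comment points out, this only reduces the number of files to be moved; it does not change how the commit itself works.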

1 Answer


You can speed it up by having each task move its files during task commit, by setting `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` to 2. However, if a task fails during the commit process, the outcome is "undefined". You are trading safety for speed.
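A minimal sketch of how that option could be set on a SparkSession-based job; the application name and output path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object CommitterV2Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("committer-v2-example")
      // FileOutputCommitter algorithm version 2: each task moves its files into the
      // final destination at task commit, so job commit no longer has to move
      // everything out of _temporary in a single pass at the end.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // Build and write the output as usual, e.g.:
    // df.write.partitionBy("dt").parquet("hdfs:///path/to/output")
  }
}
```

The same setting can also be passed at submit time, e.g. `spark-submit --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 ...`.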

stevel