
I am using apache-spark. My Spark job creates ~10k small files (~50 MB each) every day, which would put excessive load on the NameNode in HDFS.

I tried using coalesce to reduce the number of output files, but it is slowing down the job. Can anyone suggest what I should use instead?
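For reference, a minimal sketch of the coalesce-before-write pattern described above (the input/output paths and the `parse` function are placeholders, and the target partition count is an example, not a recommendation):

```scala
// Sketch only: requires a running Spark context (sc).
// coalesce(n) merges existing partitions down to n without a full
// shuffle, so the job writes n output files instead of thousands.
val raw = sc.textFile("hdfs:///input/events")   // placeholder path
val processed = raw.map(line => parse(line))    // hypothetical transformation

processed
  .coalesce(100)                                // target ~100 output files
  .saveAsTextFile("hdfs:///output/events")      // placeholder path
```

Note that because coalesce avoids a shuffle, the reduced partition count can propagate upstream and shrink the parallelism of earlier stages, which is one plausible cause of the slowdown described in the question.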

ToBeSparkShark
  • Do the logs show you why it runs slower with coalesce()? Is it only slowing down the saving to disk, or does it also reduce parallelism in upstream tasks? Perhaps you need to play around with the coalesce() parameters instead of reducing the number too drastically... – Jedi Jun 04 '16 at 04:09

3 Answers


We have a similar case. We run a batch job every hour that merges all new files. You can do this with another Spark job or with any other framework that works best for you. This way you decouple the two tasks completely and get the best performance out of each one.
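A hedged sketch of what such an hourly compaction job could look like in Spark (all paths are placeholders, and the target file count of 10 is an assumption you would tune to your data volume):

```scala
// Sketch only: requires a running Spark context (sc).
// Read the hour's many small files and rewrite them as a few large
// ones; the original job stays fast, and compaction runs separately.
val small = sc.textFile("hdfs:///output/events/2016-06-04-13/*")

small
  .repartition(10)                                        // ~10 large files
  .saveAsTextFile("hdfs:///compacted/events/2016-06-04-13")
```

After the compacted output is verified, the small-file directory can be deleted, keeping NameNode metadata growth bounded.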

z-star

I figured out one solution!

Call coalesce with #partitions equal to #executors

By doing this, each executor ends up writing roughly one output file for its own partition.

Please let me know if this looks good!

ToBeSparkShark
  • Spark produces as many output files as the number of partitions. One way to have a single output file is to call repartition(1) before writing to disk. Not sure about the coalesce() approach: even if you set #partitions equal to #executors, you have no guarantee that one executor has exactly one partition – Marco Jun 04 '16 at 20:02
  • @Marco coalesce(1) and repartition(1) are very bad practice. – eliasah Jun 05 '16 at 08:13
  • @eliasah I agree with you... that is why I suggested using coalesce(#noOfExecutors). What do you think about that? – ToBeSparkShark Jun 05 '16 at 15:56

Have you tried repartition(#executors)? It may work better than coalesce(#executors).

According to the Scaladoc for the coalesce method:

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

Also please refer to: Spark: coalesce very slow even the output data is very small
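A hedged sketch contrasting the two calls the quoted Scaladoc describes (`processed` and `numExecutors` are placeholders for your RDD and your cluster's executor count):

```scala
// Sketch only: requires a running Spark context.
// repartition(n) always performs a full shuffle, so upstream stages
// keep their original parallelism and only the write uses n tasks.
val viaRepartition = processed.repartition(numExecutors)

// coalesce(n) narrows partitions without a shuffle; cheaper, but the
// reduced parallelism can propagate upstream and slow earlier stages.
val viaCoalesce = processed.coalesce(numExecutors)

viaRepartition.saveAsTextFile("hdfs:///output/events")  // placeholder path
```

The shuffle added by repartition is the price paid for keeping upstream stages fully parallel, which is exactly the trade-off the Scaladoc excerpt above points out.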

CyberPlayerOne