I am using apache-spark. My Spark job creates ~10k small files (~50 MB each) every day, which would overwhelm the NameNode in HDFS.
I tried using coalesce to reduce the number of output files, but it is slowing down the job. Can anyone suggest what I should use instead?
We have a similar case. We run a batch job every hour that merges all new files. You can do this with another Spark job or with any other framework that works best for you. This way you decouple the two tasks completely and get the best performance out of each one.
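Here is a minimal sketch of such a compaction job in Spark (Scala). The Parquet format, the paths, and the target of 8 output files are all assumptions for illustration; derive the file count from total data size divided by your desired file size.

import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-small-files")
      .getOrCreate()

    // Hypothetical paths; substitute your own locations.
    val inputPath  = "hdfs:///data/events/date=2023-01-01"
    val outputPath = "hdfs:///data/events_compacted/date=2023-01-01"

    // Read all the small files and rewrite them as a few large ones.
    spark.read.parquet(inputPath)
      .repartition(8) // illustrative target; tune toward your HDFS block size
      .write.mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}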
I figured out one solution!
Call coalesce with the number of partitions equal to the number of executors.
This way each executor writes only its own output file.
Please let me know if this looks good!
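A minimal sketch of this approach, assuming the executor count is available via spark.executor.instances (with dynamic allocation you would derive it differently):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-to-executors").getOrCreate()

// Assumes a static executor count was set at submit time.
val numExecutors = spark.sparkContext.getConf.getInt("spark.executor.instances", 1)

// df stands in for your job's actual output; the path is a placeholder.
val df = spark.read.parquet("hdfs:///data/input")

// coalesce avoids a shuffle: existing partitions are merged so that
// roughly one output file is written per executor.
df.coalesce(numExecutors)
  .write.mode("overwrite")
  .parquet("hdfs:///data/output")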
Have you tried repartition(#executors)? It may perform better than coalesce(#executors).
According to the Scaladoc for the coalesce method:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
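A short sketch of the trade-off described above (df and the paths are placeholders):

// coalesce merges partitions without a shuffle, so the whole final stage
// runs with only 10 tasks and upstream work loses parallelism:
df.coalesce(10).write.parquet("hdfs:///data/out_coalesced")

// repartition inserts a full shuffle, so upstream stages keep their
// parallelism and only the write itself uses 10 tasks:
df.repartition(10).write.parquet("hdfs:///data/out_repartitioned")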
See also: Spark: coalesce very slow even the output data is very small