I wrote a Spark program that mimics the functionality of an existing MapReduce job. The MR job takes about 50 minutes every day, but the Spark job took only 9 minutes. That’s great!

When I looked at the output directory, I noticed that it created 1,020 part files. The MR job uses only 20 reducers, so it creates only 20 files. We need to cut down on the number of output files; otherwise our HDFS namespace will be full in no time.

I am trying to figure out how I can reduce the number of output files under Spark. It seems that 1,020 tasks are being triggered and each one creates a part file. Is this correct? Do I have to change the level of parallelism to cut down the number of tasks, and thereby the number of output files? If so, how do I set it? I am afraid that cutting down the number of tasks will slow the process down, but I can test that!
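For reference, this is roughly how I checked where the 1,020 comes from (just a sketch; the path and configuration are placeholders):

JavaSparkContext ctx = new JavaSparkContext(/* my configuration */);
JavaRDD<String> input = ctx.textFile("path/to/daily/input");

// saveAsTextFile writes one part-NNNNN file per partition, so checking the
// partition count of the RDD just before saving shows how many files will be written.
System.out.println("input partitions: " + input.partitions().size());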


1 Answer

Cutting down the number of reduce tasks will slow down the process for sure. However, it still should be considerably faster than Hadoop MapReduce for your use case.

In my opinion, the best way to limit the number of output files is to use the coalesce(numPartitions) transformation. Below is an example:

JavaSparkContext ctx = new JavaSparkContext(/* your configuration */);

JavaRDD<String> myData = ctx.textFile("path/to/my/file.txt");

// Suppose we have 1,020 partitions and thus 1,020 map tasks
JavaRDD<String> mappedData = myData.map(line -> line /* your map function here */);

// We need 20 output files, so coalesce down to 20 partitions before saving
JavaRDD<String> newData = mappedData.coalesce(20);
newData.saveAsTextFile("output path");

In this example, the map function would still be executed by 1,020 tasks; that level of parallelism is not altered in any way. However, after coalescing, there would only be 20 partitions to work with, so 20 output files would be saved at the end of the program.

As mentioned earlier, take into account that this method will be slower than writing 1,020 output files, since the data needs to be consolidated into fewer partitions (from 1,020 down to 20).

Note: please also take a look at the repartition transformation.
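For completeness, a minimal sketch of what using repartition instead would look like (it reuses mappedData from the example above; unlike coalesce, repartition always performs a full shuffle and can also increase the number of partitions):

// repartition(n) always shuffles, redistributing the data evenly across
// exactly 20 partitions; it can also be used to increase the partition count.
JavaRDD<String> repartitioned = mappedData.repartition(20);
repartitioned.saveAsTextFile("another output path");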

  • Thanks a lot Mikel. It worked very well. Also, I removed some bad code on my side, so it actually runs even faster now. Total time: 5 minutes! Apache Spark ROCKS! – DilTeam Sep 22 '14 at 21:25
  • I am glad you got it working. If this answer was the solution to your problem, please check it as the final answer to close the question. – Mikel Urkia Sep 23 '14 at 06:50
  • I have been taking a look at your other questions, and none of them has been marked as answered although they were actually solved. Please update your questions and mark the answers (even if they were given by you). – Mikel Urkia Sep 23 '14 at 14:04
  • I didn't know about this "final answer" feature. I've checked the 'check mark' against the answer. Hope this is the right way to do it. If not, please let me know. Thanks. – DilTeam Sep 23 '14 at 17:28
  • Indeed, that is the correct way of doing it. It is important to mark questions as answered once they are solved, so they are closed and easier to find. – Mikel Urkia Sep 23 '14 at 20:32
  • In my case, I actually had to repartition the RDD[_] before calling coalesce, so in the example @MikelUrkia wrote, `JavaRDD newData = mappedData.repartition(20).coalesce(20)` worked. I was using Spark 2.3.1 and the Scala API. – Joyoyoyoyoyo Nov 01 '18 at 19:24
  • Spark will push the coalesce operation down to as early a point as possible. This means it can decrease the parallelism of earlier operations, so your map operation will run with 20 partitions as well. – idan ahal Jul 18 '23 at 10:57
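Following up on that last comment, here is a minimal sketch of one way to keep the map stage at its original parallelism by forcing a shuffle boundary (it reuses mappedData from the answer's example; whether the extra shuffle is worth it depends on the job):

// Passing shuffle = true makes coalesce insert a shuffle boundary, so the map
// stage keeps its original 1,020 tasks and only the final write runs with 20
// partitions. This is effectively the same as calling repartition(20).
JavaRDD<String> newData = mappedData.coalesce(20, true);
newData.saveAsTextFile("output path");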