Gathering multiple mappers' results in sorted order at the Reducer in Hadoop

Question

I have multiple very large files (nearly 500 MB) as input to my MR program. I split these files into equal-size partitions, and each mapper gets a single partition of a file.

Mapper: Key = (filename, partition_number) and Value = (character stream of partition)

I apply some computation to the value (character stream) in the mapper. I want to gather the results for an input file (for all of its partitions) in one reducer, so I thought of using 'filename' as the reducer input key. But the mapper outputs must be gathered in partition order in the reducer (like [partition1 o/p + partition2 o/p + ... + partitionN o/p]).

Could you please suggest the logic for this? Thanks.

alexeipab · Accepted Answer · 2016-04-04T10:42:18.077

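You need a secondary sort. Hadoop sorts reducer input by key but does not order the values within a key, so the partition number has to be part of the key. For an example, see https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/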

In this case:

Primary (sort) comparator: compares on [filename, partition_number]
Group comparator: compares on filename only
Partitioner: partitions on filename only
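
To make the three pieces concrete, here is a minimal sketch in plain Java that simulates the shuffle without any Hadoop dependency. The names (SecondarySortSketch, ChunkKey, MapOutput, partitionFor, shuffleAndReduce) are hypothetical and chosen only for illustration. In a real job the same roles would be filled by a WritableComparable composite key and by classes registered through Job.setSortComparatorClass, Job.setGroupingComparatorClass and Job.setPartitionerClass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SecondarySortSketch {

    /** Composite mapper output key (hypothetical name): filename plus partition number. */
    public static final class ChunkKey {
        public final String filename;
        public final int partitionNumber;

        public ChunkKey(String filename, int partitionNumber) {
            this.filename = filename;
            this.partitionNumber = partitionNumber;
        }
    }

    /** One mapper output record: composite key plus the result computed for that partition. */
    public static final class MapOutput {
        public final ChunkKey key;
        public final String value;

        public MapOutput(ChunkKey key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Primary (sort) comparator: filename first, then partition number.
    public static final Comparator<ChunkKey> SORT_COMPARATOR =
            Comparator.comparing((ChunkKey k) -> k.filename)
                      .thenComparingInt(k -> k.partitionNumber);

    // Group comparator: filename only, so all partitions of a file form one reduce group.
    public static final Comparator<ChunkKey> GROUP_COMPARATOR =
            Comparator.comparing((ChunkKey k) -> k.filename);

    // Partitioner: filename only, so all partitions of a file reach the same reducer.
    public static int partitionFor(ChunkKey key, int numReducers) {
        return (key.filename.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Simulates partitioning, sorting and grouping; returns filename -> gathered output. */
    public static Map<String, String> shuffleAndReduce(List<MapOutput> mapOutputs,
                                                       int numReducers) {
        // Route every record to a reducer bucket using the filename-only partitioner.
        List<List<MapOutput>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (MapOutput out : mapOutputs) {
            buckets.get(partitionFor(out.key, numReducers)).add(out);
        }

        Map<String, String> results = new TreeMap<>();
        for (List<MapOutput> bucket : buckets) {
            // Each reducer sees its records sorted by the full composite key.
            bucket.sort((x, y) -> SORT_COMPARATOR.compare(x.key, y.key));

            ChunkKey groupKey = null;
            StringBuilder gathered = new StringBuilder();
            for (MapOutput out : bucket) {
                boolean newGroup = groupKey == null
                        || GROUP_COMPARATOR.compare(groupKey, out.key) != 0;
                if (newGroup && groupKey != null) {
                    // The previous file is complete: emit its concatenated output.
                    results.put(groupKey.filename, gathered.toString());
                    gathered = new StringBuilder();
                }
                if (newGroup) {
                    groupKey = out.key;
                }
                gathered.append(out.value);
            }
            if (groupKey != null) {
                results.put(groupKey.filename, gathered.toString());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Mapper outputs arrive in arbitrary order.
        List<MapOutput> outputs = Arrays.asList(
                new MapOutput(new ChunkKey("a.txt", 3), "[a3]"),
                new MapOutput(new ChunkKey("b.txt", 2), "[b2]"),
                new MapOutput(new ChunkKey("a.txt", 1), "[a1]"),
                new MapOutput(new ChunkKey("b.txt", 1), "[b1]"),
                new MapOutput(new ChunkKey("a.txt", 2), "[a2]"));
        System.out.println(shuffleAndReduce(outputs, 2));
    }
}
```

The point of the split is that sorting uses the whole key while grouping and partitioning ignore partition_number, so one reduce call receives every partition of a file, already in partition order.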
