I have multiple very large files (nearly 500 MB) as input to my MapReduce program. I split these files into equal-size partitions, and each mapper gets a single partition of a file.

Mapper: key = (filename, partition_number), value = (character stream of the partition)

I apply some computation to the value (the character stream) in the mapper. I want to gather the results for one input file (for all of its partitions) in a single reducer, so I thought of using 'filename' as the reducer input key. But the mapper outputs must arrive at the reducer in partition order (like [partition1 output + partition2 output + ... + partitionN output]).

Can you please suggest the logic for this? Thanks.

Sumit

1 Answer

You need a secondary sort. For an example see https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/

In this case:

  • Primary Comparator compares on [filename, partition_number]
  • Group Comparator on filename only
  • Partitioner on filename only
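
To make the three pieces concrete, here is a minimal plain-Java sketch that simulates what they do together. It is not Hadoop API code: the class `SecondarySortSketch`, its `MapOutput` type, and the methods `partition` and `reduce` are hypothetical names made up for illustration. In a real job you would write the equivalent partitioner and comparator classes and register them with `Job.setPartitionerClass`, `Job.setSortComparatorClass`, and `Job.setGroupingComparatorClass`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of secondary sort; not Hadoop code.
class SecondarySortSketch {

    // One map-output record: composite key (filename, partitionNumber) plus the mapper's result.
    static final class MapOutput {
        final String filename;
        final int partitionNumber;
        final String value;

        MapOutput(String filename, int partitionNumber, String value) {
            this.filename = filename;
            this.partitionNumber = partitionNumber;
            this.value = value;
        }
    }

    // Partitioner on filename only: every partition of a file goes to the same reducer.
    static int partition(MapOutput record, int numReducers) {
        return (record.filename.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Primary (sort) comparator on [filename, partitionNumber].
    static final Comparator<MapOutput> SORT =
            Comparator.comparing((MapOutput r) -> r.filename)
                      .thenComparingInt(r -> r.partitionNumber);

    // Simulates one reducer: sort by the full key, then group on filename only,
    // so each group sees its values in partition order.
    static Map<String, String> reduce(List<MapOutput> records) {
        List<MapOutput> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        Map<String, StringBuilder> groups = new LinkedHashMap<>();
        for (MapOutput r : sorted) {
            groups.computeIfAbsent(r.filename, k -> new StringBuilder()).append(r.value);
        }
        Map<String, String> result = new LinkedHashMap<>();
        groups.forEach((file, joined) -> result.put(file, joined.toString()));
        return result;
    }

    public static void main(String[] args) {
        List<MapOutput> shuffled = Arrays.asList(
                new MapOutput("b.txt", 2, "Y"),
                new MapOutput("a.txt", 3, "C"),
                new MapOutput("a.txt", 1, "A"),
                new MapOutput("b.txt", 1, "X"),
                new MapOutput("a.txt", 2, "B"));
        System.out.println(reduce(shuffled)); // prints {a.txt=ABC, b.txt=XY}
    }
}
```

The point of the sketch is that the reducer never sorts values itself: the sort on the full composite key puts the records in partition order, and grouping on filename alone hands them to one reduce call in that order.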
alexeipab