I'm new to Hadoop MapReduce and have run into a problem with binning output values in the mapper. My mapper emits (Text, IntWritable) pairs, with a dataset ID as the key and the length of the metadata description as the value. My goal is to bin the metadata lengths into 3 groups and count how many fall into each: 1-200 characters, 201-400 characters, and 401+ characters. The mapper output looks as follows (first column is the key, second column is the value, i.e. the length of the metadata):

1   256
2   344
3   234
4   160
5   432
6   121
7   551
8   239
9   283
10   80
...

Based on the values above the binning result should display:

1-200     3
201-400   5
401-...   2

Any ideas on how to approach it? Should I do it as the Mapper cleanup, Combiner or within a Reducer? Code examples or references to other online sources would be appreciated. Thank you.

simtim

1 Answer

The data needs to be binned into three bins that are known in advance, so these bins can be declared as constants in the Mapper:

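private static final Text BIN1 = new Text("1-200");     // lengths 1 to 200
private static final Text BIN2 = new Text("201-400");   // lengths 201 to 400
private static final Text BIN3 = new Text("401-...");   // lengths 401 and above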

Now create a HashMap<Text, Integer> as a field of the Mapper. As the Mapper reads each record in map(), it increments the count for the matching bin in this HashMap instead of emitting anything. Then write the contents of the HashMap in the cleanup() method, which runs once after the last record. The output of the Mapper is then (Text, IntWritable): the bin label and the count for that bin.
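
A minimal sketch of that in-mapper counting, written as plain Java so it runs without a Hadoop dependency. The class name BinningSketch and its map()/cleanup() methods are hypothetical stand-ins for the real Mapper callbacks; in an actual job the counting would go in Mapper.map() and the loop over the counts would call context.write() from cleanup(). It assumes each input line is an ID followed by whitespace and the length, as in the question, and that any length below 1 falls into the first bin.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the Mapper; no Hadoop classes are used.
class BinningSketch {
    // Per-mapper counts keyed by bin label, pre-filled so empty bins still appear.
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    BinningSketch() {
        counts.put("1-200", 0);
        counts.put("201-400", 0);
        counts.put("401-...", 0);
    }

    // Pick the bin label for one metadata length.
    // Assumption: lengths below 1 are lumped into the first bin.
    static String binOf(int length) {
        if (length <= 200) return "1-200";
        if (length <= 400) return "201-400";
        return "401-...";
    }

    // Plays the role of map(): parse "id <whitespace> length" and bump one counter.
    void map(String line) {
        String[] fields = line.trim().split("\\s+");
        int length = Integer.parseInt(fields[1]);
        counts.merge(binOf(length), 1, Integer::sum);
    }

    // Plays the role of cleanup(): the pairs that would be written once at the end.
    Map<String, Integer> cleanup() {
        return counts;
    }

    public static void main(String[] args) {
        String[] sample = {
            "1   256", "2   344", "3   234", "4   160", "5   432",
            "6   121", "7   551", "8   239", "9   283", "10   80"
        };
        BinningSketch mapper = new BinningSketch();
        for (String line : sample) {
            mapper.map(line);
        }
        for (Map.Entry<String, Integer> e : mapper.cleanup().entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Fed the ten sample rows from the question, this gives counts of 3, 5 and 2 for the three bins.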

If the job has multiple Mappers, their outputs can be aggregated in the Reducer with a simple sum of the Iterable<IntWritable> values for each key (Text).
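
The summing step can be sketched the same way in plain Java. BinSumSketch and its reduce() method are hypothetical names standing in for a Reducer's reduce() callback, and the partial counts in main() are made up to represent two Mappers that each wrote their bins from cleanup().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the Reducer; no Hadoop classes are used.
class BinSumSketch {
    // Plays the role of reduce(): add up every partial count received for one bin.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up partial counts, grouped by key the way the shuffle would group them.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("1-200", Arrays.asList(2, 1));
        grouped.put("201-400", Arrays.asList(3, 2));
        grouped.put("401-...", Arrays.asList(1, 1));
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

Because the sum is associative and commutative, the same logic could in principle also be registered as a Combiner, though with counts already aggregated per Mapper in cleanup() there is little left for a Combiner to do.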

YoungHobbit