Hadoop: How does OutputCollector work during MapReduce?

Question

I want to know whether the OutputCollector instance (output) used in the map function, as in output.collect(key, value), stores the key/value pairs somewhere. Even if it emits to the reducer function, there must be an intermediate file, right? What are those files? Are they visible to, and decided by, the programmer? Are the OutputKeyClass and OutputValueClass that we specify in the main function these places of storage? [Text.class and IntWritable.class]

I'm giving the standard code for the Word Count example in MapReduce, which can be found in many places on the net.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();  // reused for every emitted key

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);  // emit (word, 1)
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));  // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Why do you want to access these temporary files? Do you have a specific thing you want to achieve, or is it just curiosity? – adranale Jun 12 '12 at 14:05

score 4 · Answer 1 · answered Jun 14 '12 at 05:21

The output from the map function is stored in temporary intermediate files. These files are handled transparently by Hadoop, so in a normal scenario the programmer doesn't have access to them. If you're curious about what's happening inside each mapper, you can review the logs for the respective job, where you'll find a log file for each map task.
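To make the "temporary intermediate files" idea concrete, here is a toy model in plain Java of a collector that buffers emitted pairs in memory and writes a sorted spill file to a temp directory whenever the buffer fills up. This is only a sketch of the general idea, not Hadoop's actual collector: the class name ToySpillingCollector, the pair-count threshold, and the file naming are all made up for illustration, and there is no partitioning, merging, or compression here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only, NOT Hadoop's collector: buffers emitted pairs in memory and
// writes a sorted "spill" file to a temp directory when the buffer fills.
class ToySpillingCollector {
    private final int maxBufferedPairs;            // hypothetical threshold
    private final Path spillDir;                   // temp dir chosen by the JVM
    private final List<String> buffer = new ArrayList<>();
    private final List<Path> spills = new ArrayList<>();

    ToySpillingCollector(int maxBufferedPairs) {
        this.maxBufferedPairs = maxBufferedPairs;
        try {
            this.spillDir = Files.createTempDirectory("toy-map-output");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Plays the role of output.collect(key, value) in the map function.
    void collect(String key, int value) {
        buffer.add(key + "\t" + value);
        if (buffer.size() >= maxBufferedPairs) {
            spill();
        }
    }

    // Flushes whatever is still buffered when the "map task" finishes.
    void close() {
        if (!buffer.isEmpty()) {
            spill();
        }
    }

    List<Path> spillFiles() {
        return Collections.unmodifiableList(spills);
    }

    List<String> readSpill(int index) {
        try {
            return Files.readAllLines(spills.get(index), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sorts the buffered lines and writes them to a new spill file.
    private void spill() {
        Collections.sort(buffer);
        Path file = spillDir.resolve("spill" + spills.size() + ".out");
        try {
            Files.write(file, buffer, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spills.add(file);
        buffer.clear();
    }

    public static void main(String[] args) {
        ToySpillingCollector out = new ToySpillingCollector(3);
        for (String w : "the quick brown fox jumps over the lazy dog".split(" ")) {
            out.collect(w, 1);
        }
        out.close();
        for (Path p : out.spillFiles()) {
            System.out.println(p + " -> " + Files.exists(p));
        }
    }
}
```

The point of the sketch is that the caller of collect() never names or opens a file; the collector decides when and where to write, which is why the files are normally invisible to the programmer.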

If you want to control where the temporary files are generated, and have access to them, you have to create your own OutputCollector class, and I don't know how easy that is.

If you want to have a look at the source code, you can use svn to get it. I think it is available here: http://hadoop.apache.org/common/version_control.html.

score 2 · Answer 2 · answered Jun 12 '12 at 12:55

I believe they are stored in temporary locations and are not available to the developer, unless you create your own class that implements OutputCollector.

I once had to access those files and solved the problem by creating side-effect files: http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+Side-Effect+Files
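For the "create your own class that implements OutputCollector" route, a minimal sketch might look like the following. To keep it compilable without Hadoop on the classpath, it declares a local stand-in interface with the same single method as org.apache.hadoop.mapred.OutputCollector, and uses String/Integer instead of Text/IntWritable; RecordingCollector, RecordingCollectorDemo and mapLine are hypothetical names. A collector like this is mainly useful when you call a map function yourself (for example in a test) to see what it emits; it does not by itself change where a running job stores its intermediate data.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Local stand-in with the same shape as org.apache.hadoop.mapred.OutputCollector,
// declared here only so the sketch compiles without Hadoop.
interface OutputCollector<K, V> {
    void collect(K key, V value) throws IOException;
}

// Hypothetical collector that keeps every emitted pair in memory for inspection.
class RecordingCollector<K, V> implements OutputCollector<K, V> {
    private final List<String> pairs = new ArrayList<>();

    @Override
    public void collect(K key, V value) {
        // Store string copies: mappers like the WordCount one reuse the same
        // key object, so holding a reference to it would record stale data.
        pairs.add(key + "=" + value);
    }

    List<String> pairs() {
        return pairs;
    }
}

class RecordingCollectorDemo {
    // Hypothetical helper mirroring the tokenizing loop of the WordCount mapper.
    static void mapLine(String line, OutputCollector<String, Integer> output) throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            output.collect(tokenizer.nextToken(), 1);
        }
    }

    public static void main(String[] args) throws IOException {
        RecordingCollector<String, Integer> out = new RecordingCollector<>();
        mapLine("to be or not to be", out);
        System.out.println(out.pairs());
    }
}
```

With the real Hadoop types the same idea applies, but the collector would need to copy each key and value (for instance via new Text(key)) before storing them, because the mapper reuses its Text instance.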

Ulises

Does anyone have the code for OutputCollector's .collect() function? – catty Jun 13 '12 at 04:47

score 0 · Answer 3 · answered Sep 25 '13 at 10:31

From the Mapper Javadoc: the intermediate, grouped outputs are always stored in SequenceFiles. Applications can specify if and how the intermediate outputs are to be compressed, and which CompressionCodecs are to be used, via the JobConf.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/Mapper.html
