9

I want to know whether the OutputCollector instance 'output' used in the map function, as in output.collect(key, value), stores the key/value pairs somewhere. Even if it emits them to the reduce function, there must be an intermediate file, right? What are those files? Are they visible to, and chosen by, the programmer? Are the OutputKeyClass and OutputValueClass that we specify in the main function (Text.class and IntWritable.class) these places of storage?

I'm giving the standard code for the Word Count example in MapReduce, which can be found in many places on the net.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Emits (word, 1) for every token of every input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Sums the counts collected for each word; also used as the combiner.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the keys and values the job emits.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] is the input path, args[1] the output path.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Chaos
catty

3 Answers

4

The output of the map function is stored in temporary intermediate files on the local disk of the node running the map task, not in HDFS. Hadoop manages these files transparently, so under normal circumstances the programmer has no access to them. If you're curious about what happens inside each mapper, you can review the logs for the job, which include a log file for each map task.
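To make the path from output.collect to the reducer more concrete, here is a toy model in plain Java. It is only a sketch, not Hadoop's implementation: ToyShuffle and ToyCollector are hypothetical names, and the in-memory TreeMap stands in for the sorted temporary files that Hadoop manages for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Toy model only: none of these classes are part of Hadoop.
final class ToyShuffle {

    // Stand-in for the storage behind output.collect(key, value).
    // Hadoop keeps this data in temporary files; here it is a sorted map in memory.
    static final class ToyCollector {
        private final TreeMap<String, List<Integer>> grouped = new TreeMap<>();

        void collect(String key, int value) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }

        Map<String, List<Integer>> grouped() {
            return grouped;
        }
    }

    // Same logic as the WordCount mapper: emit (word, 1) per token.
    static void map(String line, ToyCollector output) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            output.collect(tokenizer.nextToken(), 1);
        }
    }

    // Same logic as the WordCount reducer: sum the values of one key.
    static int reduce(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Runs map over every line, then reduce over every sorted, grouped key.
    static Map<String, Integer> run(List<String> lines) {
        ToyCollector collector = new ToyCollector();
        for (String line : lines) {
            map(line, collector);
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : collector.grouped().entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue().iterator()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

The point of the model is that the mapper never chooses where its pairs go: it only calls collect, and whatever sits behind the collector decides how the pairs are stored, sorted, and grouped before reduce sees them.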

If you want to control where the temporary files are generated and have access to them, you would have to create your own OutputCollector class, and I don't know how easy that is.

If you want to have a look at the source code, you can use svn to get it. I think it is available here: http://hadoop.apache.org/common/version_control.html.

Chaos
2

I believe they are stored in temporary locations and are not available to the developer, unless you create your own class that implements OutputCollector.

I once had to access those files and solved the problem by creating side-effect files: http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+Side-Effect+Files
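As I understand the linked tutorial, a side-effect file is written into a per-attempt work directory under the job output directory (in a real job you would get it from FileOutputFormat.getWorkOutputPath) and is promoted to the output directory only if the task attempt succeeds. The sketch below models that write-then-promote pattern with plain java.nio on the local file system. ToySideEffectFiles, runTask, and the attempt and file names are hypothetical, and nothing here calls Hadoop.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Toy model only: mimics the directory layout described in the tutorial.
final class ToySideEffectFiles {

    // Writes a side file into a per-attempt work directory, then promotes it
    // into the job output directory, as would happen after a successful attempt.
    static Path runTask(Path outputDir, String attemptId, String fileName, String contents)
            throws IOException {
        Path workDir = outputDir.resolve("_temporary").resolve("_" + attemptId);
        Files.createDirectories(workDir);

        // The task only ever writes inside its own work directory.
        Path sideFile = workDir.resolve(fileName);
        Files.write(sideFile, contents.getBytes(StandardCharsets.UTF_8));

        // "Commit": move the file up into the output directory and clean up.
        Path promoted = outputDir.resolve(fileName);
        Files.move(sideFile, promoted, StandardCopyOption.REPLACE_EXISTING);
        Files.delete(workDir);
        return promoted;
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("toy-job-output");
        Path promoted = runTask(outputDir, "attempt_0001_m_000000_0",
                "map-pairs.txt", "hello\t1\nworld\t1\n");
        System.out.print(new String(Files.readAllBytes(promoted), StandardCharsets.UTF_8));
    }
}
```

Because each attempt writes only under its own directory, two attempts of the same task cannot overwrite each other's files, which is the reason the tutorial gives for using the work directory rather than the output directory directly.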

Ulises
0

The intermediate, grouped outputs are always stored in SequenceFiles. Applications can specify if and how the intermediate outputs are to be compressed and which CompressionCodecs are to be used via the JobConf.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/Mapper.html
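As a configuration sketch of the compression settings mentioned above, using the old org.apache.hadoop.mapred API from the question: the class name CompressedMapOutputConfig is hypothetical, and GzipCodec is only one possible codec, chosen here as an assumption. It needs the Hadoop jars on the classpath to compile.

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

// Configuration fragment: turns on compression of the intermediate map output.
final class CompressedMapOutputConfig {

    static JobConf configure(JobConf conf) {
        // Compress the map output before it is handed to the reducers.
        conf.setCompressMapOutput(true);
        // Codec choice is an assumption; any CompressionCodec available on the cluster can be used.
        conf.setMapOutputCompressorClass(GzipCodec.class);
        return conf;
    }
}
```

In the WordCount main method this would be called on conf before JobClient.runJob(conf). It changes how the intermediate data is encoded, not where it is stored.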