
My question is about MapReduce programming in Java.

Suppose I have the WordCount.java example, a standard MapReduce program. I want the map function to collect some information and return to the reduce function maps formed like: <slaveNode_id, some_info_collected>,

so that I can know what slave node collected what data. Any idea how??

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
}

Thank you!!


2 Answers


What you are asking is to let the application (your map-reduce job) know about the infrastructure it runs on.

In general, the answer is that your application doesn't need this information. Each call to the Mapper and each call to the Reducer can be executed on a different node, or all on the same node. The beauty of MapReduce is that the outcome is the same either way, so for your application it doesn't matter.

As a consequence, the API doesn't have features to support this request of yours.

Have fun learning Hadoop :)


P.S. The only way I can think of (which is nasty to say the least) is to include a system call of some sort in the Mapper and ask the underlying OS about its name/properties/etc. This kind of construct would make your application very non-portable; i.e. it won't run on Hadoop on Windows or on Amazon.
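A minimal sketch of that nasty approach (my own illustration, not part of any Hadoop API): just ask the JVM/OS for the local hostname and use it as a stand-in for the slave node id. The class name `NodeId` and the `"unknown-node"` fallback are assumptions for the example; this is exactly the environment-dependence that makes it non-portable.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class NodeId {
    // Sketch of the "nasty" approach: ask the OS which machine we are on.
    // Tying the job to its environment is what makes this non-portable.
    public static String localNodeId() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-node"; // fallback when the hostname cannot be resolved
        }
    }

    public static void main(String[] args) {
        System.out.println(localNodeId());
    }
}
```

A Mapper could call `NodeId.localNodeId()` and emit the result as (part of) its output key, but as the answer says: don't rely on this.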

  • Not exactly, he has the information about the slaves in his data: . Wordcount reverses this to , he wants it the other way round, to get all the information a slaveNode collected. – Thomas Jungblut May 29 '11 at 07:48
  • There is no way of knowing the id of the slavenode from within the MR application. – Niels Basjes May 30 '11 at 11:13
  • Can I use the slave_node_id in the MR program without printing something? The point is that the data will be categorized by which node took what. But it is not necessary for the user to see the node_id. – pr_prog_84 May 30 '11 at 13:42
  • How do you see using something (like a slave_node_id) if you don't have it at all?!?!? – Niels Basjes May 30 '11 at 14:58

Wordcount is the wrong example for you. You simply want to merge all the information together, which inverts what wordcount does.

Basically you just emit your slaveNode_id as an IntWritable (if that is possible) and the information as Text.

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        // you have to split your data here: ID and value
        IntWritable id = new IntWritable(YOUR_ID_HERE);

        output.collect(id, word);
      }
    }
  }

And the reducer would go the same way:

  public static class Reduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
      // now you have all the values for a slaveID as key. Do whatever you like with that...
      // note: Iterator is not Iterable, so use hasNext()/next() instead of a for-each loop
      while (values.hasNext()) {
        output.collect(key, values.next());
      }
    }
  }
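One thing worth noting: since this answer swaps the key/value types from the question's Text/IntWritable to IntWritable/Text, the driver in main() has to declare the new types too, or the job will fail at runtime with a type mismatch. A sketch of the adjusted driver fragment (old mapred API, reusing the question's class and argument names; the job name "slave-info" is made up):

```java
// Driver fragment adjusted for the inverted key/value types above.
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("slave-info");

// output types are now IntWritable -> Text, not Text -> IntWritable
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);

conf.setMapperClass(Map.class);
// no combiner here: unlike wordcount's summing reducer, this reduce
// just passes values through, so reusing it as a combiner buys nothing
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
```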
  • Interesting.. But my question remains: how can I get the slave_node_id from within the program?? – pr_prog_84 May 29 '11 at 07:58
  • the map will give the reducer maps where info is a type class with attributes, some information I got from the internet.. – pr_prog_84 May 29 '11 at 08:03
  • So what is the problem with it? After running my code you'll have slave_node_id as key and all the values associated with it. Then you probably have a sequence file you can iterate over and see your results. You should really head for a tutorial first o_o – Thomas Jungblut May 29 '11 at 08:08
  • :P I've read up on this kind of stuff, and then I asked here.. In the YOUR_ID_HERE, what do you mean? Where can I find this ID? – pr_prog_84 May 29 '11 at 08:15
  • I expect that your data is a textfile and looks like this, right? So "" is what is inside the Text object you get in the mapper. You have to parse your slave_node_id out of this Text object, and then emit this ID as your key, so Hadoop can sort on it and merge all the values with the same key together. Is that what you want? – Thomas Jungblut May 29 '11 at 08:19
  • You're kinda close.. In my input text files there will be some other info so that each node can get information needed from websites and map them up into – pr_prog_84 May 29 '11 at 08:27
  • my question is: is there any command so that I can get the ID of the slave node that is working at the current time?? – pr_prog_84 May 29 '11 at 08:28
  • So you want the Hadoop slave nodes? Or what? :D – Thomas Jungblut May 29 '11 at 08:29
  • Man, just say so ;D You can query the JobTracker for live nodes; this can be done through the shell on your master node. I'll pass the command to you, I just have to start my cluster. – Thomas Jungblut May 29 '11 at 08:33
  • okay, sorry for the break; for datanodes you can use: "bin/hadoop dfsadmin -report" – Thomas Jungblut May 29 '11 at 11:44