I run a map-only job on Hadoop in order to sort by the key values, because it is said that "Hadoop automatically sorts data emitted by mappers before being sent to reducers".

input file

2013-04-15      835352
2013-04-16      846299
2013-04-17      828286
2013-04-18      747767
2013-04-19      807924

I think mapping (second_column, first_column) should sort this file as shown in output1. That is actually what happens when I run the job on my local machine, but when I run it on a cluster, the output looks like output2.

output1 file

747767  2013-04-18
807924  2013-04-19
828286  2013-04-17
835352  2013-04-15
846299  2013-04-16

output2 file

835352  2013-04-15
747767  2013-04-18
807924  2013-04-19
828286  2013-04-17
846299  2013-04-16

How can I guarantee that the output always looks like output1? I am also open to other suggestions for sorting by the second column.

Mapper

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapAccessTime1 extends Mapper<LongWritable, Text, IntWritable, Text> {

    private IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        int val = 0;
        StringTokenizer tokenizer = new StringTokenizer(line);
        if (!line.startsWith("#")) {             // skip comment lines
            if (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken()); // first column: the date
            }
            if (tokenizer.hasMoreTokens()) {
                val = Integer.parseInt(tokenizer.nextToken());
                one = new IntWritable(val);      // second column becomes the key
                context.write(one, word);        // emit (count, date)
            }
        }
    }
}

1 Answer

A map-only job doesn't do the shuffle and sort phase. Using an identity reducer solves my problem.
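
For reference, a minimal driver sketch along those lines: Hadoop's base org.apache.hadoop.mapreduce.Reducer class behaves as an identity reducer, so wiring it in (instead of running with zero reduce tasks) forces the shuffle/sort on the mapper's IntWritable keys. The driver class name SortByCountJob and the single-reducer setting are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortByCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sort by second column");
        job.setJarByClass(SortByCountJob.class);

        job.setMapperClass(MapAccessTime1.class);
        // The base Reducer class is an identity reducer: it re-emits its
        // input unchanged, but its presence forces the shuffle/sort phase.
        job.setReducerClass(Reducer.class);
        // A single reducer produces one globally sorted output file.
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that IntWritable keys sort numerically. If you use more than one reducer, each reducer's output file is sorted only within itself; keeping a global order across files would require something like a TotalOrderPartitioner.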
