
I have a file containing a String, then a space and then a number on every line.

Example:

Line1: Word 2
Line2: Word1 8
Line3: Word2 1

I need to sort the numbers in descending order and then write the result to a file, assigning a rank to each number. So my output should be a file in the following format:

Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3

Does anyone have an idea how I can do this in Hadoop? I am using Java with Hadoop.

Deepika Sethi

3 Answers


You could organize your map/reduce computation like this:

Map input: default

Map output: "key: number, value: word"

_ sorting phase by key _

Here you will need to override the default sorter to sort in decreasing order.

Reduce - 1 reducer

Reduce input: "key: number, value: word"

Reduce output: "key: word, value: (number, rank)"

Keep a global counter. For each key-value pair add the rank by incrementing the counter.
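
The plan above can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: it parses the lines, sorts by the number in decreasing order (standing in for the shuffle with a reversed comparator), and numbers the records with a counter the way a single reducer would. The names `RankSketch`, `Entry` and `rank` are made up for illustration, and it assumes every non-blank line ends in a space followed by an integer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RankSketch {

    /** One "word number" record, standing in for a map-output pair. */
    static final class Entry {
        final String word;
        final int number;

        Entry(String word, int number) {
            this.word = word;
            this.number = number;
        }
    }

    /** Returns "word number rank" lines, highest number first. */
    static List<String> rank(List<String> lines) {
        // "Map" step: split each line into (word, number).
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            int split = trimmed.lastIndexOf(' ');
            entries.add(new Entry(trimmed.substring(0, split),
                    Integer.parseInt(trimmed.substring(split + 1))));
        }

        // "Sort" step: decreasing order by number (stable, so ties keep input order).
        entries.sort((a, b) -> Integer.compare(b.number, a.number));

        // "Reduce" step: a single counter hands out ranks 1, 2, 3, ...
        List<String> out = new ArrayList<>();
        int rank = 0;
        for (Entry e : entries) {
            rank++;
            out.add(e.word + " " + e.number + " " + rank);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : rank(Arrays.asList("Word 2", "Word1 8", "Word2 1"))) {
            System.out.println(s);
        }
    }
}
```

Because there is only one counter, this mirrors the single-reducer setup; equal numbers receive consecutive ranks rather than a shared one.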

Edit: Here is a code snippet of a custom descending sorter:

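import java.nio.ByteBuffer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

// Sorts IntWritable keys in descending order by comparing their serialized bytes.
// Assumes the map output key class is IntWritable (4 bytes per key).
public static class IntComparator extends WritableComparator {

    public IntComparator() {
        super(IntWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
            byte[] b2, int s2, int l2) {

        // Decode the two serialized int keys.
        Integer v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        Integer v2 = ByteBuffer.wrap(b2, s2, l2).getInt();

        // Negate the natural ordering so larger keys come first.
        return v1.compareTo(v2) * (-1);
    }
}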

Don't forget to actually set it as the comparator for your job:

job.setSortComparatorClass(IntComparator.class);
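
To see what the raw-byte comparison does, here is a small stand-alone sketch that uses only `java.nio.ByteBuffer` and no Hadoop classes. It assumes the key is serialized as 4 big-endian bytes, which is what `ByteBuffer.getInt()` decodes by default; `RawIntCompareSketch`, `encode` and `compareDescending` are illustrative names, not part of any library. It also shows that a key with fewer than 4 bytes makes `getInt()` throw `BufferUnderflowException`.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

final class RawIntCompareSketch {

    /** Encodes an int as 4 big-endian bytes. */
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Descending comparison on raw bytes: the larger int sorts first. */
    static int compareDescending(byte[] b1, int s1, int l1,
                                 byte[] b2, int s2, int l2) {
        int v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(v2, v1);
    }

    public static void main(String[] args) {
        byte[] eight = encode(8);
        byte[] two = encode(2);

        // Negative result: 8 is ordered before 2.
        System.out.println(compareDescending(eight, 0, 4, two, 0, 4));

        // A 1-byte "key" cannot be decoded as an int.
        try {
            compareDescending(new byte[] {1}, 0, 1, two, 0, 4);
        } catch (BufferUnderflowException e) {
            System.out.println("too few bytes for an int key");
        }
    }
}
```

If the job's map output key is not a 4-byte int, a comparator of this shape either fails or compares the wrong bytes, so it is worth checking the map output key class first.
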
Tudor
  • Thanks for the reply. Do you have any references for overriding the sorter to sort in decreasing order? Thanks – Deepika Sethi Nov 27 '11 at 23:03
  • Using one reducer is not practical for big data. The input keys have to be split into ranges and a custom partitioner used. See [Yahoo TeraSort PDF](http://sortbenchmark.org/YahooHadoop.pdf) for details. Code is in the [org.apache.hadoop.examples.terasort](http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/examples/terasort/package-frame.html) package. The keys sent to the reducers are already sorted. Use Job.setSortComparatorClass or, if using Writables, override WritableComparable#compareTo for custom sorting. – Praveen Sripati Nov 28 '11 at 01:11
  • @Tudor: I tried to use the above code which you gave me in my program but it gives me the following exception: java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:497) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.myorg.RankAssign$IntComparator.compare(WordCount.java:83) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:942) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:942) at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:30) Do you have any idea on this? – Deepika Sethi Jan 11 '12 at 04:49
  • @Deepika Sethi: Very strange. That exception is thrown if there are fewer than 4 bytes in the byte[] array b1 or b2, so that it cannot decode an integer. Are you sure your map phase output is correct? – Tudor Jan 11 '12 at 08:41
  • @Tudor : I removed the blank lines in the input text file and the error vanished. Thanks for giving me a nice explanation. But now the issue is that the file is not sorted in descending order if I use the above code. It gives me output like : 1 , The Man in Question,non-vandalized \n 1 , Carefree Highway,non-vandalized \n 1 , Madhurgairola001,non-vandalized \n 1 , Creative cooking,non-vandalized \n 2 , 220.237.242.105,non-vandalized \n 1 , 220.236.170.109,non-vandalized \n 1 , 213.100.228.201,non-vandalized \n I have put \n for depicting the next line. Thanks – Deepika Sethi Jan 11 '12 at 19:03
  • O/p continued ::::: 1 , 122.107.145.182,non-vandalized \n 1 , 203.192.200.125,non-vandalized \n 1 , Shattered Gnome,non-vandalized \n 1 , FrenchIsAwesome,non-vandalized \n 1 , 122.162.122.119,non-vandalized \n 3 , 124.123.24.242,non-vandalized \n 2 , Themightyquill,non-vandalized \n 2 , 206.248.170.80,non-vandalized \n 2 , 119.224.46.126,non-vandalized \n 2 , 203.173.160.17,non-vandalized \n 2 , 68.237.226.229,non-vandalized \n – Deepika Sethi Jan 11 '12 at 19:08
  • @Deepika Sethi: Can you please post this output in the question with formatting? It's hard to read it like this. – Tudor Jan 11 '12 at 19:52
  • @Tudor : I have put my output in the question above. I could not find any other way I can nicely format it :P – Deepika Sethi Jan 12 '12 at 02:01
  • @Deepika Sethi: Hmmm, can you tell me how you have implemented the reducer? – Tudor Jan 12 '12 at 09:59
  • @Tudor : I haven't implemented it yet but for the time being I kept it as : public void reduce(Text key, Iterator values, Context context) throws IOException,InterruptedException { while (values.hasNext()) { counter = values.next().get(); counter++; } context.write(key, new IntWritable(counter)); – Deepika Sethi Jan 12 '12 at 16:44
  • @Deepika Sethi: Ok so that output is only from the map phase? Or are you using just an identity reducer? – Tudor Jan 12 '12 at 16:47
  • @Tudor: The above output is what I got in reduce phase as keys in the output file and I got all 1 as values in the output file(which is not shown). I have to implement the reducer after I finish with the sorting. Once my key is sorted in descending order I will assign ranks starting from 1 as values in the reducer. But right now I am trying to figure out how to sort my key in descending order. – Deepika Sethi Jan 12 '12 at 17:42
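
On the point raised in the comments that a single reducer does not scale: the range-splitting idea can be sketched in plain Java as below. This is only an illustration of the principle, not the TeraSort code. The split points (`lowerBounds`) are hypothetical fixed values here, whereas a real job would have to choose them, for example by sampling the keys, and would need some way to learn the sizes of the earlier partitions before it could add the rank offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class RangeRankSketch {

    /**
     * Picks a partition so that partition 0 holds the largest keys.
     * lowerBounds must be in decreasing order (hypothetical split points).
     */
    static int partition(int key, int[] lowerBounds) {
        for (int p = 0; p < lowerBounds.length; p++) {
            if (key >= lowerBounds[p]) {
                return p;
            }
        }
        return lowerBounds.length;
    }

    /** Returns "key rank" lines in global decreasing order. */
    static List<String> rank(List<Integer> keys, int[] lowerBounds) {
        // Route every key to its range, as a custom partitioner would.
        List<List<Integer>> parts = new ArrayList<>();
        for (int p = 0; p <= lowerBounds.length; p++) {
            parts.add(new ArrayList<>());
        }
        for (int key : keys) {
            parts.get(partition(key, lowerBounds)).add(key);
        }

        // Each range is sorted on its own; the global rank is the local
        // position plus the number of keys in all earlier ranges.
        List<String> ranked = new ArrayList<>();
        int offset = 0;
        for (List<Integer> part : parts) {
            part.sort(Collections.reverseOrder());
            for (int i = 0; i < part.size(); i++) {
                ranked.add(part.get(i) + " " + (offset + i + 1));
            }
            offset += part.size();
        }
        return ranked;
    }

    public static void main(String[] args) {
        // One split point at 5: keys >= 5 go to range 0, the rest to range 1.
        for (String s : rank(Arrays.asList(2, 8, 1, 5), new int[] {5})) {
            System.out.println(s);
        }
    }
}
```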

Hadoop Streaming - Hadoop 1.0.x

According to this, after the

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.*.jar

part of the command:

  1. you add a comparator

    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

  2. you specify the kind of sorting you want

    -D mapred.text.key.comparator.options=-[ options]

where the [ options] are similar to those of Unix sort. Here are some examples:

Reverse order

-D mapred.text.key.comparator.options=-r

Sort on numeric values

-D mapred.text.key.comparator.options=-n

Sort on value or whatever field

-D mapred.text.key.comparator.options=-kx,y

With the -k flag you specify the sort key. The x and y parameters define this key. So, if a line has more than one token, you can choose which token, or which combination of tokens, is used as the sort key. See the references for more details and examples.
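
For intuition about what a key-field option does, here is a plain-Java emulation of sorting on one field, numerically and in reverse, which is roughly what an option string along the lines of `-k2,2nr` would ask for (that exact string is an assumption by analogy with Unix sort; check the streaming documentation for your version). This is not the `KeyFieldBasedComparator` itself: `KeyFieldSortSketch` and its methods are illustrative names, and fields are split on whitespace here for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeyFieldSortSketch {

    /** Parses the 1-based field of a whitespace-separated line as a number. */
    static long fieldValue(String line, int field) {
        return Long.parseLong(line.trim().split("\\s+")[field - 1]);
    }

    /** Sorts lines on the given field, numerically, largest first. */
    static List<String> sortByFieldNumericReverse(List<String> lines, int field) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> Long.compare(fieldValue(b, field), fieldValue(a, field)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Word 2", "Word1 8", "Word2 1");
        for (String s : sortByFieldNumericReverse(lines, 2)) {
            System.out.println(s);
        }
    }
}
```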

vpap

I worked out a solution to this problem. It was actually simple.

To sort by value, use

setOutputValueGroupingComparator(Class)

To sort in decreasing order, use setSortComparatorClass(LongWritable.DecreasingComparator.class);

For ranking, use the Counter class, together with getCounter and increment.

oers
Deepika Sethi