
While working on the secondary sort example from Hadoop: The Definitive Guide, I came across code like this:

@Override
public int getPartition(TextpairWritable tp, IntWritable value, int numPartitions) {
    return Math.abs(Integer.parseInt(tp.getyear().toString()) * 127) % numPartitions;
}

I want to understand the meaning of this line:

return Math.abs(Integer.parseInt(tp.getyear().toString()) * 127) % numPartitions;

If I don't set the number of reducers in the driver code, how does Hadoop know the value of this parameter in the line above? And what is the significance of multiplying by 127?

DevHelp

1 Answer


return Math.abs(Integer.parseInt(tp.getyear().toString()) * 127) % numPartitions;

You can think of it as hashing on the value of your key's year attribute. Any number (preferably a prime) can be used as the multiplier; here the chosen value is 127. The last part, numPartitions, defines how many buckets (reducers) the data is divided into.
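As a rough illustration of the hashing described above, here is a minimal plain-Java sketch of the same arithmetic. The class name, the helper partitionFor, and the choice of 3 reducers are assumptions made for the example; they are not part of Hadoop or of the book's code.

```java
public class YearPartitionSketch {

    // Same arithmetic as the getPartition method in the question,
    // with the year passed in as text (hypothetical helper, not a Hadoop API).
    static int partitionFor(String year, int numPartitions) {
        return Math.abs(Integer.parseInt(year) * 127) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3; // assumed number of reducers for this example
        for (String year : new String[] {"1979", "1979", "1980", "1981"}) {
            // Equal years always print the same reducer index.
            System.out.println(year + " -> reducer " + partitionFor(year, numPartitions));
        }
    }
}
```

With 3 partitions, 1979 maps to index 2, 1980 to index 0 and 1981 to index 1, so every record for a given year ends up at the same reducer.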

If I don't set the number of reducers in the driver code, how does Hadoop know the value of this parameter in the line above?

The default number of reduce tasks is 1, and numPartitions is the number of reduce tasks configured for the job. So by default all the data (the output of the mappers) goes to the same reducer task.
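To see why a single reducer receives everything, here is a small plain-Java sketch, assuming the same arithmetic as the partitioner in the question: any int modulo 1 is 0. The class and helper names are hypothetical. In a real driver the reducer count is normally changed with Job.setNumReduceTasks(int), which is not shown here because the sketch avoids Hadoop dependencies.

```java
public class SingleReducerSketch {

    // Hypothetical stand-in for the partitioner arithmetic in the question.
    static int partitionFor(int year, int numPartitions) {
        return Math.abs(year * 127) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 1; // the default number of reduce tasks
        for (int year = 1979; year <= 1983; year++) {
            // Any int modulo 1 is 0, so every year lands in partition 0.
            System.out.println(year + " -> reducer " + partitionFor(year, numPartitions));
        }
    }
}
```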

What is the significance of multiplying by 127?

It is a prime number. We usually multiply by a prime to reduce the effect of skew in the data. A prime has no divisors other than 1 and itself, so it shares no common factor with numPartitions (unless numPartitions is a multiple of 127), which helps distribute the data evenly over the range.
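One way to check the effect of the multiplier is to count how many distinct partitions a run of years can reach. The sketch below is an illustration under assumed values (8 partitions, years 1979 to 1986, and a deliberately poor multiplier of 4 for contrast); the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class MultiplierSpreadSketch {

    // Counts how many different partition indices are produced for the
    // years firstYear..lastYear with the given multiplier (hypothetical helper).
    static int distinctBuckets(int multiplier, int numPartitions, int firstYear, int lastYear) {
        Set<Integer> used = new HashSet<>();
        for (int year = firstYear; year <= lastYear; year++) {
            used.add(Math.abs(year * multiplier) % numPartitions);
        }
        return used.size();
    }

    public static void main(String[] args) {
        // 127 shares no factor with 8, so all 8 partitions are reachable.
        System.out.println("multiplier 127: " + distinctBuckets(127, 8, 1979, 1986));
        // 4 shares a factor with 8, so the years collapse into 2 partitions.
        System.out.println("multiplier 4:   " + distinctBuckets(4, 8, 1979, 1986));
    }
}
```

The first call reaches all 8 partitions and the second only 2, which is the collapse that a multiplier sharing a factor with numPartitions causes.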

YoungHobbit
  • There is a corner case in Java where Math.abs returns a negative number: for example, Math.abs(Integer.MIN_VALUE) % 3 < 0, and a partition number cannot be negative. It is much safer to take the abs after the mod: return Math.abs(Integer.parseInt(tp.getyear().toString()) * 127 % numPartitions); – alexeipab Sep 11 '15 at 14:06
  • You are absolutely right. I was not aware that if the argument to `Math.abs()` is `Integer.MIN_VALUE`, the most negative representable int value, the result is that same value, which is negative. – YoungHobbit Sep 11 '15 at 14:51
  • Are the other points about the custom partitioner clear to you now from the answer? – YoungHobbit Sep 11 '15 at 14:52
  • Let's say my custom key is a combination of year and day_of_week, and the output of the mapper is (1979,1 1) (1979,1 1) (1979,1 1) (1979,2 1) (1979,2 1) (1979,3 1) ... (1980,1 1) and so on. Basically it gives the year 1979, the day Sunday, and a count of 1, repeated a number of times before it is aggregated. What will the output of the partitioner written above be? – DevHelp Sep 11 '15 at 15:59
  • The partitioner logic uses only the year part, so it generates the same number for the same year. This number identifies the reducer your data will be sent to for processing. The number of reducers is given by the numPartitions variable, and partition numbers start from 0. – YoungHobbit Sep 11 '15 at 16:04
  • This way all the data with the same year goes to the same reducer. Put another way, keys that produce the same mod value go to the same reducer. – YoungHobbit Sep 11 '15 at 16:14
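The Math.abs pitfall raised in the comments can be reproduced in plain Java. The sketch below compares three ways of turning an int hash into a partition index; the class and method names are hypothetical, and Math.floorMod is offered as one possible alternative, not as what the book or Hadoop uses.

```java
public class NonNegativePartitionSketch {

    // abs first, then mod: negative when hash is Integer.MIN_VALUE,
    // because Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE.
    static int absThenMod(int hash, int numPartitions) {
        return Math.abs(hash) % numPartitions;
    }

    // mod first, then abs: the remainder is strictly between
    // -numPartitions and numPartitions, so abs is always safe.
    static int modThenAbs(int hash, int numPartitions) {
        return Math.abs(hash % numPartitions);
    }

    // Math.floorMod returns a value in [0, numPartitions) for a positive divisor.
    static int floorMod(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int hash = Integer.MIN_VALUE;
        System.out.println("abs then mod: " + absThenMod(hash, 3)); // negative, invalid partition
        System.out.println("mod then abs: " + modThenAbs(hash, 3));
        System.out.println("floorMod:     " + floorMod(hash, 3));
    }
}
```

The two safe variants can pick different partitions for negative hashes, which is fine as long as one of them is used consistently throughout the job.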