
I am dealing with input log files in Hadoop where the keys are not evenly distributed. This means that the reducers receive an uneven distribution of values. For example, key1 has 1 value and key2 has 1000 values.

Is there any way to load balance the values associated with the same key? [I also do not want to modify my key.]

udag
    Can you describe your job from an algorithm perspective - what are you trying to do with your keys once they make it into the reducer? For example, is it a sum / min / max / avg calculation or similar - can part of this calculation be migrated to a combiner to reduce the flow of data between the mappers and reducers for the skewed keys? – Chris White Jul 26 '13 at 10:19

2 Answers


If you know which keys are going to have an unusually large number of values, you could use the following trick.

You could implement a custom Partitioner which would ensure that each of your skewed keys goes to a single partition, and then everything else would get distributed to the remaining partitions by their hashCode (which is what the default HashPartitioner does).

You can create a custom Partitioner by implementing this interface:

public interface Partitioner<K, V> extends JobConfigurable {
  int getPartition(K key, V value, int numPartitions);
}

And then you can tell Hadoop to use your Partitioner with:

conf.setPartitionerClass(CustomPartitioner.class);
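
For illustration, a minimal sketch of such a partitioner (using the old org.apache.hadoop.mapred API that the interface above belongs to) could look like the following. The literal "key2" is a hypothetical placeholder for a key you already know is skewed, and reserving the last partition for it is just one possible choice:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Route one known skewed key to a dedicated partition and hash everything
// else over the remaining partitions (similar to the default HashPartitioner).
public class CustomPartitioner implements Partitioner<Text, Text> {
    private static final String SKEWED_KEY = "key2"; // hypothetical skewed key

    @Override
    public void configure(JobConf job) {
        // nothing to configure in this sketch
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        if (SKEWED_KEY.equals(key.toString())) {
            // reserve the last partition for the skewed key
            return numPartitions - 1;
        }
        // spread all other keys over the remaining partitions
        return (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}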
Charles Menguy
  • Thank you very much @charles. Unfortunately, I do not know in advance which keys will have a large number of values. Also, in your solution, this approach will cause a particular reducer [the one which receives 1000 values] to process a large amount of data. The reason I am concerned is that for each value belonging to a specific key I do a lot of calculation [you could say that some keys will have 75000 values, and I iterate through the values in the reducer doing calculations which take 2 minutes each] – udag Jul 26 '13 at 04:23

Perhaps you could use a combiner before hitting the reducers? This is fairly speculative...

The idea is to split each key's group of values into chunks of a preset maximum size, and then emit these chunked k/v pairs to the reducer under derived keys. This code assumes you've set that size in your configuration somewhere.

// Needed imports (at the top of the enclosing file):
// import java.io.IOException;
// import java.util.ArrayList;
// import java.util.List;
// import org.apache.hadoop.io.Text;
// import org.apache.hadoop.mapreduce.Reducer;

public static class myCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // read the maximum chunk size once instead of on every value
        int maxSize = Integer.parseInt(context.getConfiguration().get("yourMaxSize"));

        List<Text> textList = new ArrayList<Text>();
        int part = 0;

        for (Text value : values) {
            // Hadoop reuses the Text instance, so copy it before buffering
            textList.add(new Text(value));

            if (textList.size() >= maxSize) {
                // essentially partitioning each key: flush this chunk under "key_<part>"
                for (Text t : textList) {
                    context.write(new Text(key.toString() + "_" + part), t);
                }
                textList.clear();
                part += 1; // advance the chunk counter only when a chunk is flushed
            }
        }
        // output any stragglers
        for (Text t : textList) {
            context.write(new Text(key.toString() + "_" + part), t);
        }
    }
}
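
To wire this in, the driver would set the chunk size and register the combiner roughly as below. The property name "yourMaxSize" matches the snippet above; the value 500 and the job name are arbitrary examples, and the rest of the job setup is assumed:

// Assumed driver wiring (new org.apache.hadoop.mapreduce API)
Configuration conf = new Configuration();
conf.set("yourMaxSize", "500");          // hypothetical chunk size
Job job = Job.getInstance(conf, "skewed key split");
job.setCombinerClass(myCombiner.class);  // the combiner shown above
// ... set mapper, reducer, input/output paths as usual ...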
jroot