hadoop streaming getting optimal number of slots

Question

I have a streaming MapReduce job and about 30 slots available for processing. Initially I get a single input file containing 60 tab-separated records. The first field of every record is a number: 1 for the first record, 2 for the second, and so on. For the next processing step I want to create 30 files from these records, each containing 2 records (an even distribution).

To achieve this I set the number of reducers for the Hadoop job to 30. I expected the first field to be used as the key, giving me 30 output files with 2 records each.

I do get 30 output files, but they do not all contain the same number of records. Some files are even empty (zero size). Any idea why?

You have to write your own partitioner; the HashPartitioner does not guarantee a perfect distribution over all tasks. — Thomas Jungblut, May 25 '12 at 09:23

score 0 · Answer 1 · answered May 29 '12 at 06:56

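By default, Hadoop shuffles and merges the map task outputs to form the reducer input, so map outputs with the same key are always sent to the same reducer. Which reducer a key goes to is decided by the partitioner, and several keys can land on the same one. As a result, some reducers may receive no input at all, so an output file such as part-00005 will be 0 KB in size.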

Black_Rider

score 0 · Answer 2 · answered May 30 '12 at 01:45

What's your output key type? If you're using Text rather than IntWritable (which I assume you must be, as you're using streaming), then the reducer number is calculated from the hash of the bytes of the key's UTF-8 string representation, not from its numeric value. You can write a simple unit test to observe this in action:

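import org.apache.hadoop.io.Text;
import org.junit.Test;

public class TextHashTest {
    @Test
    public void testHash() {
        int partitions = 30;
        for (int x = 0; x < 100; x++) {
            // Text hashes the UTF-8 bytes of the string, not the numeric value
            int hash = new Text(String.valueOf(x)).hashCode();
            // Mask the sign bit as HashPartitioner does, so the result is never negative
            int part = (hash & Integer.MAX_VALUE) % partitions;
            System.err.printf("%d = %d => %d%n", x, hash, part);
        }
    }
}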

I won't paste the output, but of the 100 values, partition bins 0-7 never receive any value.
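To see the same effect for the question's keys 1 to 60 without Hadoop on the classpath, here is a minimal stand-alone sketch. The class and method names are made up for illustration. `textHash` is an assumption about how Hadoop hashes the bytes of a Text key (start at 1, multiply by 31 and add each byte), so treat the numbers as indicative rather than authoritative.

```java
import java.nio.charset.StandardCharsets;

// Stand-alone simulation; hypothetical helper class, no Hadoop jars needed.
final class TextHashSpread {
    // ASSUMPTION: mirrors Hadoop's byte hash for Text keys
    // (hash starts at 1, then hash = 31 * hash + byte for each UTF-8 byte).
    static int textHash(String key) {
        int hash = 1;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        return hash;
    }

    // Same sign-bit masking that HashPartitioner applies before the modulo.
    static int partition(String key, int numPartitions) {
        return (textHash(key) & Integer.MAX_VALUE) % numPartitions;
    }

    // Number of records each partition receives for keys "1".."maxKey".
    static int[] countsPerPartition(int maxKey, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int k = 1; k <= maxKey; k++) {
            counts[partition(String.valueOf(k), numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countsPerPartition(60, 30);
        for (int p = 0; p < counts.length; p++) {
            System.out.printf("part-%05d: %d record(s)%n", p, counts[p]);
        }
    }
}
```

If the hash assumption holds, the 60 keys crowd into a contiguous band of partitions and several part files stay empty, which matches what the question describes.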

So, as Thomas Jungblut says in his comment, you'll need to write a custom partitioner that translates the Text value back into an integer and then takes that number modulo the total number of partitions. This may still not give you an 'even' distribution if the values themselves are not in a 1-up sequence (you say they are, so you should be OK):

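import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class IntTextPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}

    public int getPartition(Text key, Text value, int numPartitions) {
        // Parse the key as a number; assumes non-negative integer keys
        return Integer.parseInt(key.toString().trim()) % numPartitions;
    }
}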
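As a quick sanity check of the modulo approach, the sketch below applies the same arithmetic to keys 1 to 60 with 30 partitions, outside Hadoop. `ModuloSpread` and its methods are hypothetical names used only for this illustration, and it assumes the keys are exactly the integers 1 to 60 with no gaps.

```java
// Stand-alone check of the modulo partitioning arithmetic; hypothetical helper class.
final class ModuloSpread {
    // Same calculation as the custom partitioner: parse the key, then modulo.
    // ASSUMPTION: keys are non-negative integers written as decimal strings.
    static int partition(String key, int numPartitions) {
        return Integer.parseInt(key.trim()) % numPartitions;
    }

    // Number of records each partition receives for keys "1".."maxKey".
    static int[] countsPerPartition(int maxKey, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int k = 1; k <= maxKey; k++) {
            counts[partition(String.valueOf(k), numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countsPerPartition(60, 30);
        for (int p = 0; p < counts.length; p++) {
            System.out.printf("part-%05d: %d record(s)%n", p, counts[p]);
        }
    }
}
```

With a gap-free sequence, every partition receives two keys (for example, 30 and 60 both map to partition 0). For a streaming job, the partitioner class would be named with the streaming `-partitioner` option and its jar made available to the job; the exact jar name and paths depend on your setup.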
