
I am running a MapReduce job which reads the input and sorts it using multiple reducers. I am able to get the output sorted with the number of reducers set to 5. However, the output is written to only 1 file, and there are 4 empty files along with it. I am using an InputSampler and TotalOrderPartitioner for global sorting.

My driver looks as follows:

    int numReduceTasks = 5;
    Configuration conf = new Configuration();
    Job job = new Job(conf, "DictionarySorter");
    job.setJarByClass(SampleEMR.class);
    job.setMapperClass(SortMapper.class);
    job.setReducerClass(SortReducer.class);
    job.setPartitionerClass(TotalOrderPartitioner.class);
    job.setNumReduceTasks(numReduceTasks);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);


    FileInputFormat.setInputPaths(job, input);
    FileOutputFormat.setOutputPath(job, new Path(output
            + ".dictionary.sorted." + getCurrentDateTime()));

    Path inputDir = new Path("/others/partitions");

    Path partitionFile = new Path(inputDir, "partitioning");
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
            partitionFile);

    double pcnt = 1.0;
    int numSamples = numReduceTasks;
    int maxSplits = numReduceTasks - 1;
    if (0 >= maxSplits)
        maxSplits = Integer.MAX_VALUE;

    InputSampler.Sampler<LongWritable, Text> sampler = new InputSampler.RandomSampler<LongWritable, Text>(pcnt,
            numSamples, maxSplits);
    InputSampler.writePartitionFile(job, sampler);
    job.waitForCompletion(true);

1 Answer


Your RandomSampler parameters seem suspicious to me:

  • The first parameter freq is a probability, not a percentage. With pcnt = 1.0 you are sampling 100% of the records.
  • The second parameter numSamples should be bigger. It should be large enough to represent the key distribution of your whole dataset.

Imagine you have the following keys: 4,7,8,9,4,1,2,5,6,3,2,4,7,4,8,1,7,1,8,9,9,9,9

Using freq = 0.3 and numSamples = 10: for the sake of simplicity, let's say 0.3 means one of every 3 keys is sampled. This will collect the following sample: 4,9,2,3,7,1,8,9. This will be sorted into 1,2,3,4,7,8,9,9. This sample has 8 elements, so all of them are kept, because it does not exceed the maximum number of samples numSamples = 10. Out of this sample, the boundaries for your reducers will be something like 2,4,8,9. This means that if a pair has the key "1" it will end up in Reducer #1. A pair with key "2" will end up in Reducer #2. A pair with key "5" will end up in Reducer #3, etc... This would be a good distribution.
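
To make the boundary logic concrete, here is a minimal sketch (my own simplification, not the actual TotalOrderPartitioner source, which uses a binary search or a trie): the partition is the index of the first split point strictly greater than the key, and keys above every split point go to the last reducer.

    public class BoundaryPartitionSketch {

        // Simplified boundary lookup: returns the index of the first split point
        // strictly greater than the key; keys above all split points go to the
        // last partition.
        static int partitionFor(int key, int[] splitPoints) {
            for (int i = 0; i < splitPoints.length; i++) {
                if (key < splitPoints[i]) {
                    return i;
                }
            }
            return splitPoints.length;
        }

        public static void main(String[] args) {
            int[] splitPoints = {2, 4, 8, 9};                  // boundaries from the sample above
            System.out.println(partitionFor(1, splitPoints));  // 0 -> Reducer #1
            System.out.println(partitionFor(2, splitPoints));  // 1 -> Reducer #2
            System.out.println(partitionFor(5, splitPoints));  // 2 -> Reducer #3
            System.out.println(partitionFor(9, splitPoints));  // 4 -> Reducer #5
        }
    }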

Now let's run your values on the same example keys. Your freq = 1, so you take every key into the sample. So your sample would be the same as the initial key set, except that you set a maximum number of samples numSamples = 5, which means you only keep 5 elements in your sample. Your final sample is likely to be 9,9,9,9,9. In this case all your boundaries are the same, so all pairs always go to the same reducer.

In my example it looks like we were very unlucky that the sample ends up being all 9s. But if your original dataset is already sorted, this is likely to happen (and the boundary distribution is guaranteed to be bad) when you use a high frequency with a small number of samples.
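
As a rough sketch of the fix (the exact values are assumptions I am making for illustration, not values from your job; tune them to your data size), sample a small fraction of the records but keep far more samples than reducers, for example:

    // Illustrative values only: sample ~1% of records, keep up to 10,000 samples,
    // and read from at most 10 input splits. Drop this in place of the sampler
    // lines in your driver above.
    double freq = 0.01;
    int numSamples = 10000;
    int maxSplits = 10;

    InputSampler.Sampler<LongWritable, Text> sampler =
            new InputSampler.RandomSampler<LongWritable, Text>(freq, numSamples, maxSplits);
    InputSampler.writePartitionFile(job, sampler);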

This blog post has lots of details on Sampling and TotalOrderPartitioning.

Nicomak