I am learning Hadoop MapReduce and working with the Java API. I learnt that TotalOrderPartitioner is used to sort the output 'globally' by key across the cluster, and that it needs a partition file (generated using InputSampler):
job.setPartitionerClass(TotalOrderPartitioner.class);
InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<Text, Text>(0.1, 200);
InputSampler.writePartitionFile(job, sampler);
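To frame the questions below, here is a minimal plain-Java sketch of the range-partitioning idea as I understand it. It uses no Hadoop classes and is not meant to show how TotalOrderPartitioner is actually implemented. The class name RangePartitionSketch, the split points "h" and "p" (standing in for the contents of a partition file), and the sample keys are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: range-partition keys by sorted split points,
// then sort each partition independently (as each reducer would).
public class RangePartitionSketch {

    // Returns the partition index for a key, given split points in sorted order.
    // Keys below splits[0] go to partition 0, keys in [splits[0], splits[1]) go to 1, and so on.
    public static int partitionFor(String key, String[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Assigns every key to its partition and sorts each partition on its own.
    public static List<List<String>> partitionAndSort(List<String> keys, String[] splits) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i <= splits.length; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partitionFor(key, splits)).add(key);
        }
        for (List<String> part : parts) {
            Collections.sort(part);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] splits = {"h", "p"}; // assumed split points, two splits give three partitions
        List<String> keys = Arrays.asList("pear", "apple", "kiwi", "zucchini", "fig", "mango");
        List<List<String>> parts = partitionAndSort(keys, splits);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("partition " + i + ": " + parts.get(i));
        }
    }
}
```

In this sketch every key in partition 0 sorts before every key in partition 1, and so on, so reading the partitions in index order yields one fully sorted sequence even though the data sits in separate lists.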
I have a couple of questions:
What exactly does 'sorted globally' mean here? How is the output sorted, given that there are still multiple output part files distributed across the cluster?
What happens if we do not supply the partition file? Is there a default way to handle this situation?