
If the key distribution in a file is such that 99% of the words start with 'A' and only 1% start with 'B' through 'Z', and you have to count the number of words starting with each letter, how would you distribute your keys efficiently?

Josh Crozier

1 Answer


SOLUTION 1: I think the way to go is a combiner, rather than a partitioner. A combiner will aggregate the local sums of words starting with the letter 'A' and then emit the partial sum (rather than always emitting 1) to the reducers.
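As a minimal sketch (class and method names are my own, not from the original post), a letter-count job that reuses its reducer as a combiner could look like this:

```java
// Letter-count job: the reducer is reused as a combiner, so each mapper emits
// local partial sums instead of a separate 1 for every word starting with 'A'.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LetterCount {

    public static class LetterMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text letter = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    // the intermediate key is the first letter of the word
                    letter.set(word.substring(0, 1).toUpperCase());
                    context.write(letter, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "letter count");
        job.setJarByClass(LetterCount.class);
        job.setMapperClass(LetterMapper.class);
        job.setCombinerClass(SumReducer.class);   // local aggregation of the 'A' counts
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```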

SOLUTION 2: However, if you insist on using a custom partitioner for this, you can simply handle words starting with the letter 'A' in a separate reducer from all other words, i.e., dedicate a reducer only to words starting with the letter 'A'.
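A sketch of such a partitioner (the class name is illustrative, and it assumes the intermediate key is a Text key whose first character is the word's first letter):

```java
// Reducer 0 handles only keys starting with 'A'; everything else is
// hash-partitioned over the remaining reducers.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class LetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0; // nothing to separate with a single reducer
        }
        String k = key.toString();
        if (!k.isEmpty() && Character.toUpperCase(k.charAt(0)) == 'A') {
            return 0; // dedicated reducer for the 'A' keys
        }
        // spread the remaining keys over reducers 1..numPartitions-1
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
```

You would register it with `job.setPartitionerClass(LetterPartitioner.class)` and run the job with at least two reduce tasks.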

SOLUTION 3: Moreover, if you don't mind "cheating" a little bit, you can define a counter for words starting with letter 'A' and increment it in the map phase. Then, just ignore those words (there is no need to send them through the network) and use the default partitioner for the other words. When the job finishes, retrieve the value of the counter.
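A sketch of SOLUTION 3 (the enum and counter names are my own): the mapper counts the 'A' words with a counter and only emits the other words to the shuffle.

```java
// 'A' words are counted locally via a counter and never sent over the network;
// all other words are emitted as usual and go through the default partitioner.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkipAMapper extends Mapper<Object, Text, Text, IntWritable> {

    public enum LetterCounter { WORDS_STARTING_WITH_A }

    private final static IntWritable ONE = new IntWritable(1);
    private final Text letter = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (Character.toUpperCase(word.charAt(0)) == 'A') {
                // counted here; no intermediate record is written for 'A' words
                context.getCounter(LetterCounter.WORDS_STARTING_WITH_A).increment(1);
            } else {
                letter.set(word.substring(0, 1).toUpperCase());
                context.write(letter, ONE);
            }
        }
    }
}
```

After the job completes, the driver can read the count with `job.getCounters().findCounter(SkipAMapper.LetterCounter.WORDS_STARTING_WITH_A).getValue()`.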

SOLUTION 4: If you don't mind "cheating" even more, define 26 counters, one for each letter, and just increment them in the map phase, according to the first letter of the current word. You can use no reducers (set the number of reducers to 0). This will save all the sorting and shuffling. When the job finishes, retrieve the value of all the counters.
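A sketch of SOLUTION 4 (the group and counter names are my own), to be run with `job.setNumReduceTasks(0)`:

```java
// Map-only job: one counter per initial letter, no intermediate output at all,
// so there is no sorting or shuffling.
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CounterOnlyMapper extends Mapper<Object, Text, NullWritable, NullWritable> {

    private static final String GROUP = "FIRST_LETTER";

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                char first = Character.toUpperCase(word.charAt(0));
                if (first >= 'A' && first <= 'Z') {
                    // one counter per letter; nothing is ever written to the context
                    context.getCounter(GROUP, String.valueOf(first)).increment(1);
                }
            }
        }
    }
}
```

When the job finishes, the driver can iterate over `job.getCounters().getGroup("FIRST_LETTER")` to retrieve all 26 counts.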

vefthym
  • 7,422
  • 6
  • 32
  • 58
  • The point is, in this scenario all the values for words starting with "a" go to one reducer, and they constitute 99% of the whole data, which is quite large. In that case, how can we use a partitioner to distribute the "a" keys to different reducers to bring efficiency? – Anjul Tiwari May 13 '15 at 12:28
  • If you want to sum up the counts of words starting with the same letter, I believe that the first letter constitutes the intermediate key, i.e., those words HAVE to be handled by the same reducer. What I described (in SOLUTION 2) is how you can avoid burdening this reducer even more. I don't see any other way to aggregate all the counts, except the ones I provided. If you have another solution, I will be glad to discuss how it can be distributed evenly. Do you have any code/idea that you want to optimize (if so, please add it to your original post)? – vefthym May 13 '15 at 12:31
  • I don't have code yet, as I need to know the best logic to solve it. My idea is that if we decide in the partitioner that words of different lengths starting with "a" go to different reducers, then we can distribute our data evenly among reducers. Is that a fair solution? The whole emphasis is that one reducer should not end up with 99% of the data and the other reducers with only 1%. Although 99% of the data consists of words starting with "a", they should not go to only one reducer but to many reducers. – Anjul Tiwari May 13 '15 at 12:35
  • This would distribute the load fairly (maybe), but then, how would you get the counts of words starting with A from different reducers? With an additional MR job? – vefthym May 13 '15 at 12:37
  • Then you can do what you described, or even partition based on the first and second, or first and third, or first and last letter (a sketch of this idea follows below). However, using an additional MR job might bring unnecessary overhead. I am glad that I helped. Good luck! – vefthym May 13 '15 at 12:40
  • I have one more doubt: if 99% of the values are associated with a single key, would the MapReduce job fail because of the huge data? – Anjul Tiwari May 13 '15 at 12:42
  • That could be true, yes, depending on the size of the data and the resources available. – vefthym May 13 '15 at 12:43
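To make the idea discussed in the comments concrete, here is a minimal sketch (my own, not from the comments) of a partitioner that spreads the "a" keys by their second letter. It assumes the intermediate key is the whole word (or at least its first two letters) rather than just the first letter, so each reducer produces only a partial count for 'A', and those partial counts then have to be merged by a second MR job or by the driver.

```java
// Keys starting with 'A' are spread across reducers by their second letter, so
// no single reducer receives all 99% of the data; the per-reducer partial 'A'
// counts must then be summed in a follow-up step.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SpreadAPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString().toUpperCase();
        if (!k.isEmpty() && k.charAt(0) == 'A') {
            // distribute 'A' words by their second letter (length would also work)
            char second = k.length() > 1 ? k.charAt(1) : 'A';
            return second % numPartitions;
        }
        // everything else uses plain hash partitioning
        return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```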