Map/reduce two-stage ordering of counts

Question

This Python 3 program attempts to produce a word-frequency list from a text file using map/reduce. I would like to know how to order the word counts (the 'count' value in the second reducer's yield statement) so that the largest counts appear last. Currently, the tail of the results looks like this:

"0002"  "wouldn"
"0002"  "wrap"
"0002"  "x"
"0002"  "xxx"
"0002"  "young"
"0002"  "zone"

For context, I pass a plain-text file to the program like this:

python MapReduceWordFreqCounter.py book.txt

Here is the code for MapReduceWordFreqCounter.py:

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# match runs of word characters and apostrophes (whitespace and other punctuation are skipped)
WORD_REGEXP = re.compile(r"[\w']+")

class MapReduceWordFreqCounter(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(mapper=self.mapper_make_counts_key,
                   reducer=self.reducer_output_words)
        ]

    def mapper_get_words(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer_count_words(self, word, values):
        yield word, sum(values)

    def mapper_make_counts_key(self, word, count):
        yield str(count).rjust(4,'0'), word

    def reducer_output_words(self, count, words):
        for word in words:
            yield count, word

if __name__ == '__main__':
    MapReduceWordFreqCounter.run()

score 1 · Answer 1 · answered Feb 04 '17 at 14:14

You have to set a custom sort comparator for your job.

If you wrote it in Java, it would look like this:

job.setSortComparatorClass(SortKeyComparator.class);

and you would have to provide a class that reverses the sort order:

public class SortKeyComparator extends Text.Comparator {

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return (-1) * super.compare(b1, s1, l1, b2, s2, l2);
    }
}

I guess the Python Hadoop API has a similar way of doing this.
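To illustrate what such a comparator does, here is a minimal sketch in plain Python. It is not a Hadoop or mrjob API; `compare_keys` and `reversed_compare` are hypothetical helper names that only mirror the negated comparison in the Java class above.

```python
from functools import cmp_to_key


def compare_keys(a, b):
    """Natural ordering: negative if a < b, zero if equal, positive if a > b."""
    return (a > b) - (a < b)


def reversed_compare(a, b):
    """Negate the natural ordering, as SortKeyComparator does in Java."""
    return -compare_keys(a, b)


# Zero-padded count keys like those produced by mapper_make_counts_key.
keys = ["0002", "1828", "0970", "1292"]

ascending = sorted(keys, key=cmp_to_key(compare_keys))
descending = sorted(keys, key=cmp_to_key(reversed_compare))

print(ascending)   # smallest key first
print(descending)  # largest key first
```

Note that a reversing comparator puts the largest key first, so for "largest last" the natural (non-reversed) ordering is the one that is wanted.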

score 0 · Answer 2 · answered Feb 05 '17 at 23:26

For the MRJob Reduce step, there is no expectation that the results should be ordered by the key 'count'.
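A toy simulation may help show why. The sketch below is plain Python, not mrjob code, and `toy_partition` is a made-up partitioner rather than the one Hadoop actually uses. It assumes four reduce tasks, each of which sorts only the keys it receives.

```python
import heapq

# (count_key, word) pairs as emitted by mapper_make_counts_key.
pairs = [
    ("0002", "zone"), ("0970", "of"), ("1191", "a"), ("1292", "the"),
    ("1420", "your"), ("1561", "you"), ("1828", "to"),
]

NUM_REDUCERS = 4  # assumed number of reduce tasks


def toy_partition(key, num_reducers):
    """Hypothetical partitioner: NOT Hadoop's real hash partitioner."""
    return int(key) % num_reducers


# Route each pair to a reducer, then sort within each reducer by key.
partitions = [[] for _ in range(NUM_REDUCERS)]
for key, word in pairs:
    partitions[toy_partition(key, NUM_REDUCERS)].append((key, word))
for part in partitions:
    part.sort()

# Concatenating the per-reducer outputs gives sorted sections, not a sorted file.
concatenated = [pair for part in partitions for pair in part]

# Merging the sorted sections afterwards yields one global ordering.
merged = list(heapq.merge(*partitions))

print([key for key, _ in concatenated])
print([key for key, _ in merged])
```

Each section is ordered on its own, but only the merge step produces a single ordered list.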

Here, mrjob lets you run the same code locally or on an AWS Elastic MapReduce (EMR) cluster. It does the heavy lifting for execution, using the YARN API and Hadoop Streaming to move data between the map and reduce tasks.

For example, to run locally, you can run this MRJob as: python MapReduceWordFreqCounter.py books.txt > counts.txt

To run on a single EMR node: python MapReduceWordFreqCounter.py -r emr books.txt > counts.txt

To run on 25 EMR nodes: python MapReduceWordFreqCounter.py -r emr --num-ec2-instances=25 books.txt > counts.txt

To troubleshoot the distributed EMR job (substitute your job ID): python -m mrjob.tools.emr.fetch_logs --find-failure j-1NXEMBAEQFDFT

Here, when running on four nodes, the reduced results are ordered, but they appear in four separate sections of the output file. It turns out that coercing the reducer into producing a single ordered file has no performance advantage over simply ordering the results in a post-run step. Thus, one way to solve this specific problem is to use the Linux sort command:

sort word_frequency_list.txt > sorted_word_frequency_list.txt

The tail of the sorted output looks like this:

"0970"  "of"
"1191"  "a"
"1292"  "the"
"1420"  "your"
"1561"  "you"
"1828"  "to"
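The same post-run ordering can be sketched in Python. This assumes the output file uses mrjob's default format of a JSON-encoded key and value separated by a tab. `sort_by_count` is a hypothetical helper, and the sample lines (including the five-digit count) are made up for illustration.

```python
import json


def sort_by_count(lines):
    """Return (count, word) pairs sorted so the largest count comes last."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        raw_count, raw_word = line.split("\t", 1)
        # Keys and values are assumed to be JSON-encoded, e.g. "0002" and "zone".
        count = int(json.loads(raw_count))
        word = json.loads(raw_word)
        pairs.append((count, word))
    # Sort numerically so the result does not depend on zero-padding width.
    return sorted(pairs)


# Made-up sample lines in the assumed output format.
sample = ['"1828"\t"to"\n', '"0002"\t"zone"\n', '"12000"\t"e"\n', '"0970"\t"of"\n']
for count, word in sort_by_count(sample):
    print(count, word)
```

Sorting on the integer value rather than the padded string matters once a count exceeds four digits: as strings, "12000" sorts before "1828". To apply this to a real file, an open file object can be passed in place of `sample`.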

More generally, there are frameworks on top of Hadoop that are well suited to this sort of processing. For this problem, Pig can be used to read in the processed file and order the counts.

Pig can be run via the Grunt shell or via Pig scripts written in Pig Latin (in which aliases and function names are case sensitive, while keywords are not). Pig scripts generally follow this template:

1) A LOAD statement to read data
2) A series of 'transformation' statements to process data
3) A DUMP/STORE statement to display or save results

To order the counts using Pig:

reducer_count_output = LOAD 'word_frequency_list.txt' using PigStorage('  ') AS (word_count:chararray, word_name:chararray);
counts_words_ordered = ORDER reducer_count_output BY word_count ASC;
STORE counts_words_ordered INTO 'counts_words_ordered' USING PigStorage(':', '-schema');
