
I want to use MapReduce to filter a huge dataset for rare entities satisfying some criteria. I could speed this up a lot by terminating reducers as soon as an entity violates the criteria, since any further computation is spent on entities I'm not interested in.

To make up an example, say I have a corpus with billions of articles, and I want to return only articles with fewer than 100 words. The vast majority of articles have >100,000 words, so I can skip most of the work by terminating the reducers once they reach the stopping criterion (word_count >= 100).
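For concreteness, the straightforward version of this filter (without any early termination) might look something like the mrjob sketch below; the MRShortArticles name and the tab-separated article_id/text input format are assumptions for illustration:

from mrjob.job import MRJob

class MRShortArticles(MRJob):
    # Naive filter: tokenizes every article in full, even though most
    # articles exceed the 100-word limit almost immediately.
    def mapper(self, _, line):
        article_id, text = line.split('\t', 1)  # assumed input format
        if len(text.split()) < 100:
            yield (article_id, text)

if __name__ == '__main__':
    MRShortArticles.run()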


1 Answer


This doesn't terminate the reducer, but it does stop new records from being sent to it once a feature crosses the threshold. It works by maintaining a running count of the features in a class-level dictionary:

from mrjob.job import MRJob

class Mr_Count_Words(MRJob):
    # Class-level dict: each task process keeps its own running counts.
    feature_counts = {}

    def mapper(self, _, line):
        ...

Then, wherever you compute the features, you can check the dictionary to see whether the count has crossed the threshold:

# Increment the running count for this feature.
try:
    self.feature_counts[feature_name] += 1
except KeyError:
    self.feature_counts[feature_name] = 1

# Once a feature crosses the threshold, stop emitting it.
if self.feature_counts[feature_name] > feature_thresh:
    return None
else:
    yield (feature_name, 1)
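For what it's worth, here is one way the pieces might fit together as a complete, runnable job. The reducer, the word-splitting mapper, and the MRRareWords name are my own additions for illustration, not part of the answer above:

from mrjob.job import MRJob

class MRRareWords(MRJob):
    # Per-process state: each mapper task keeps its own counts, so the
    # short-circuit applies within a single task rather than globally.
    feature_counts = {}
    feature_thresh = 100  # assumed threshold for illustration

    def mapper(self, _, line):
        for feature_name in line.split():
            self.feature_counts[feature_name] = (
                self.feature_counts.get(feature_name, 0) + 1
            )
            # Stop emitting a feature once this task has seen it
            # more than feature_thresh times.
            if self.feature_counts[feature_name] <= self.feature_thresh:
                yield (feature_name, 1)

    def reducer(self, feature_name, counts):
        # Final filter: keep only features that stayed rare overall.
        total = sum(counts)
        if total <= self.feature_thresh:
            yield (feature_name, total)

if __name__ == '__main__':
    MRRareWords.run()

Note that because feature_counts lives in each task's process, the short-circuit only cuts down work per task; the reducer's final check is what actually enforces the criterion globally.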