
I want to use MapReduce to filter a huge dataset for rare entities satisfying some criteria. I could speed this up a lot by terminating reducers as soon as an entity violates the criteria, since any further computation is spent on entities I'm not interested in.

To make up an example, say I have a corpus with billions of articles, and I want to return only articles with fewer than 100 words. The vast majority of articles have >100,000 words, so I can skip most of the work by terminating the reducers once they reach the stopping criterion (word_count >= 100).
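For concreteness, the straightforward version of this filter (without any early termination) might look something like the mrjob sketch below; the MRShortArticles name and the tab-separated article_id/text input format are assumptions for illustration:

from mrjob.job import MRJob

class MRShortArticles(MRJob):
    # Naive filter: tokenizes every article in full, even though most
    # articles exceed the 100-word limit almost immediately.
    def mapper(self, _, line):
        article_id, text = line.split('\t', 1)  # assumed input format
        if len(text.split()) < 100:
            yield (article_id, text)

if __name__ == '__main__':
    MRShortArticles.run()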


1 Answer


This doesn't terminate the reducer, but it does stop new records from being sent to it once a feature crosses the threshold. It works by maintaining a running count of the features in a class-level dictionary:

from mrjob.job import MRJob

class Mr_Count_Words(MRJob):
    # Class-level dict: each task process keeps its own running counts.
    feature_counts = {}

    def mapper(self, _, line):
        ...

Then, wherever you compute the features, you can check the dictionary to see whether the count has crossed the threshold:

# Increment the running count for this feature.
try:
    self.feature_counts[feature_name] += 1
except KeyError:
    self.feature_counts[feature_name] = 1

# Once a feature crosses the threshold, stop emitting it.
if self.feature_counts[feature_name] > feature_thresh:
    return None
else:
    yield (feature_name, 1)
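For what it's worth, here is one way the pieces might fit together as a complete, runnable job. The reducer, the word-splitting mapper, and the MRRareWords name are my own additions for illustration, not part of the answer above:

from mrjob.job import MRJob

class MRRareWords(MRJob):
    # Per-process state: each mapper task keeps its own counts, so the
    # short-circuit applies within a single task rather than globally.
    feature_counts = {}
    feature_thresh = 100  # assumed threshold for illustration

    def mapper(self, _, line):
        for feature_name in line.split():
            self.feature_counts[feature_name] = (
                self.feature_counts.get(feature_name, 0) + 1
            )
            # Stop emitting a feature once this task has seen it
            # more than feature_thresh times.
            if self.feature_counts[feature_name] <= self.feature_thresh:
                yield (feature_name, 1)

    def reducer(self, feature_name, counts):
        # Final filter: keep only features that stayed rare overall.
        total = sum(counts)
        if total <= self.feature_thresh:
            yield (feature_name, total)

if __name__ == '__main__':
    MRRareWords.run()

Note that because feature_counts lives in each task's process, the short-circuit only cuts down work per task; the reducer's final check is what actually enforces the criterion globally.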