Given the basic example from the mrjob site for a word count program:
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()
From the command line, this example can be run as python mrJobFilename.py mrJobFilename.py. This should run the program on itself and count the characters, words, and lines in the file.
Given this example, what if I want to pass in an argument, say minCount = 3? With this argument, the reducer would only return counts greater than minCount.
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # values is an iterator, so sum it once and reuse the result
        X = sum(values)
        # minCount is not defined anywhere yet; it is the value I want to pass in
        if X > minCount:
            yield key, X


if __name__ == '__main__':
    MRWordFrequencyCount.run()
I tried passing minCount as an argument: python mrJobFilename.py mrJobFilename.py 3, but I get the error OSError: Input path 3 does not exist!
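My understanding of this first error, sketched with the standard library's argparse rather than mrjob itself: if every bare (positional) command-line argument is collected as an input path, then the trailing 3 gets looked up as a file. The parser below and the --min-count option name are hypothetical and only for illustration; they are not mrjob's actual parser or a known mrjob flag.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for a job's command-line parser (NOT mrjob's code).
    parser = argparse.ArgumentParser()
    # Assumption: every positional argument is collected as an input path.
    parser.add_argument("input_paths", nargs="*")
    # Hypothetical named option; a value given this way is not mistaken for a path.
    parser.add_argument("--min-count", type=int, default=0)
    return parser


parser = build_parser()

# Same shape as: python mrJobFilename.py mrJobFilename.py 3
positional = parser.parse_args(["mrJobFilename.py", "3"])
print(positional.input_paths)  # "3" lands in the list of input paths
print(positional.min_count)    # the threshold keeps its default

# Same input file, with the threshold given as a named option instead
named = parser.parse_args(["mrJobFilename.py", "--min-count", "3"])
print(named.input_paths)
print(named.min_count)
```

If the real parser behaves like this sketch, the value would have to arrive as a named option rather than as a bare trailing argument.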
I also tried setting a variable with sys.argv:
if __name__ == '__main__':
    minWord = sys.argv[1]
    MRWordFrequencyCount.run()
When run with python mrJobFilename.py mrJobFilename.py < 3 I get the error bash: 3: No such file or directory. If I don't use the <, I get the previous "input path does not exist" error.
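To check my own sys.argv attempt, here is how I believe the argument list is laid out for the command without the <. The list is written out by hand as a stand-in for the real sys.argv (an assumption, so that the snippet runs without mrjob). Index 1 is the input file rather than 3, and every entry is a string.

```python
# Hand-written stand-in for sys.argv when running:
#   python mrJobFilename.py mrJobFilename.py 3
argv = ["mrJobFilename.py", "mrJobFilename.py", "3"]

script_name = argv[0]  # the script being run
first_arg = argv[1]    # the input file, not the threshold
last_arg = argv[-1]    # the threshold, still a string

# Convert to int before comparing it with a sum of counts
min_count = int(last_arg)

print(first_arg)
print(type(last_arg).__name__, min_count)
```

With the < in place, the shell itself tries to open a file named 3 and feed it to the program on standard input before Python starts, which would explain why that error comes from bash rather than from the job.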
Finally, I tried passing in a second CSV file. It has two lines and looks like this:
minWord
3
The file is meant to pass the parameter to mrjob, since anything else I give as a second argument is rejected for not being an input file. I use mapper_raw to try to load it, but I get a strange error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8f in position 22: invalid start byte
from mrjob.job import MRJob
from mrjob.step import MRStep  # needed for steps() below


class MRWordFrequencyCount(MRJob):

    def mapper_raw(self, input_arg1, input_arg2):
        import csv
        f = open(input_arg2)
        reader = csv.reader(f)
        next(reader)  # skip header
        yield(next(reader))

    def steps(self):
        return [
            MRStep(mapper_raw=self.mapper_raw)
        ]


if __name__ == '__main__':
    MRWordFrequencyCount.run()
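I have not found where the UnicodeDecodeError comes from. The snippet below only reproduces the same kind of failure with made-up bytes, to show what the message means: byte 0x8f cannot begin a UTF-8 character, so decoding fails when it appears where a new character should start. The byte string is invented for illustration and is not taken from my files.

```python
# Made-up data: 22 ASCII bytes followed by 0x8f (not from the real input).
data = b"a" * 22 + b"\x8f"

try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # Keep the exception so its position and reason can be inspected
    error = exc

print(error)
```

This suggests that whatever is being read at that point is not plain UTF-8 text, although I cannot say which file that is.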
How can I pass an argument to an mrjob job? Ultimately I need this to pass parameters for systems of differential equations that I want to solve in parallel.