Take the basic word count example from the mrjob site:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

From the command line, this example can be run as python mrJobFilename.py mrJobFilename.py. This runs the program on its own source file and counts the characters, words and lines in it.

Given this example, what if I want to pass in an argument, say minCount = 3? With this argument, the reducer should only yield keys whose total is greater than minCount.

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # minCount is not defined anywhere yet; it is the value I want to pass in.
        X = sum(values)
        if X > minCount:
            yield key, X


if __name__ == '__main__':
    MRWordFrequencyCount.run()

I tried passing the value as an extra argument: python mrJobFilename.py mrJobFilename.py 3, but I get the error OSError: Input path 3 does not exist!

I also tried reading it from sys.argv:

if __name__ == '__main__':
    minWord = sys.argv[1]
    MRWordFrequencyCount.run()

When I run this with python mrJobFilename.py mrJobFilename.py < 3, I get the error bash: 3: No such file or directory. Without the <, I get the same "Input path does not exist" error as before.

Finally, I tried passing the value in a second csv file. The file has 2 lines and looks like this:

minWord
3

The file is meant to carry the parameter into mrjob, since mrjob keeps complaining that the second argument is not an input file. I use mapper_raw to try to load it, but I get a strange error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8f in position 22: invalid start byte

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper_raw(self, input_arg1, input_arg2):
        import csv
        f = open(input_path2)
        reader = csv.reader(f)
        next(reader) # skip header
        yield(next(reader))

    def steps(self):
          return [
              MRStep(mapper_raw=self.mapper_raw)
          ]


if __name__ == '__main__':
    MRWordFrequencyCount.run()

How can I pass an argument to an mrjob job? Ultimately I need this to pass in the parameters of differential equation systems that I want to solve in parallel.

Frank

1 Answer

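You can follow the mrjob documentation and define your own command-line option: override configure_args and call add_passthru_arg, which works much like add_argument in argparse.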

So your code should look something like this:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def configure_args(self):
        super(MRWordFrequencyCount, self).configure_args()
        # type=int converts the command-line string to a number;
        # default=0 keeps every key when the option is omitted.
        self.add_passthru_arg("-m", "--minCount", type=int, default=0,
                              help="only yield keys whose total exceeds this value")

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # values is a one-shot iterator, so sum it once and reuse the result.
        total = sum(values)
        if total > self.options.minCount:
            yield key, total


if __name__ == '__main__':
    MRWordFrequencyCount.run()

Use your argument with self.options.minCount.

Run command:

python code.py input.txt --minCount 4
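
To see how the option value reaches the comparison in the reducer, here is a minimal sketch that uses plain argparse instead of mrjob, on the assumption that add_passthru_arg accepts the same keyword arguments as add_argument. The parser and the totals dictionary are made up for illustration and are not part of mrjob. The point is that type=int is what turns the string "4" from the command line into a number; without it, comparing an int total against a str raises a TypeError in Python 3.

```python
import argparse

# Hypothetical stand-in for the job's option parser. mrjob builds its own
# parser internally; this only mirrors the keyword arguments used above.
parser = argparse.ArgumentParser()
parser.add_argument("input_path")
parser.add_argument("-m", "--minCount", type=int, default=0,
                    help="only keep keys whose total exceeds this value")

# Equivalent of running: python code.py input.txt --minCount 4
options = parser.parse_args(["input.txt", "--minCount", "4"])

# Made-up totals standing in for what the reducer would compute per key.
totals = {"chars": 120, "words": 21, "lines": 3}

# Same filter as in the reducer: keep a key only if its total is above minCount.
kept = {key: total for key, total in totals.items() if total > options.minCount}
print(kept)
```

With these made-up totals, only "chars" and "words" pass the filter, because 3 is not greater than 4.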
huy