
I am trying to pre-process an XML file to extract certain nodes before feeding it into MapReduce. I have the following code:

from mrjob.compat import jobconf_from_env
from mrjob.job import MRJob
from mrjob.util import cmd_line, bash_wrap

class MRCountLinesByFile(MRJob):
    def configure_options(self):
        super(MRCountLinesByFile, self).configure_options()
        self.add_file_option('--filter')

    def mapper_cmd(self):
        cmd = cmd_line([self.options.filter, jobconf_from_env('mapreduce.map.input.file')])
        return cmd



if __name__ == '__main__':
    MRCountLinesByFile.run()

And on the command line, I type:

python3 test_job_conf.py --filter ./filter.py -r local < test.txt

test.txt is a normal XML file like the one here, while filter.py is a script that extracts all the title information.
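
For context, filter.py is roughly along these lines (the actual script isn't shown in the question; this is a hypothetical sketch that uses xml.etree.ElementTree and takes the input path as its first command-line argument):

#!/usr/bin/env python3
# Hypothetical sketch of filter.py: take an XML file path on the command
# line and print the text of every <title> element.
import sys
import xml.etree.ElementTree as ET

filename = sys.argv[1]

with open(filename) as f:
    tree = ET.parse(f)

for title in tree.iter('title'):
    print(title.text)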

However, I am getting the following errors:

Creating temp directory /tmp/test_job_conf.vagrant.20160406.042648.689625
Running step 1 of 1...
Traceback (most recent call last):
  File "./filter.py", line 8, in <module>
    with open(filename) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'None'
Step 1 of 1 failed: Command '['./filter.py', 'None']' returned non-zero exit status 1

It looks like mapreduce.map.input.file renders None in this case. How can I get the mapper_cmd function to read the file that mrjob is currently processing?


1 Answer


As per my understanding, your self.add_file_option should be given the path to your file.

self.add_file_option('--items', help='Path to u.item')

I do not quite get your scenario, but here is my understanding. You use the configure option to make sure a given file is sent to all the mappers, for example when you want to do an ancillary lookup on data in a file other than the source. That ancillary lookup file is made available by self.add_file_option('--items', help='Path to u.item').
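
A minimal sketch of that pattern looks something like this (the class name, the --items option, the u.item file, and the mapper logic are all just placeholders, not anything from your job):

from mrjob.job import MRJob

class MRLookupExample(MRJob):
    def configure_options(self):
        super(MRLookupExample, self).configure_options()
        # Ship the lookup file to every task's working directory
        self.add_file_option('--items', help='Path to u.item')

    def mapper_init(self):
        # Build an in-memory lookup from the shipped file
        self.lookup = {}
        with open(self.options.items) as f:
            for line in f:
                fields = line.split('|')
                self.lookup[fields[0]] = fields[1]

    def mapper(self, _, line):
        key = line.split('\t')[0]
        yield self.lookup.get(key, 'unknown'), 1

if __name__ == '__main__':
    MRLookupExample.run()

You would then run it as, say, python3 lookup_job.py --items u.item input.txt.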

To preprocess something before a reducer or a mapper phase, you use reducer_init or mapper_init. These init steps also need to be mentioned in your steps function, as shown below for example.

# MRStep is imported from mrjob.step
def steps(self):
    return [
        MRStep(mapper=self.mapper_get_name,
               reducer_init=self.reducer_init,
               reducer=self.reducer_count_name),
        MRStep(reducer=self.reducer_find_maxname)
    ]

Within your init function you do the actual pre-processing that needs to happen before the mapper or reducer runs. Say, for example, you open a file xyz and build a lookup from the first field to the second field, which the reducer then uses.

def reducer_init(self):
    self.movieNames = {}
    with open("xyz") as f:
        for line in f:
            fields = line.split('|')
            self.movieNames[fields[0]] = fields[1]
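
The reducer that runs after this init can then use that dictionary, for instance like this (the method name reducer_count_name and the counting logic are only illustrative):

def reducer_count_name(self, key, values):
    # Translate the key using the lookup built in reducer_init
    yield self.movieNames.get(key, key), sum(values)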

Hope this helps!!