
I'm trying to MapReduce logs, and I'd like to filter all the logs in an S3 bucket by filename before processing them in EMR. Also, some files are tar archives, and I'd like mrjob to uncompress them and then filter the files inside so that only the relevant one is parsed.

Any idea how to filter an S3 bucket by filename from mrjob? I found the `mapper_pre_filter` method, but it only filters the input line by line.

Adrien Lemaire
  • I'm not sure about filtering of files, but you can use an `s3distcp` job to filter and move the required files into a specific bucket. – mr0re1 Jun 16 '14 at 05:48

1 Answer


There are several possibilities here.

  • When you run a job with mrjob, you can specify the input files on the command line. If you only want to process .json files, you can do that by specifying `*.json`; if you want to process everything starting with Logs6-24-14 and ending with .txt, you specify `Logs6-24-14*.txt`.

  • Alternatively, you can pass the files you want as data; this involves Unix pipes and can be incredibly powerful.

  • Lastly, and possibly most flexibly, you could write Python code in your mrjob file that runs before the actual job and pre-processes the data. You could do almost anything that way. It would go into the code here:

if __name__ == '__main__':
    arbitrary_preprocessing_code()   # runs first, before the MapReduce job is launched
    MRJobName.run()                  # your MRJob subclass defined in this file
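
For example, that pre-processing step could fetch and filter the input itself. The sketch below only illustrates the idea and is not code from this answer: it assumes boto3 is installed with AWS credentials configured, and the function name `preprocess_input`, the bucket, prefix, filename patterns, and output directory are all made-up placeholders. It downloads only the tar files whose names match, then extracts only the members you actually want to parse.

import fnmatch
import os
import tarfile

import boto3   # assumption: boto3 is available and AWS credentials are set up


def preprocess_input(bucket='my-log-bucket', prefix='logs/',
                     key_pattern='Logs6-24-14*.tar', member_pattern='*.json',
                     out_dir='filtered_input'):
    """Download matching tar files from S3 and extract only the wanted members."""
    os.makedirs(out_dir, exist_ok=True)
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            # filter the bucket contents by filename
            if not fnmatch.fnmatch(os.path.basename(key), key_pattern):
                continue
            local_tar = os.path.join(out_dir, os.path.basename(key))
            s3.download_file(bucket, key, local_tar)
            # uncompress the tar archive and keep only the relevant files
            with tarfile.open(local_tar) as tar:
                for member in tar.getmembers():
                    if member.isfile() and fnmatch.fnmatch(member.name, member_pattern):
                        tar.extract(member, path=out_dir)

A function like this could play the role of `arbitrary_preprocessing_code()` above; you would then point the job at `out_dir` (or upload the extracted files back to S3) before `MRJobName.run()` launches the job.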
David Manheim