I am trying to implement a MapReduce job with three steps, and after each step I need the data from all of the steps so far. Does anyone have an example/idea of how I can save the results of the mappers or reducers to disk in mrjob?
1 Answer
You can pass multiple inputs into a job; simply take the output of the previous job as input.
When you say you'd like to save results to disk, it sounds like you're relying on the output being streamed back to stdout? That behavior is just a convenience (and can be turned off); with mrjob, everything bounces off disk anyway.
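To make the snippet below concrete: firstMR and secondMR are assumed to be ordinary MRJob subclasses of your own; a minimal sketch of one (the word-count logic is just a placeholder) might look like:

from mrjob.job import MRJob

class firstMR(MRJob):
    # placeholder step logic -- substitute your real mapper/reducer
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)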
For a two-stage job you could do this:
# mode is a runner name such as 'inline', 'local' or 'hadoop';
# inputDir, outputDir, anyOtherInput and finalOutputDir are path placeholders
job1 = firstMR(['-r', mode, inputDir, '-o', outputDir, '--no-output'])
job1.set_up_logging()
with job1.make_runner() as runner1:
    runner1.run()
    firstOutput = runner1.get_output_dir()  # here this is just outputDir

job2 = secondMR(['-r', mode, firstOutput, anyOtherInput, '-o', finalOutputDir])
job2.set_up_logging()
with job2.make_runner() as runner2:
    runner2.run()
Some things to note:
- when running on Hadoop, all of the directories should probably be of the form hdfs://some/path/
- any arguments to the job that aren't flags, and aren't preceded by an option, are treated as input files or directories
- use --no-output to stop the output coming back on stdout. I used it in the first step above, since you probably don't want the interim results, but left it out of the second to demonstrate the difference. For your three-step case, you might keep it on the first two steps and leave it off the third (so the final results stream back), or alternatively write the output of the third step to a folder you can read back easily (see the sketch below).
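For the three-step case in the question, the same pattern simply extends; here's a rough sketch (thirdMR and the out1/out2 path variables are placeholders), writing each step to an explicit directory so every earlier output is still on disk for the later steps:

job1 = firstMR(['-r', mode, inputDir, '-o', out1, '--no-output'])
job1.set_up_logging()
with job1.make_runner() as runner1:
    runner1.run()

# step 2 reads the original input plus step 1's output
job2 = secondMR(['-r', mode, inputDir, out1, '-o', out2, '--no-output'])
job2.set_up_logging()
with job2.make_runner() as runner2:
    runner2.run()

# step 3 reads everything produced so far; no --no-output here,
# so the final results also stream back on stdout
job3 = thirdMR(['-r', mode, inputDir, out1, out2, '-o', finalOutputDir])
job3.set_up_logging()
with job3.make_runner() as runner3:
    runner3.run()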
Let me know if you hit any snags; it should be relatively straightforward.

Evin
If you use `with job1.make_runner() as runner1:`, then wouldn't the `firstOutput` directory be cleaned up after leaving the `with` scope? Shouldn't job2 be inside the scope of job1's `with` statement? Reference: https://pythonhosted.org/mrjob/runners-runner.html#mrjob.runner.MRJobRunner.cleanup – Andy Apr 13 '15 at 07:51
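That concern applies when no explicit -o directory is given: the default output then goes to a scratch directory that the runner may clean up on leaving the `with` block. Nesting the second job inside the first runner's scope avoids this; a minimal sketch (same placeholder names as above):

job1 = firstMR(['-r', mode, inputDir, '--no-output'])
job1.set_up_logging()
with job1.make_runner() as runner1:
    runner1.run()
    # runner1's scratch output dir is still alive inside this block
    job2 = secondMR(['-r', mode, runner1.get_output_dir(), '-o', finalOutputDir])
    job2.set_up_logging()
    with job2.make_runner() as runner2:
        runner2.run()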