
I am using Python and have to implement the following scenario using Hadoop Streaming: a) Map1 -> Reduce1 -> Map2 -> Reduce2, b) I don't want to store intermediate files, c) I don't want to install packages like Cascading, Yelp, or Oozie; I have kept them as a last option.

I already went through similar discussions on SO and elsewhere but could not find an answer with respect to Python. Can you please suggest an approach?

Piyush Kansal

2 Answers


b) I don't want to store intermediate files

c) I don't want to install packages like Cascading, Yelp, Oozie.

Any reason why? Based on the response, a better solution could be provided.

Intermediate files cannot be avoided, because the output of the previous Hadoop job cannot be streamed directly as input to the next job. Create a script along these lines:

run streaming job1
if job1 is not successful then exit
run streaming job2
if job2 is successful then remove o/p of job1, else exit
run streaming job3
if job3 is successful then remove o/p of job2, else exit
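Since the question is about Python, the same chaining logic can be driven from a small Python script instead of a shell script. The jar path, HDFS paths, and mapper/reducer script names below are hypothetical placeholders; the `-rmr` flag is the old Hadoop 0.x/1.x form of the delete command, so adjust all of these for your installation:

```python
import subprocess

# Hypothetical jar path -- adjust for your Hadoop installation.
STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

def run_streaming(input_path, output_path, mapper, reducer):
    """Launch one Hadoop Streaming job; True if it exited with status 0."""
    cmd = ["hadoop", "jar", STREAMING_JAR,
           "-input", input_path, "-output", output_path,
           "-mapper", mapper, "-reducer", reducer]
    return subprocess.call(cmd) == 0

def remove_output(path):
    """Delete an HDFS directory (-rmr on Hadoop 0.x/1.x)."""
    subprocess.call(["hadoop", "fs", "-rmr", path])

def chain(jobs, runner=run_streaming, cleaner=remove_output):
    """Run (input, output, mapper, reducer) jobs in order, deleting each
    intermediate output once the next job has consumed it successfully."""
    previous_output = None
    for inp, out, mapper, reducer in jobs:
        if not runner(inp, out, mapper, reducer):
            return False                 # stop the pipeline on failure
        if previous_output is not None:
            cleaner(previous_output)     # safe to drop the earlier o/p now
        previous_output = out
    return True

# Example wiring for the Map1->Reduce1->Map2->Reduce2 scenario:
# chain([("/data/in",  "/tmp/job1", "mapper1.py", "reducer1.py"),
#        ("/tmp/job1", "/tmp/job2", "mapper2.py", "reducer2.py")])
```

Note that the original input and the final output are never deleted; only the intermediate directories are cleaned up, and only after the job consuming them has succeeded.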

Praveen Sripati
  • - Because I have just started learning Hadoop (for a project) and this is how I am supposed to go. - For "job1", "job2", etc., do I need to define some jobs? - And as you mentioned, I will try your approach. But in this case, will just writing a script do? Am I not supposed to use it with a command line like "hadoop *streaming*.jar -input -output -mapper -reducer"? – Piyush Kansal Jan 14 '12 at 08:48

Why not use a MapReduce framework for Python streaming, like Dumbo (https://github.com/klbostee/dumbo/wiki/Short-tutorial) or mrjob (http://packages.python.org/mrjob/)?

For example, with Dumbo, your pipeline would be:

job.additer(Mapper1, Reducer1)
job.additer(Mapper2, Reducer2)
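Conceptually, each such round is just map -> sort-by-key (the "shuffle") -> reduce, with the framework feeding one round's output straight into the next. A plain-Python simulation of two chained rounds makes the data flow concrete (no Hadoop or Dumbo required; the word-count mappers and reducers here are made up for the example):

```python
from itertools import groupby
from operator import itemgetter

def run_round(records, mapper, reducer):
    """Simulate one map/shuffle/reduce round in memory."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))           # the "shuffle" phase
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

# Round 1: count words
def mapper1(line):
    return [(word, 1) for word in line.split()]

def reducer1(word, counts):
    return (word, sum(counts))

# Round 2: group words by their count
def mapper2(pair):
    word, n = pair
    return [(n, word)]

def reducer2(n, words):
    return (n, sorted(words))

lines = ["a b a", "b a"]
round1 = run_round(lines, mapper1, reducer1)    # word counts
round2 = run_round(round1, mapper2, reducer2)   # words keyed by count
```

The second round consumes the first round's output directly as Python objects, which is exactly what the frameworks arrange on the cluster so you never manage the intermediate files yourself.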
user1151446