
I am using Python and have to implement the following scenario using Hadoop Streaming: a) Map1 -> Reduce1 -> Map2 -> Reduce2, b) I don't want to store intermediate files, c) I don't want to install packages like Cascading, Yelp, or Oozie; I have kept them as a last option.

I already went through similar discussions on SO and elsewhere but could not find an answer with respect to Python. Can you please suggest an approach?

Piyush Kansal

2 Answers


b) I don't want to store intermediate files

c) I don't want to install packages like Cascading, Yelp, Oozie.

Any reason why? Based on the response, a better solution could be provided.

Intermediate files cannot be avoided, because the output of the previous Hadoop job cannot be streamed directly as input to the next job. Create a script along these lines:

run streaming job1
if job1 is not successful then exit
run streaming job2
if job2 is successful then remove o/p of job1, else exit
run streaming job3
if job3 is successful then remove o/p of job2, else exit
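Since the question is about Python, the same chaining logic can be driven from a small Python script instead of a shell script. The jar path, HDFS paths, and mapper/reducer script names below are hypothetical placeholders; the `-rmr` flag is the old Hadoop 0.x/1.x form of the delete command, so adjust all of these for your installation:

```python
import subprocess

# Hypothetical jar path -- adjust for your Hadoop installation.
STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

def run_streaming(input_path, output_path, mapper, reducer):
    """Launch one Hadoop Streaming job; True if it exited with status 0."""
    cmd = ["hadoop", "jar", STREAMING_JAR,
           "-input", input_path, "-output", output_path,
           "-mapper", mapper, "-reducer", reducer]
    return subprocess.call(cmd) == 0

def remove_output(path):
    """Delete an HDFS directory (-rmr on Hadoop 0.x/1.x)."""
    subprocess.call(["hadoop", "fs", "-rmr", path])

def chain(jobs, runner=run_streaming, cleaner=remove_output):
    """Run (input, output, mapper, reducer) jobs in order, deleting each
    intermediate output once the next job has consumed it successfully."""
    previous_output = None
    for inp, out, mapper, reducer in jobs:
        if not runner(inp, out, mapper, reducer):
            return False                 # stop the pipeline on failure
        if previous_output is not None:
            cleaner(previous_output)     # safe to drop the earlier o/p now
        previous_output = out
    return True

# Example wiring for the Map1->Reduce1->Map2->Reduce2 scenario:
# chain([("/data/in",  "/tmp/job1", "mapper1.py", "reducer1.py"),
#        ("/tmp/job1", "/tmp/job2", "mapper2.py", "reducer2.py")])
```

Note that the original input and the final output are never deleted; only the intermediate directories are cleaned up, and only after the job consuming them has succeeded.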

Praveen Sripati
  • - Because I have just started learning Hadoop (for a project) and this is how I am supposed to go. - For "job1", "job2", etc., do I need to define some jobs? - And as you mentioned, I will try your approach. But in this case, will just writing a script do? Am I not supposed to use it with a command line like "hadoop *streaming*.jar -input -output -mapper -reducer"? – Piyush Kansal Jan 14 '12 at 08:48

Why not use a MapReduce framework for Python streaming, like Dumbo (https://github.com/klbostee/dumbo/wiki/Short-tutorial) or mrjob (http://packages.python.org/mrjob/)?

For example, with Dumbo, your pipeline would be:

job.additer(Mapper1, Reducer1)
job.additer(Mapper2, Reducer2)
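Conceptually, each such round is just map -> sort-by-key (the "shuffle") -> reduce, with the framework feeding one round's output straight into the next. A plain-Python simulation of two chained rounds makes the data flow concrete (no Hadoop or Dumbo required; the word-count mappers and reducers here are made up for the example):

```python
from itertools import groupby
from operator import itemgetter

def run_round(records, mapper, reducer):
    """Simulate one map/shuffle/reduce round in memory."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))           # the "shuffle" phase
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

# Round 1: count words
def mapper1(line):
    return [(word, 1) for word in line.split()]

def reducer1(word, counts):
    return (word, sum(counts))

# Round 2: group words by their count
def mapper2(pair):
    word, n = pair
    return [(n, word)]

def reducer2(n, words):
    return (n, sorted(words))

lines = ["a b a", "b a"]
round1 = run_round(lines, mapper1, reducer1)    # word counts
round2 = run_round(round1, mapper2, reducer2)   # words keyed by count
```

The second round consumes the first round's output directly as Python objects, which is exactly what the frameworks arrange on the cluster so you never manage the intermediate files yourself.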
user1151446