Python client on OS X streaming to a Hadoop sandbox

Question

I would like to write MapReduce code, ideally in Python, on my Mac and run it as a streaming job on a Hadoop sandbox (e.g. Hortonworks or Cloudera).

Ideally, my development setup would be the Python environment on my Mac plus a Hadoop sandbox VM (later, a cluster on the same network).

While there are many descriptions of how to connect or stream code from a node inside the Hadoop cluster (e.g. from the NameNode), I am unclear on what to do from outside the cluster.

E.g. I assume I need to install some Hadoop client libraries? Where do I get these libraries from?

How do I install them?

What type of Python package works best?

What IP address should I use to stream my Python code?

Any help - and any link to a tutorial covering this - would be great!

UPDATE - My next attempt will use the WebHDFS API, which seems suitable for what I need. — Enzo, Jan 07 '14 at 12:44
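For anyone following the WebHDFS route mentioned in the update, here is a minimal sketch of what a call from the Mac might look like, using only the Python standard library. The host name `sandbox.example.local`, the user `enzo`, and the port 50070 are assumptions for illustration; the NameNode HTTP port depends on the Hadoop version and sandbox configuration, and WebHDFS has to be enabled on the cluster. The helper names `webhdfs_url` and `list_status` are hypothetical, not part of any library.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(host, path, op, port=50070, user=None):
    """Build a WebHDFS REST URL of the form http://host:port/webhdfs/v1<path>?op=..."""
    params = {"op": op}
    if user:
        # Simple (non-Kerberos) authentication passes the user as a query parameter.
        params["user.name"] = user
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


def list_status(host, path, **kwargs):
    """Return the directory listing for `path`; needs a reachable NameNode."""
    with urlopen(webhdfs_url(host, path, "LISTSTATUS", **kwargs)) as response:
        return json.load(response)["FileStatuses"]["FileStatus"]


# Assumed values: replace with the sandbox's real host name or IP and your user.
example_url = webhdfs_url("sandbox.example.local", "/user/enzo", "LISTSTATUS", user="enzo")
```

Only `webhdfs_url` runs without a cluster; `list_status` performs a real HTTP request, so it will fail unless the VM's NameNode port is reachable from the Mac.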

score 0 · Answer 1 · answered Dec 30 '13 at 04:18

You are correct that you need to install client libraries to submit jobs.

Unfortunately, submitting your streaming jobs from OS X probably isn't the best choice. I say that because there aren't any vendor-supported packages for OS X, so it isn't the easiest platform to install Hadoop onto, at least in a vendor-supported way. If you are already going to have the sandbox set up, just write the job on your Mac and submit it from inside the VM.
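To illustrate the "write on the Mac, submit in the VM" workflow, here is a minimal word-count sketch for Hadoop Streaming that can be tested locally before it is copied to the sandbox. The file name `wordcount.py` and the `map`/`reduce` command-line arguments are my own conventions for this example, not anything Hadoop requires; Streaming only expects the mapper and reducer to read lines from stdin and write tab-separated key/value lines to stdout.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: run as `wordcount.py map` or `wordcount.py reduce`.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

On the Mac you can simulate the shuffle with a plain sort, e.g. `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, and only move to the VM once the output looks right.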

There are options for installing it if you have to, though. You can use Homebrew, although I'm not sure which version will be installed or whether vendor-specific formulas are available. You could also download and build Hadoop yourself, for example using the Cloudera tarballs here. Once you have the client set up, you will have to configure mapred-site.xml, core-site.xml, and hdfs-site.xml to talk to the cluster running inside your sandbox VM.
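As a rough sketch of what that configuration amounts to, the snippet below generates a minimal core-site.xml pointing the client at the sandbox's NameNode. The host name `sandbox.example.local` and port 8020 are assumptions; use whatever address and port your VM actually exposes. `fs.defaultFS` is the Hadoop 2 property name (older releases call it `fs.default.name`), and `build_site_xml` is a hypothetical helper written for this example. mapred-site.xml and hdfs-site.xml use the same `<configuration>`/`<property>` layout with their own properties, which are easiest to copy from the sandbox itself.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Return a Hadoop *-site.xml document for the given name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumption: the sandbox NameNode is reachable as sandbox.example.local:8020.
core_site = build_site_xml({"fs.defaultFS": "hdfs://sandbox.example.local:8020"})
```

Writing `core_site` to a file named core-site.xml in the client's configuration directory is the step that tells the local `hadoop` command where the cluster lives.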

My understanding, say in the case of Cloudera, was that you get the client files as described here: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.5.3/Cloudera-Manager-Enterprise-Edition-User-Guide/cmeeug_topic_5_9.html - Is there any tutorial that explains how to configure mapred-site.xml etc.? — Enzo, Jan 03 '14 at 10:30
