What is the difference between a Hadoop Streaming job and a regular Java job? Is there any advantage to using Hadoop Streaming over the latter?

One more thing: I am using the mapreduce API (i.e., the new API), and I heard that streaming is only available with the deprecated mapred API. Is that true?

BenMorel
Tom Sebastian
  • http://stackoverflow.com/questions/1217850/streaming-data-and-hadoop-not-hadoop-streaming?rq=1 and http://stackoverflow.com/questions/7598422/is-it-better-to-use-the-mapred-or-the-mapreduce-package-to-create-a-hadoop-job?rq=1 – Eel Lee Oct 30 '13 at 11:51
  • Please try google.com before posting for quick answers. – Praveen Sripati Oct 30 '13 at 15:30

1 Answer

Hadoop Streaming is advantageous when the developer does not have much Java know-how and can write the mapper/reducer faster in a scripting language.
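To make this concrete, here is a minimal word-count sketch of what a streaming mapper and reducer could look like in Python. It relies only on the streaming contract: records arrive as lines on stdin, leave as tab-separated key/value lines on stdout, and the reducer sees its input sorted by key. The function names and the `map`/`reduce` command-line argument are assumptions made for this illustration, not part of Hadoop.

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: an illustrative sketch, not a tuned job."""
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum the counts per word. Input must already be sorted by key,
    which is how streaming delivers records to a reducer."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: run as `wordcount.py map` or `wordcount.py reduce`.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = {"map": map_lines, "reduce": reduce_lines}[sys.argv[1]]
    sys.stdout.writelines(record + "\n" for record in step(sys.stdin))
```

Saved as, say, `wordcount.py`, the script can be checked locally with `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce` before it is handed to the streaming jar through its `-mapper` and `-reducer` options.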

Compared to custom JAR jobs, a streaming job also has the additional overhead of starting a scripting-language (Python/Ruby/Perl) interpreter, and every record has to pass between the Hadoop task and that process. This inter-process communication makes streaming jobs less efficient in most cases.

Hadoop Streaming also brings restrictions on input/output formats. When you need custom input/output formats, custom JARs are the natural choice. With Java you can also override or extend much of Hadoop's functionality to suit your needs.

Quoting from an answer here:

Hadoop does have the capability to work with MR jobs created in other languages; it is called streaming. This model only allows us to define the mapper and reducer, with some restrictions not present in Java. At the same time, input/output formats and other plugins do have to be written as Java classes. So I would define the decision making as follows:

  • Use Java, unless you have a serious codebase you need to reuse in your MR job.
  • Consider using Python when you need to create some simple ad hoc jobs.

As for streaming being available only with the mapred API, that concern doesn't really apply. With streaming, the mappers/reducers are written in other languages, so there is no point worrying about which API Hadoop uses internally to execute them.

Community
Amar
  • Where could I find more details? Some links would be helpful. – Tom Sebastian Oct 30 '13 at 12:00
  • Consider using Google, it is a great website. Here is the first hit if you type "hadoop streaming": http://hadoop.apache.org/docs/r1.1.2/streaming.html – DDW Oct 30 '13 at 12:02
  • Even this answer would help: http://stackoverflow.com/questions/6873077/streaming-or-custom-jar-in-hadoop/6889756#6889756 – Amar Oct 30 '13 at 12:06
  • I encountered a problem with streaming and the mapred API when I tried to create a custom input format. When I extended TextInputFormat from the new API, I got an exception saying MyCustomeInputFormat is not mapred.TextInputFormat. I changed it to use the old API and the problem was solved. Not sure if it will create any compatibility issues. – sunitha Jan 09 '17 at 08:01