Hadoop InputFormat for Hadoop Streaming: the Wikihadoop InputFormat

Question

I wonder whether there is any difference between InputFormats for Hadoop and InputFormats for Hadoop Streaming. Do InputFormats written for Hadoop Streaming also work for regular Hadoop jobs, and vice versa? I am asking because I found a special InputFormat for Wikipedia dump files, the Wikihadoop InputFormat, and its documentation says it is an InputFormat for Hadoop Streaming. Why only for Hadoop Streaming, and not for Hadoop in general?

Bests

score 0 · Answer 1 · answered Jun 14 '13 at 15:51

As far as I know, there is no difference in how inputs are processed between Hadoop streaming jobs and regular MapReduce jobs written in Java.

The inheritance tree for StreamWikiDumpInputFormat is...

* InputFormat
  * FileInputFormat
    * KeyValueTextInputFormat
      * StreamWikiDumpInputFormat

And since it ultimately implements InputFormat, it can be used in regular MapReduce jobs.

Mike Park


But why does their wiki page say: "This software provides an InputFormat for Hadoop Streaming Interface that processes Wikipedia bzip2 XML dumps in a streaming manner"? – user2426139 Jun 14 '13 at 15:59

I see no implication that it is streaming *only*. I just see an InputFormat that was written by someone who is only interested in the streaming part of Hadoop, and so he describes it using streaming terms. – Mike Park Jun 14 '13 at 16:11

If I'm wrong, you'll probably know as soon as you try to use it the first time – Mike Park Jun 14 '13 at 17:33

score 0 · Answer 2 · answered Jun 15 '13 at 01:01

No. The type of MR job (streaming or Java) is not the criterion for using (or developing) an InputFormat. An InputFormat is just an InputFormat and will work for both streaming and Java MR jobs. It is the type of data you are going to process that determines which InputFormat you use (or develop). Hadoop natively provides several InputFormats, which are normally sufficient to handle your needs. But sometimes your data is in such a state that none of these InputFormats can handle it.

Having said that, it is still possible to process that data using MR, and this is where you end up writing your own custom InputFormat, like the one you mentioned above.

I don't know why they have emphasized Hadoop Streaming so much. It's just a Java class that does everything an InputFormat should do and implements everything that makes it eligible to do so. @climbage has made a very valid point about this. So it can be used with any MR job, streaming or Java.

score 0 · Answer 3 · answered Jan 18 '16 at 14:49

There is no difference between ordinary input formats and ones developed for Hadoop Streaming.

When the author says that the format was developed for Hadoop Streaming, the only thing she means is that her input format produces objects with meaningful toString methods. That's it.

For example, when I develop an input format for use with Hadoop Streaming, I try to avoid BytesWritable and use Text instead.
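To see why readable toString output matters, it helps to look at it from the mapper's side. Below is a minimal Python sketch of a streaming mapper. It assumes the usual Streaming convention that each record reaches the mapper on stdin as one line of text, with key and value separated by a tab. The function names and the "emit the value length" logic are hypothetical, chosen only for illustration, and whether the key is passed at all depends on the InputFormat and the job configuration.

```python
import sys


def parse_record(line):
    """Split one input line into (key, value) at the first tab.

    Assumption: the framework has already turned the key and value
    objects into text, so the mapper never sees Writable objects.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def map_records(lines):
    """Yield 'key<TAB>value length' for every non-empty input line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, value = parse_record(line)
        yield f"{key}\t{len(value)}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for use as a streaming mapper script.

    In a real mapper script, call main() at the bottom of the file.
    """
    for out in map_records(stdin):
        stdout.write(out + "\n")
```

If the value type's text form were not meaningful (raw binary, for instance), the string handed to `parse_record` would be of little use to a script like this, which is presumably why Text is the more convenient choice.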
