
I want to write a MapReduce application that can process both text and ZIP files. For this I want to use two different input formats, one for text and another for ZIP. Is it possible to do so?

j0k
aa8y
  • I believe Hadoop can seamlessly read both text and gzip files given together as input. Have you tried this out? – Amar Jan 15 '13 at 10:49
  • Hadoop can seamlessly process text and gzip files, but not zip files (which are also not splittable). – Charles Menguy Jan 15 '13 at 16:48

3 Answers


Extending a bit from @ChrisWhite's answer: what you need is a custom InputFormat and RecordReader that work with ZIP files. You can find a sample ZipFileInputFormat here and a sample ZipFileRecordReader here.

Given this, as Chris suggested, you should use MultipleInputs. Here is how I would do it if you don't need custom mappers for each type of file:

MultipleInputs.addInputPath(job, new Path("/path/to/zip"), ZipFileInputFormat.class);
MultipleInputs.addInputPath(job, new Path("/path/to/txt"), TextInputFormat.class);
Charles Menguy
  • +1 nice link to an implementation. Always amazes me that these things aren't already part of the Hadoop base – Chris White Jan 15 '13 at 17:29
  • @ChrisWhite Yeah that also surprises me, actually I found that there seems to be a Jira for this that's been sitting for a while... https://issues.apache.org/jira/browse/MAPREDUCE-210 – Charles Menguy Jan 15 '13 at 18:11
  • The implementation looks foolproof, but when I tried to add the input path for the ZipFileInputFormat class, I received the error "The method addInputPath(JobConf, Path, Class<? extends InputFormat>) in the type MultipleInputs is not applicable for the arguments (JobConf, Path, Class)". If I am not wrong, the license of this class specifies that I can use it, but only without any modifications. So how do I handle this? – aa8y Jan 16 '13 at 05:25
  • @Expressions_Galore The `ZipFileInputFormat` is written using the new Hadoop API, and similarly you have `MultipleInputs` for both the old API and new API, so make sure you are using the `MultipleInputs` from the new API as well, that should fix it. – Charles Menguy Jan 16 '13 at 05:34
  • I am using Cloudera Hadoop (CDH4) and it seems like it does not fully support the new API, so my entire MapReduce code is written using the older mapred API. Just to clarify, the license does say that I can't modify the source, right? Because otherwise I could write my own classes using it as a reference. – aa8y Jan 16 '13 at 05:36
  • I'm pretty sure CDH4 supports the new API; I'm still using CDH3 and working with the new API. As a general rule I would recommend writing your jobs using the new API `org.apache.hadoop.mapreduce` instead of the old API `org.apache.hadoop.mapred`, especially since you're using Cloudera's version. – Charles Menguy Jan 16 '13 at 05:39
  • @Expressions_Galore Just to make sure we're on the same page, I'm not talking about using YARN (also called MRv2), but about using the new API `org.apache.hadoop.mapreduce` of MRv1, which should definitely be supported. – Charles Menguy Jan 16 '13 at 05:51
  • Charles Menguy: This is the error I was talking about: "Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected". After searching on the internet I found that it's an issue with Cloudera's version of Hadoop which will be fixed in subsequent updates. – aa8y Jan 16 '13 at 08:00
  • @Expressions_Galore at which point do you get this error? I didn't have any issue compiling with cdh3u5 – Charles Menguy Jan 17 '13 at 03:45
  • @Charles Menguy I should have mentioned it yesterday: I got the solution: http://stackoverflow.com/questions/14354309/handling-error-found-interface-org-apache-hadoop-mapreduce-taskattemptcontext I have just compiled it though; I'll accept the answer as soon as I successfully test it. Can you come up on chat though? – aa8y Jan 17 '13 at 04:01
  • @Expressions_Galore Nice, glad to see you are able to run this now :) – Charles Menguy Jan 17 '13 at 04:03

Look at the API docs for MultipleInputs (old API, new API). They're not hugely self-explanatory, but you should be able to see that you call the addInputPath methods on your job configuration, passing the input path (which can be a glob), the input format, and an associated mapper.

You should be able to Google for some examples; in fact, here's a SO question / answer that shows some usage. A rough sketch of the old-API calls follows.
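
To illustrate, here is roughly what the old mapred API calls look like inside a driver. The ZIP input format and both mapper classes are placeholders (note that the ZIP sample from the other answer targets the new API, so an old-API job would need an old-API equivalent):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

JobConf conf = new JobConf(MyDriver.class); // placeholder driver class
// Each path gets its own input format and its own mapper.
MultipleInputs.addInputPath(conf, new Path("/path/to/zip"),
        OldApiZipInputFormat.class, ZipEntryMapper.class);  // placeholder classes
MultipleInputs.addInputPath(conf, new Path("/path/to/txt"),
        TextInputFormat.class, TextLineMapper.class);       // placeholder mapper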

Chris White

Consider writing a custom InputFormat that checks what kind of input is being read and then, based on that check, delegates to the appropriate InputFormat. A rough sketch of this idea is below.
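
The following is a minimal sketch of that idea against the new mapreduce API. ZipRecordReader is hypothetical: it would have to iterate over ZIP entries and emit the same LongWritable/Text pairs as LineRecordReader (the samples in the accepted answer could be adapted for this):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class DelegatingZipTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        Path file = ((FileSplit) split).getPath();
        if (file.getName().toLowerCase().endsWith(".zip")) {
            // Hypothetical reader that walks ZIP entries and emits
            // LongWritable/Text records, one per line of each entry.
            return new ZipRecordReader();
        }
        return new LineRecordReader(); // standard line-oriented reader for text files
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // ZIP archives cannot be split mid-file; read each archive whole.
        return !file.getName().toLowerCase().endsWith(".zip");
    }
}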

RadAl