MapReduce with filename as Key, contents as Values, many small files

Question

I've looked at FileInputFormat where filename is KEY and text contents are VALUE, How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?, and Getting Filename/FileData as key/value input for Map when running a Hadoop MapReduce Job, but I'm having trouble getting off the ground. Not having done anything with Hadoop before, I'm wary of starting down the wrong path if someone else can see that I'm making a mistake.

I have a directory containing something like 100K small files containing HTML, and I want to create an inverted index using Amazon Elastic MapReduce, implemented in Java. Once I have the file contents, I know what I want my map and reduce functions to do.

After looking here, my understanding is I need to subclass FileInputFormat and override isSplitable. However, my filenames are related to the URLs from which the HTML came, so I want to keep them. Is replacing NullWritable with Text all I need to do? Any other advice?

score 0 · Answer 1 · answered Dec 07 '15 at 08:47

You should use WholeFileInputFormat to pass the whole file to your mapper

conf.setInputFormat(WholeFileInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path("input"));
FileOutputFormat.setOutputPath(conf,new Path("output"));

MapReduce with filename as Key, contents as Values, many small files

1 Answers1