
I've got an HDFS structure something like

a/b/file1.gz
a/b/file2.gz
a/c/file3.gz
a/c/file4.gz

I'm using the classic pattern of

FileInputFormat.addInputPaths(conf, args[0]);

to set the input path for a Java MapReduce job.

This works fine if I specify args[0] as a/b, but it fails if I specify just a (my intention being to process all four files).

The error is:

Exception in thread "main" java.io.IOException: Not a file: hdfs://host:9000/user/hadoop/a

How do I recursively add everything under a?

I must be missing something simple...

– mat kelcey

2 Answers


As Eitan Illuz mentioned here, Hadoop 2.4.0 introduced a mapreduce.input.fileinputformat.input.dir.recursive configuration property that, when set to true, instructs the input format to include files recursively.

In Java code it looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Enable recursive input listing before the Job is created from the configuration.
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);
Job job = Job.getInstance(conf);
// etc.

I've been using this new property and find that it works well.

EDIT: Better yet, use the new setInputDirRecursive method on FileInputFormat, which achieves the same result:

Job job = Job.getInstance();
// FileInputFormat here is org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
FileInputFormat.setInputDirRecursive(job, true);
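
For context, here is where that call sits in a full driver. This is a minimal sketch; the class name RecursiveInputJob and the map-only identity setup are illustrative assumptions, not from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecursiveInputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "recursive-input");
        job.setJarByClass(RecursiveInputJob.class);

        // Map-only job with the default identity Mapper, just to exercise the input side.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Pick up files at any depth under args[0], e.g. a/b/file1.gz and a/c/file3.gz.
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}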
– Josh Hansen
  • conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true); worked for me in Hadoop 2.2.0 also – Alex Jul 07 '16 at 17:06

This is a bug in the current version of Hadoop. Here is the JIRA tracking it; it's still open. Either apply the changes to the code and build the binaries yourself, or wait for the fix in a coming release. Recursive processing of files can be turned on or off; check the patch attached to the JIRA for more details. In the meantime, a manual workaround is sketched below.
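
Until the fix lands, a common workaround is to walk the directory tree yourself with the FileSystem API and add each plain file as an input path, so FileInputFormat never sees a directory. A minimal sketch (the helper name addInputPathsRecursively is mine, not part of Hadoop):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecursiveInput {
    // Hypothetical helper: descend from root and register every plain file.
    public static void addInputPathsRecursively(Job job, Path root) throws IOException {
        FileSystem fs = root.getFileSystem(job.getConfiguration());
        for (FileStatus status : fs.listStatus(root)) {
            if (status.isDirectory()) {
                addInputPathsRecursively(job, status.getPath());
            } else {
                FileInputFormat.addInputPath(job, status.getPath());
            }
        }
    }
}

Called as addInputPathsRecursively(job, new Path(args[0])), this adds file1.gz through file4.gz individually instead of passing the directory a itself.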

– Praveen Sripati