
My S3 directory is

/sssssss/xxxxxx/rrrrrr/xx/file1
/sssssss/xxxxxx/rrrrrr/xx/file2
/sssssss/xxxxxx/rrrrrr/xx/file3
/sssssss/xxxxxx/rrrrrr/yy/file4
/sssssss/xxxxxx/rrrrrr/yy/file5
/sssssss/xxxxxx/rrrrrr/yy/file6

How can my MapReduce program read these files on S3?

Bill Bell
llxlf

2 Answers


For one input path you do the following:

FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/xx/"));

For two input paths, you do the following:

FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/xx/"));
FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/yy/"));

or use addInputPaths(), which takes a comma-separated list of paths. See the documentation of FileInputFormat (it varies slightly depending on your version of Hadoop) for more details.
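Putting this together, a minimal driver sketch, assuming the newer org.apache.hadoop.mapreduce API; the bucket name mybucket, the job name, and the s3a:// scheme are placeholders for illustration, not from the question (older clusters may need s3:// or s3n:// instead):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiDirJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "read two S3 directories");
        job.setJarByClass(MultiDirJob.class);

        // Register each input directory explicitly; every file directly
        // under each registered path becomes part of the job's input.
        FileInputFormat.addInputPath(job, new Path("s3a://mybucket/sssssss/xxxxxx/rrrrrr/xx/"));
        FileInputFormat.addInputPath(job, new Path("s3a://mybucket/sssssss/xxxxxx/rrrrrr/yy/"));

        // Equivalently, addInputPaths() accepts one comma-separated string
        // naming both directories in a single call.

        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is job configuration only; mapper/reducer setup is omitted, and it needs a Hadoop installation with the S3 filesystem connector on the classpath to run.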

vefthym

This can be simplified as follows:

FileInputFormat.setInputDirRecursive(job, true);
FileInputFormat.addInputPaths(job, args[0]);

You just need to give the base path of the S3 directory rather than the exact location of each file; with the recursive flag set, Hadoop descends into every subdirectory until it reaches the files.
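The effect of the recursive flag can be illustrated without a Hadoop cluster. The sketch below mimics it in plain Java on a local temp directory laid out like the question's S3 tree (xx/file1..3, yy/file4..6); the class and directory names here are invented for the demonstration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecursiveListing {
    // Collect every regular file under the base directory, the way
    // FileInputFormat gathers input splits when setInputDirRecursive
    // is enabled: one base path in, all leaf files out.
    static List<Path> listFilesRecursively(Path base) throws IOException {
        try (Stream<Path> walk = Files.walk(base)) {
            return walk.filter(Files::isRegularFile)
                       .sorted()
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a local stand-in for /sssssss/xxxxxx/rrrrrr/{xx,yy}.
        Path base = Files.createTempDirectory("rrrrrr");
        Files.createDirectory(base.resolve("xx"));
        Files.createDirectory(base.resolve("yy"));
        for (int i = 1; i <= 3; i++) {
            Files.createFile(base.resolve("xx").resolve("file" + i));
            Files.createFile(base.resolve("yy").resolve("file" + (i + 3)));
        }
        // A single base path is enough to reach all six files.
        System.out.println(listFilesRecursively(base).size());
    }
}
```

Running it prints 6: the walk from the single base directory finds all six files, which is exactly why the recursive answer needs only args[0] rather than one path per subdirectory.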

Deepan Ram