I am trying to run a MapReduce job on my cluster that runs only on files with a specific extension. We have a bunch of heterogeneous data sitting on the cluster, and for this particular job I only want to process the .jpg files. Is there a way this can be done without restricting it in the mapper? It seems like it should be something easy to do when you execute the job. I'm thinking something like hadoop fs JobName /users/myuser/data/*.jpg /users/myuser/output.
1 Answer
Your example should work as written, but you'll want to check that you're calling the input format's setInputPaths(Job, String) method, as this will resolve the glob string "/users/myuser/data/*.jpg" into the individual .jpg files in /users/myuser/data.
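For reference, a minimal new-API (`org.apache.hadoop.mapreduce`) driver sketch showing that usage; the class name `JpgOnlyJob` and the commented-out `MyJpgMapper` are placeholder names, not anything from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JpgOnlyJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "jpg-only");
        job.setJarByClass(JpgOnlyJob.class);

        // Plug in your own mapper/reducer classes here, e.g.:
        // job.setMapperClass(MyJpgMapper.class);

        // The glob is expanded by FileInputFormat when it lists the input
        // files, so only paths ending in .jpg under /users/myuser/data
        // become map inputs; no filtering is needed inside the mapper.
        FileInputFormat.setInputPaths(job, "/users/myuser/data/*.jpg");
        FileOutputFormat.setOutputPath(job, new Path("/users/myuser/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```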

– Chris White
- You are correct; is there a way to make this recursive, though? I want to run it from the root of HDFS. – Matt E May 01 '12 at 13:58
- Looks like this is a common problem and has been patched. Here is a link: [Using FileInputFormat.addInputPaths to recursively add an HDFS path](http://stackoverflow.com/questions/8114579/using-fileinputformat-addinputpaths-to-recursively-add-hdfs-path). Thanks for answering my original question! – Matt E May 01 '12 at 14:57
- If you have a fixed number of directories, you can recurse: `/users/myuser/data/*/*/*.jpg` will match all .jpg files two directories deep under `/users/myuser/data/`. As you point out, though, variable-depth globbing (such as `/users/myuser/data/**/*.jpg`) isn't supported yet. – Chris White May 01 '12 at 18:52
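A sketch of the fixed-depth workaround described in the last comment, assuming the data sits at most two directory levels below `/users/myuser/data` (the paths are illustrative and reuse the `job` from the driver above):

```java
// One glob per directory level, passed as a comma-separated string.
// FileInputFormat expands each glob against HDFS, so this picks up .jpg
// files directly in the data dir and up to two levels below it.
FileInputFormat.setInputPaths(job,
        "/users/myuser/data/*.jpg,"
      + "/users/myuser/data/*/*.jpg,"
      + "/users/myuser/data/*/*/*.jpg");
```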