
In my job, I need to parse many historical log sets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example:

  • logs/Customer_One/2011-01-02-001
  • logs/Customer_One/2012-02-03-001
  • logs/Customer_One/2012-02-03-002
  • logs/Customer_Two/2009-03-03-001
  • logs/Customer_Two/2009-03-03-002

Each individual log set may itself be five or six levels deep and contain thousands of files.

Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing problem!
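
For concreteness, the kind of mapper I have in mind looks roughly like this (a simplified sketch, not my actual mapper.sh): it reads directory paths from stdin instead of file contents, walks each directory itself, and emits one record per log line.

#!/bin/sh
# Sketch only: expects one directory path per input line, walks the tree
# itself with find, and emits "file<TAB>line" records for the reducer.
while read dir; do
  find "$dir" -type f | while read logfile; do
    awk -v f="$logfile" '{ print f "\t" $0 }' "$logfile"
  done
done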

Unfortunately, when I try passing a directory containing only log subdirectories to Hadoop, it complains that I can't pass those subdirectories to my mapper. (Again, I have written my mapper to accept subdirectories as input):

$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .

[ . . . ]

12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$

Is there a straightforward way to convince Hadoop-streaming to permit me to assign directories as work items?

Jon Lasser
  • wildcarding / globs should work, try `-input file:///mnt/logs/Customer_*/**/*.log` – Chris White Apr 10 '12 at 20:32
  • Globbing isn't the answer: First, it would find only files at a given level in the directory tree rather than at multiple levels; second, as I originally described, the number of directories and subdirectories is enormous (in fact, well beyond the realm of a shell to expand without xargs), and the time to walk that tree is exactly the part of the problem that I want distributed. (Just performing the globbing you're talking about would take days, literally, with 1 ms rtt.) – Jon Lasser Apr 10 '12 at 20:56
  • For a moment I had some recollection that Hadoop supported recursive globs with the double-star (**) notation, but a quick test in my console says otherwise – Chris White Apr 10 '12 at 21:16

2 Answers

2

I guess you need to investigate writing a custom InputFormat to which you can pass the root directory. It will create a split for each customer, and then the record reader for each split will do the directory walk and push the file contents to your mappers.
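
For example, with such a format compiled into a jar, the streaming job could then be launched along these lines (a sketch only: the jar name customer-inputformat.jar and the class name CustomerDirInputFormat are hypothetical, and the InputFormat itself still has to be written to emit one split per customer directory):

# Generic options such as -libjars must come before the streaming options;
# the jar may also need to be on HADOOP_CLASSPATH on the client side.
hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" \
  -libjars customer-inputformat.jar \
  -inputformat CustomerDirInputFormat \
  -input file:///mnt/logs/ \
  -file mapper.sh -mapper "mapper.sh" \
  -file reducer.sh -reducer "reducer.sh" \
  -output .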

Chris White
  • It's not clear to me that hadoop-streaming can accept any other inputformats. Can it? – Jon Lasser Apr 11 '12 at 20:51
  • http://hadoop.apache.org/mapreduce/docs/r0.21.0/streaming.html - see the `-inputformat` parameter – Chris White Apr 11 '12 at 21:04
  • Yep. But if I change it to a different existing InputFormat (e.g., org.apache.hadoop.mapred.KeyValueTextInputFormat) it still complains about "not a file." – Jon Lasser Apr 12 '12 at 00:44
  • I'm saying that you will need to write a custom InputFormat; a pre-canned Hadoop one doesn't exist for your use case. If you didn't have varying levels of directory structure and thousands of nested files and directories, then maybe you could use an existing one – Chris White Apr 12 '12 at 01:27
2

Hadoop supports wildcard (glob) patterns in input paths. I haven't experimented with a lot of complex patterns, but the simple placeholders ? and * do work.

So in your case, I think it will work if you have the following as your input path:

file:///mnt/logs/Customer_Name/*/*

The last asterisk might not be needed, as all the files in the final directory are automatically added to the input path.
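
For instance, plugging that pattern into the streaming command from the question would look something like this (a sketch; the single quotes keep the local shell from touching the pattern, so Hadoop resolves the glob itself):

hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" \
  -input 'file:///mnt/logs/Customer_Name/*/*' \
  -file mapper.sh -mapper "mapper.sh" \
  -file reducer.sh -reducer "reducer.sh" \
  -output .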

Amar
  • May I know why this was down-voted? This is indeed a sweet and simple way to pass a directory as your input path; you just need to know the depth beforehand. I have used it successfully many times. – Amar Jan 02 '14 at 11:04
  • It doesn't work. It finds only files at a given level. – Liton Aug 21 '16 at 19:46