
In my job, I need to parse many historical log sets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example:

  • logs/Customer_One/2011-01-02-001
  • logs/Customer_One/2012-02-03-001
  • logs/Customer_One/2012-02-03-002
  • logs/Customer_Two/2009-03-03-001
  • logs/Customer_Two/2009-03-03-002

Each individual log set may itself be five or six levels deep and contain thousands of files.

Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing problem!
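
For concreteness, the kind of mapper I have in mind looks roughly like this (a simplified sketch, not my actual mapper.sh): it reads directory paths from stdin instead of file contents, walks each directory itself, and emits one record per log line.

#!/bin/sh
# Sketch only: expects one directory path per input line, walks the tree
# itself with find, and emits "file<TAB>line" records for the reducer.
while read dir; do
  find "$dir" -type f | while read logfile; do
    awk -v f="$logfile" '{ print f "\t" $0 }' "$logfile"
  done
done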

Unfortunately, when I try passing a directory containing only log subdirectories to Hadoop, it complains that I can't pass those subdirectories to my mapper. (Again, I have written my mapper to accept subdirectories as input):

$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .

[ . . . ]

12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$

Is there a straightforward way to convince Hadoop-streaming to permit me to assign directories as work items?

Jon Lasser
  • wildcarding / globs should work, try `-input file:///mnt/logs/Customer_*/**/*.log` – Chris White Apr 10 '12 at 20:32
  • Globbing isn't the answer: First, it would find only files at a given level in the directory tree rather than at multiple levels; second, as I originally described, the number of directories and subdirectories is enormous (in fact, well beyond the realm of a shell to expand without xargs), and the time to walk that tree is exactly the part of the problem that I want distributed. (Just performing the globbing you're talking about would take days, literally, with 1 ms rtt.) – Jon Lasser Apr 10 '12 at 20:56
  • For a moment I had some recollection that Hadoop supported recursive globs with the double-star (**) notation, but a quick test in my console says otherwise – Chris White Apr 10 '12 at 21:16

2 Answers

2

I guess you need to investigate writing a custom InputFormat to which you can pass the root directory. It will create a split for each customer, and then the record reader for each split will do the directory walk and push the file contents to your mappers.
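
For example, with such a format compiled into a jar, the streaming job could then be launched along these lines (a sketch only: the jar name customer-inputformat.jar and the class name CustomerDirInputFormat are hypothetical, and the InputFormat itself still has to be written to emit one split per customer directory):

# Generic options such as -libjars must come before the streaming options;
# the jar may also need to be on HADOOP_CLASSPATH on the client side.
hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" \
  -libjars customer-inputformat.jar \
  -inputformat CustomerDirInputFormat \
  -input file:///mnt/logs/ \
  -file mapper.sh -mapper "mapper.sh" \
  -file reducer.sh -reducer "reducer.sh" \
  -output .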

Chris White
  • It's not clear to me that hadoop-streaming can accept any other inputformats. Can it? – Jon Lasser Apr 11 '12 at 20:51
  • http://hadoop.apache.org/mapreduce/docs/r0.21.0/streaming.html - see the `-inputformat` parameter – Chris White Apr 11 '12 at 21:04
  • Yep. But if I change it to a different existing InputFormat (e.g., org.apache.hadoop.mapred.KeyValueTextInputFormat) it still complains about "not a file." – Jon Lasser Apr 12 '12 at 00:44
  • I'm saying that you will need to write a custom InputFormat; a pre-canned Hadoop one doesn't exist for your use case. If you didn't have varying levels of directory structure and thousands of nested files and directories, then maybe you could use an existing one – Chris White Apr 12 '12 at 01:27
2

Hadoop supports wildcard (glob) patterns in input paths. I haven't experimented with a lot of complex patterns, but the simple placeholders ? and * do work.

So in your case, I think it will work if you have the following as your input path:

file:///mnt/logs/Customer_Name/*/*

The last asterisk might not be needed, as all the files in the final directory are automatically added to the input path.
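
For instance, plugging that pattern into the streaming command from the question would look something like this (a sketch; the single quotes keep the local shell from touching the pattern, so Hadoop resolves the glob itself):

hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" \
  -input 'file:///mnt/logs/Customer_Name/*/*' \
  -file mapper.sh -mapper "mapper.sh" \
  -file reducer.sh -reducer "reducer.sh" \
  -output .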

Amar
  • May I know why this was down-voted? This is indeed a sweet and simple way to pass a directory as your input path; you just need to know the depth beforehand. I have used it successfully many times. – Amar Jan 02 '14 at 11:04
  • It doesn't work. It finds only files at a given level. – Liton Aug 21 '16 at 19:46