I have a huge log file that contains directory paths in one of its columns. For instance:

/
/a 
/a/b
/a/b/e
/d
/d/f
/e

There are no duplicate lines in the log.

My question is: using Pig, how do I count the number of sub-directories under each directory (at any depth), counting each sub-directory only once? For the example above, the desired result would look like this:

/ 6
/a 2
/a/b 1
/a/b/e 0
/d 1
/d/f 0
/e 0
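
To pin down the expected numbers, here is a minimal sketch in plain Python (not Pig) that reproduces the table above from the sample paths. It assumes every directory appears exactly once in the input, as stated; the helper names `proper_ancestors` and `count_subdirectories` are made up for illustration.

```python
def proper_ancestors(path):
    """Yield every ancestor directory of `path`, excluding `path` itself."""
    if path == "/":
        return  # the root has no ancestors
    parts = path.strip("/").split("/")
    yield "/"
    for i in range(1, len(parts)):
        yield "/" + "/".join(parts[:i])


def count_subdirectories(paths):
    """Map each directory to the number of directories nested below it."""
    counts = {p: 0 for p in paths}  # leaf directories keep a count of 0
    for p in set(paths):  # set() guards against accidental duplicates
        for ancestor in proper_ancestors(p):
            # Each path contributes exactly 1 to each of its ancestors.
            counts[ancestor] = counts.get(ancestor, 0) + 1
    return counts


paths = ["/", "/a", "/a/b", "/a/b/e", "/d", "/d/f", "/e"]
result = count_subdirectories(paths)
for directory in sorted(result):
    print(directory, result[directory])
```

The shape here is "emit one (ancestor, 1) pair per path, then group by ancestor and count". In Pig this would presumably map onto a FLATTEN followed by GROUP and COUNT, though expanding a path into its ancestors may need a UDF; that part is an assumption, not something tested.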

My approach was to first split each path into its prefixes and assign each prefix a depth value, i.e. how many levels the full path lies below that prefix. For instance, /a/b is turned into 3 new records:

/ 2
/a 1 
/a/b 0 

Then I tried to group identical prefixes and sum the depth values in each group. However, the results are inaccurate: a prefix is emitted once for every record beneath it, so the same sub-directory ends up being counted more than once. How can I achieve the desired output? Any help would be appreciated. Thank you.
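
The overcount can be reproduced outside Pig. The following is a rough Python sketch of the split-and-sum approach described above, run on the sample paths; `split_with_depth` and `sum_depths` are illustrative names, not part of any Pig script.

```python
def split_with_depth(path):
    """Return (prefix, depth) records for a path, e.g. /a/b -> (/, 2), (/a, 1), (/a/b, 0)."""
    parts = [p for p in path.split("/") if p]
    prefixes = ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    depth = len(parts)
    return [(prefix, depth - i) for i, prefix in enumerate(prefixes)]


def sum_depths(paths):
    """Group the records by prefix and sum their depth values."""
    totals = {}
    for path in paths:
        for prefix, d in split_with_depth(path):
            totals[prefix] = totals.get(prefix, 0) + d
    return totals


paths = ["/", "/a", "/a/b", "/a/b/e", "/d", "/d/f", "/e"]
totals = sum_depths(paths)
print(totals["/"], totals["/a"])  # 10 and 3, where 6 and 2 are wanted
```

For / the sum is 0+1+2+3+1+2+1 = 10 instead of 6, because the record for /a/b/e contributes 3 to / even though /a and /a/b have already been counted through their own records.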

Shane R