I have a huge log file in which one of the columns contains directory paths. For instance:
/
/a
/a/b
/a/b/e
/d
/d/f
/e
There are no duplicate lines in the log.
My question is: using Pig, how do I count the number of sub-directories (at any depth) under each directory without counting the same directory more than once? For the example above, the desired result would be something like the following:
/ 6
/a 2
/a/b 1
/a/b/e 0
/d 1
/d/f 0
/e 0
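To pin down what I mean by the desired result, here is a small sketch in plain Python (not Pig) of the counting rule as I understand it: every path adds exactly 1 to each of its proper ancestors. The function names `ancestors` and `count_subdirs` are made up for illustration, and the sketch assumes the paths are absolute, '/'-separated, and unique.

```python
def ancestors(path):
    """Yield every proper ancestor of an absolute path, root first."""
    if path == "/":
        return  # the root has no ancestors
    parts = path.strip("/").split("/")
    yield "/"
    for i in range(1, len(parts)):
        yield "/" + "/".join(parts[:i])


def count_subdirs(paths):
    """Map each directory to the number of logged directories below it."""
    counts = {p: 0 for p in paths}  # every logged path appears, even with 0
    for p in paths:
        for a in ancestors(p):
            # Each path contributes exactly 1 to each ancestor.
            # An ancestor missing from the log would still get an entry here.
            counts[a] = counts.get(a, 0) + 1
    return counts


log = ["/", "/a", "/a/b", "/a/b/e", "/d", "/d/f", "/e"]
for directory, n in sorted(count_subdirs(log).items()):
    print(directory, n)
```

On the sample log this prints the seven lines listed above. What I cannot work out is how to express the same thing in Pig.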
My approach was to first split each path into its prefixes and assign each prefix the corresponding directory depth value. For instance, /a/b would be expanded into 3 new records:
/ 2
/a 1
/a/b 0
Then I tried to group identical paths and sum the depth values in each group. However, the results are inaccurate: they do not account for the fact that, once each record is split, the same path gets counted more than once. How can I achieve the desired output? Any help would be much appreciated. Thank you.
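For reference, this is roughly what my attempt does, written as plain Python rather than Pig so the logic is easy to follow. The names `split_with_depth` and `sum_depths` are only illustrative, and my real script may differ in details.

```python
def split_with_depth(path):
    """Expand a path into (prefix, depth) records.

    Example: /a/b -> [("/", 2), ("/a", 1), ("/a/b", 0)]
    """
    parts = [] if path == "/" else path.strip("/").split("/")
    n = len(parts)
    records = [("/", n)]
    for i in range(1, n + 1):
        records.append(("/" + "/".join(parts[:i]), n - i))
    return records


def sum_depths(paths):
    """Group the expanded records by prefix and sum their depth values."""
    totals = {}
    for p in paths:
        for prefix, depth in split_with_depth(p):
            totals[prefix] = totals.get(prefix, 0) + depth
    return totals


log = ["/", "/a", "/a/b", "/a/b/e", "/d", "/d/f", "/e"]
totals = sum_depths(log)
print(totals["/"])   # 10, but the desired count is 6
print(totals["/a"])  # 3, but the desired count is 2
```

The totals are too high for any directory that has descendants more than one level down, which is the over-counting I am trying to avoid.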