Imagine I have a logfile full of lines:
"a,b,c", whereas these are variables that can have any value, but re-occurances of the values do happen and that is what this analysis will be about.
First Step
Map all 'c' URLs where 'a' equals a specific domain, e.g. "stackoverflow.com", and 'c' is a URL like "stackoverflow.com/test/user/". I have a regex written that accomplishes this.
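For illustration, a minimal sketch of what such a regex could look like, assuming the log lines really are plain comma-separated "a,b,c" triples (my actual pattern is the FULL_URL_WHERE_DOMAIN_EQUALS placeholder in the job further down):

import re

# Hypothetical stand-in: keep lines whose first field is the target domain
# and capture the full URL from the third field.
DOMAIN = "stackoverflow.com"
FULL_URL_WHERE_DOMAIN_EQUALS = re.compile(
    r"^" + re.escape(DOMAIN) + r",[^,]*,(" + re.escape(DOMAIN) + r"[^\s,]*)"
)

line = "stackoverflow.com,Firefox 33,stackoverflow.com/test/user/"
print(FULL_URL_WHERE_DOMAIN_EQUALS.findall(line))
# ['stackoverflow.com/test/user/']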
Second Step
Count (reduce) the mapped c's (URLs), so that I end up with a list of total counts per URL. This works fine.
Third Step
(not implemented yet and the topic of this question)
Look up all b's (browser names) for each counted URL from step 2 and give back a relational list, e.g. a dictionary ADT or JSON, that looks as follows:
[
    {
        "url": "Stackoverflow.com/login",
        "count": 200.654,
        "browsers": [
            "Firefox 33",
            "IE 7",
            "Opera"
        ]
    },
    {..},
    {..}
]
I was thinking of introducing a combiner in my code (see below), or chaining steps. But the real question here is: how can my job flow be optimised so that I only have to run through all log lines once?
MapReduce Job (mrjob)
from mrjob.job import MRJob
from mrjob.step import MRStep

# My compiled pattern that matches full URLs of the target domain.
FULL_URL_WHERE_DOMAIN_EQUALS = mySuperCoolRegex

class MRReferralAnalysis(MRJob):

    def mapper(self, _, line):
        # Emit each matching URL once per occurrence.
        for group in FULL_URL_WHERE_DOMAIN_EQUALS.findall(line):
            yield (group, 1)

    def reducer(self, itemOfInterest, counts):
        # Total occurrences per URL.
        yield (sum(counts), itemOfInterest)

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer)
        ]

if __name__ == '__main__':
    MRReferralAnalysis.run()
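For reference, this is roughly the combiner variant I had in mind for the job above. As far as I can tell it only pre-sums the per-URL counts inside each mapper and does not get me the browsers; the regex below is just a hypothetical stand-in for mySuperCoolRegex:

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# Hypothetical stand-in for my real pattern (domain URLs with a path).
FULL_URL_WHERE_DOMAIN_EQUALS = re.compile(r"stackoverflow\.com/[^\s,]*")

class MRReferralAnalysisWithCombiner(MRJob):

    def mapper(self, _, line):
        for url in FULL_URL_WHERE_DOMAIN_EQUALS.findall(line):
            yield (url, 1)

    def combiner(self, url, counts):
        # Pre-aggregate within each mapper to reduce shuffle traffic.
        yield (url, sum(counts))

    def reducer(self, url, counts):
        yield (sum(counts), url)

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer)
        ]

if __name__ == '__main__':
    MRReferralAnalysisWithCombiner.run()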
Wrap up
In pseudo code, this is what I want to achieve (although this naive version runs over the logs once per item):
LOGS_1 -> MAPREDUCE OVER SOME_CRITERIA -> LIST_1
FOR EVERY ITEM IN LIST_1:
LOGS_1 -> MAPREDUCE OVER ITEM_CRITERIA -> LIST_2
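What I want to avoid is that second traversal of LOGS_1 for every item. My current idea is a single job whose mapper emits the browser together with the URL, so that the reducer can sum the counts and collect the browsers in one pass. Below is a rough sketch of what I imagine, again assuming plain comma-separated "a,b,c" (domain, browser, URL) lines instead of my real regex; I do not know whether this is idiomatic mrjob, which is part of the question:

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRReferralBrowserAnalysis(MRJob):

    DOMAIN = "stackoverflow.com"

    def mapper(self, _, line):
        # Assumption: "a,b,c" = domain,browser,url; my real job uses a regex here.
        parts = line.split(",", 2)
        if len(parts) == 3 and parts[0] == self.DOMAIN:
            domain, browser, url = parts
            yield (url, (1, browser))

    def reducer(self, url, values):
        # Sum the counts and collect the distinct browsers in the same pass.
        total = 0
        browsers = set()
        for count, browser in values:
            total += count
            browsers.add(browser)
        yield (url, {"count": total, "browsers": sorted(browsers)})

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer)
        ]

if __name__ == '__main__':
    MRReferralBrowserAnalysis.run()

Would something along these lines be the right direction, or is there a better way to structure the steps so that the logs are only scanned once?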