Imagine I have a logfile full of lines:
"a,b,c", whereas these are variables that can have any value, but re-occurances of the values do happen and that is what this analysis will be about.
First Step
Map all 'c' URLs where 'a' equals a specific domain, e.g. "stackoverflow.com", and 'c' is a URL like "stackoverflow.com/test/user/". I have a regex written that accomplishes this.
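For illustration, a minimal sketch of what such a regex could look like, assuming the log lines really are plain comma-separated "a,b,c" triples (my actual pattern is the FULL_URL_WHERE_DOMAIN_EQUALS placeholder in the job further down):

import re

# Hypothetical stand-in: keep lines whose first field is the target domain
# and capture the full URL from the third field.
DOMAIN = "stackoverflow.com"
FULL_URL_WHERE_DOMAIN_EQUALS = re.compile(
    r"^" + re.escape(DOMAIN) + r",[^,]*,(" + re.escape(DOMAIN) + r"[^\s,]*)"
)

line = "stackoverflow.com,Firefox 33,stackoverflow.com/test/user/"
print(FULL_URL_WHERE_DOMAIN_EQUALS.findall(line))
# ['stackoverflow.com/test/user/']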
Second Step
Count (reduce) the mapped c's (URLs), so that I end up with a list of total counts per URL. This works fine.
Third Step
(not implemented yet and the topic of this question)
Look up all b's (browser names) for each counted URL from step 2 and give back a relational list, e.g. a dictionary ADT or JSON, that looks as follows:
[
    {
        "url": "Stackoverflow.com/login",
        "count": 200.654,
        "browsers": [
            "Firefox 33",
            "IE 7",
            "Opera"
        ]
    },
    {..},
    {..}
]
I was thinking of introducing a combiner in my code (see below), or chaining steps. But the real question here is: how can my job flow be optimised so that I only have to run through all log lines once?
MapReduce Job (mrjob)
from mrjob.job import MRJob
from mrjob.step import MRStep

# My compiled pattern that matches full URLs of the target domain.
FULL_URL_WHERE_DOMAIN_EQUALS = mySuperCoolRegex

class MRReferralAnalysis(MRJob):

    def mapper(self, _, line):
        # Emit each matching URL once per occurrence.
        for group in FULL_URL_WHERE_DOMAIN_EQUALS.findall(line):
            yield (group, 1)

    def reducer(self, itemOfInterest, counts):
        # Total occurrences per URL.
        yield (sum(counts), itemOfInterest)

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer)
        ]

if __name__ == '__main__':
    MRReferralAnalysis.run()
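For reference, this is roughly the combiner variant I had in mind for the job above. As far as I can tell it only pre-sums the per-URL counts inside each mapper and does not get me the browsers; the regex below is just a hypothetical stand-in for mySuperCoolRegex:

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# Hypothetical stand-in for my real pattern (domain URLs with a path).
FULL_URL_WHERE_DOMAIN_EQUALS = re.compile(r"stackoverflow\.com/[^\s,]*")

class MRReferralAnalysisWithCombiner(MRJob):

    def mapper(self, _, line):
        for url in FULL_URL_WHERE_DOMAIN_EQUALS.findall(line):
            yield (url, 1)

    def combiner(self, url, counts):
        # Pre-aggregate within each mapper to reduce shuffle traffic.
        yield (url, sum(counts))

    def reducer(self, url, counts):
        yield (sum(counts), url)

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer)
        ]

if __name__ == '__main__':
    MRReferralAnalysisWithCombiner.run()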
Wrap up
In pseudo code, this is what I want to achieve (although this naive version runs over the logs once per item):
LOGS_1 -> MAPREDUCE OVER SOME_CRITERIA -> LIST_1
FOR EVERY ITEM IN LIST_1:
LOGS_1 -> MAPREDUCE OVER ITEM_CRITERIA -> LIST_2
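What I want to avoid is that second traversal of LOGS_1 for every item. My current idea is a single job whose mapper emits the browser together with the URL, so that the reducer can sum the counts and collect the browsers in one pass. Below is a rough sketch of what I imagine, again assuming plain comma-separated "a,b,c" (domain, browser, URL) lines instead of my real regex; I do not know whether this is idiomatic mrjob, which is part of the question:

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRReferralBrowserAnalysis(MRJob):

    DOMAIN = "stackoverflow.com"

    def mapper(self, _, line):
        # Assumption: "a,b,c" = domain,browser,url; my real job uses a regex here.
        parts = line.split(",", 2)
        if len(parts) == 3 and parts[0] == self.DOMAIN:
            domain, browser, url = parts
            yield (url, (1, browser))

    def reducer(self, url, values):
        # Sum the counts and collect the distinct browsers in the same pass.
        total = 0
        browsers = set()
        for count, browser in values:
            total += count
            browsers.add(browser)
        yield (url, {"count": total, "browsers": sorted(browsers)})

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer)
        ]

if __name__ == '__main__':
    MRReferralBrowserAnalysis.run()

Would something along these lines be the right direction, or is there a better way to structure the steps so that the logs are only scanned once?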