Optimise RegEx Time complexity of mrJob Script containing regex

Asked Oct 23 '14 at 12:28

Active Oct 23 '14 at 12:28

Viewed 55 times

How could this MapReduce job (mrjob) be optimised?

This is the script I am using now; any idea how to optimise it? I am using a lookahead to search for ur=www.domain.de and then mapping and counting the r2 occurrences.

from mrjob.job import MRJob
from mrjob.step import MRStep

import re

LOOK_AHEAD = re.compile(r"(?=.*?(?:^|&)ur=www\.domain\.com(?:&|$)).*?(?:^|&)r2=([^&]+)")

class MRReferralAnalysis(MRJob):

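    def mapper(self, _, line):
        # Emit (r2 value, 1) for every r2 match on a line that also has the ur= field.
        for group in LOOK_AHEAD.findall(line):
            yield (group, 1)

    def reducer(self, itemOfInterest, counts):
        # Sum the counts for each r2 value; the total is emitted as the key.
        yield (sum(counts), itemOfInterest)
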
    def steps(self):
        return [
            MRStep( mapper=self.mapper,
                    reducer=self.reducer)
        ]

if __name__ == '__main__':
    MRReferralAnalysis.run()
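
For reference, here is a minimal sketch of what the pattern does and of one lookahead-free alternative. The sample lines are hypothetical (no real input was posted), and the helper names r2_with_lookahead and r2_with_split are made up for illustration. As far as I can tell, on a line without the ur= field findall retries the lookahead from every start position, and each attempt can scan to the end of the line, so the cost grows roughly quadratically with line length. Splitting on & visits each field once. This assumes the line is a plain &-separated list of key=value fields with no trailing newline.

```python
import re

# Pattern copied from the script in the question.
LOOK_AHEAD = re.compile(
    r"(?=.*?(?:^|&)ur=www\.domain\.com(?:&|$)).*?(?:^|&)r2=([^&]+)"
)

# Hypothetical input lines; the real log format is an assumption.
SAMPLE_LINES = [
    "ur=www.domain.com&r2=foo&x=1",
    "r2=foo&ur=www.domain.com",
    "ur=www.other.com&r2=foo",
]


def r2_with_lookahead(line):
    """Return the r2 values found by the original lookahead pattern."""
    return LOOK_AHEAD.findall(line)


def r2_with_split(line):
    """Return the r2 values using a single split instead of a lookahead."""
    fields = line.split("&")
    # Only lines carrying the exact ur= field are of interest.
    if "ur=www.domain.com" not in fields:
        return []
    # Keep non-empty r2 values, mirroring the regex's [^&]+ requirement.
    return [f[3:] for f in fields if f.startswith("r2=") and len(f) > 3]


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(line, r2_with_lookahead(line), r2_with_split(line))
```

The two helpers agree on the sample lines above but are not guaranteed to be equivalent: the split version returns every non-empty r2 field on a matching line, while the regex can miss later r2 fields once the search position has moved past the ur= field. Whether that difference matters depends on the real data.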

asked Oct 23 '14 at 12:28

Stephan Kristyn

Could you show several example lines? – Casimir et Hippolyte Oct 23 '14 at 12:35
The lines in the variable `line`. In other words, the strings that are supposed to match your pattern (or not). – Casimir et Hippolyte Oct 23 '14 at 17:03

0 Answers