I have three versions of a simple text parser (it validates a login):

import re

rgx = re.compile(r"^[a-zA-Z][a-zA-Z0-9.-]{0,18}[a-zA-Z0-9]$")
def rchecker(login):
    return bool(rgx.match(login))

max_len = 20
def occhecker(login):
    length_counter = max_len
    for c in login:
        o = ord(c)
        if length_counter == max_len:
            if not (o > 96 and o < 123) and \
               not (o > 64 and o < 91): return False
        if length_counter == 0: return False

        # reject any character that is
        # not a digit,
        # not a lowercase letter,
        # not an uppercase letter,
        # and not a minus or dot
        if not (o > 47 and o < 58) and \
           not (o > 96 and o < 123) and \
           not (o > 64 and o < 91) and \
           o != 45 and o != 46: return False
        length_counter -= 1

    if length_counter < max_len:
        o = ord(c)
        if not (o > 47 and o < 58) and \
           not (o > 96 and o < 123) and \
           not (o > 64 and o < 91): return False
        else: return True
    else: return False


import string

correct_end = string.ascii_letters + string.digits
correct_symbols = correct_end + "-."
def cchecker(login):
    length_counter = max_len

    for c in login:
        if length_counter == max_len and c not in string.ascii_letters:
            return False
        if length_counter == 0:
            return False
        if c not in correct_symbols:
            return False
        length_counter -= 1

    if length_counter < max_len and c in correct_end:
        return True
    else:
        return False

All three functions do the same job: they check a few rules for a login. I think the rules are clear from the regex. I ran cProfile benchmarks on these functions with 280,000 logins and got results I can't explain.
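To make the rules concrete, here is a minimal sketch that runs the regex version against a few sample logins. The sample strings are hypothetical and chosen only to exercise each rule; they are not taken from the benchmark data.

```python
import re

# Same pattern as above: a leading letter, up to 18 characters from
# [a-zA-Z0-9.-], then a final letter or digit (2 to 20 characters total).
rgx = re.compile(r"^[a-zA-Z][a-zA-Z0-9.-]{0,18}[a-zA-Z0-9]$")

def rchecker(login):
    return bool(rgx.match(login))

# Hypothetical sample logins mapped to the result the pattern should give.
samples = {
    "john.doe-99": True,   # letter first, letter or digit last
    "9lives": False,       # must start with a letter
    "john.": False,        # must not end with a dot or minus
    "a" * 21: False,       # longer than 20 characters
    "": False,             # empty
    "a": False,            # the pattern needs at least two characters
}

for login, expected in samples.items():
    assert rchecker(login) == expected
```

The last case is worth noting: as written, the two loop versions appear to accept a one-character login such as "a", while the regex rejects it, so the three functions are not strictly equivalent on that input.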

with regex

     560001 function calls in 1.202 seconds

 Ordered by: standard name

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 280000    0.680    0.000    1.202    0.000 logineffcheck.py:10(rchecker)
      1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 280000    0.522    0.000    0.522    0.000 {method 'match' of '_sre.SRE_Pattern' objects}

with ord

    3450737 function calls in 8.599 seconds

 Ordered by: standard name

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 280000    5.802    0.000    8.599    0.000 logineffcheck.py:14(occhecker)
      1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
3170736    2.797    0.000    2.797    0.000 {ord}

with the in operator

    280001 function calls in 1.709 seconds

 Ordered by: standard name

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 280000    1.709    0.000    1.709    0.000 logineffcheck.py:52(cchecker)
      1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

I created 100k well-formed logins, 60k logins with Cyrillic letters, 60k logins of length 24 instead of 20, and 60k logins of length 0, for 280k in total. Why is the regex so much faster than a simple loop with ord?
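For anyone who wants to reproduce this, a data set with the same mix could be built roughly as follows. This is only a sketch: the helper name make_logins, the Cyrillic alphabet string, the length of the Cyrillic logins, and the 100x smaller sizes are all assumptions, not the exact generator used for the numbers above.

```python
import random
import re
import string

rgx = re.compile(r"^[a-zA-Z][a-zA-Z0-9.-]{0,18}[a-zA-Z0-9]$")

def make_logins(n_valid, n_cyrillic, n_long, n_empty, seed=0):
    """Build a list of logins in the four categories described above."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    body = string.ascii_letters + string.digits + "-."
    end = string.ascii_letters + string.digits
    cyrillic = "абвгдежзиклмнопрстуфхцчшщэюя"  # assumed alphabet

    def well_formed(length):
        # Letter first, letter or digit last, anything allowed in between.
        first = rng.choice(string.ascii_letters)
        middle = "".join(rng.choice(body) for _ in range(length - 2))
        return first + middle + rng.choice(end)

    logins = [well_formed(rng.randint(2, 20)) for _ in range(n_valid)]
    logins += ["".join(rng.choice(cyrillic) for _ in range(10))
               for _ in range(n_cyrillic)]
    logins += [well_formed(24) for _ in range(n_long)]  # too long by 4
    logins += [""] * n_empty
    return logins

# Same proportions as 100k / 60k / 60k / 60k, scaled down by 100.
logins = make_logins(1000, 600, 600, 600)
valid_count = sum(1 for s in logins if rgx.match(s))
print(len(logins), valid_count)
```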


1 Answer

The simple answer is that regular expressions are fast: the other methods run a lot of pure Python code, while the re module is implemented in optimized C. Also, much of the legwork is done when the regex is compiled, and that cost is not included in the profile.
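One caveat about the measurement itself: cProfile records every function call, including calls to builtins such as ord (over three million of them in the profile above), and that bookkeeping adds overhead to call-heavy code. A sketch of a comparison with timeit, which does not instrument individual calls, might look like this. The sample logins and repeat counts are arbitrary assumptions, and the timings will vary by machine and Python version, so none are shown.

```python
import re
import string
import timeit

rgx = re.compile(r"^[a-zA-Z][a-zA-Z0-9.-]{0,18}[a-zA-Z0-9]$")

def rchecker(login):
    return bool(rgx.match(login))

max_len = 20
correct_end = string.ascii_letters + string.digits
correct_symbols = correct_end + "-."

def cchecker(login):
    # Pure Python loop from the question, unchanged in behavior.
    length_counter = max_len
    for c in login:
        if length_counter == max_len and c not in string.ascii_letters:
            return False
        if length_counter == 0:
            return False
        if c not in correct_symbols:
            return False
        length_counter -= 1
    return length_counter < max_len and c in correct_end

# Hypothetical mix: valid, too long, empty, and Cyrillic logins.
logins = ["john.doe-99", "x" * 24, "", "иван"] * 1000

def run(checker):
    return [checker(s) for s in logins]

for checker in (rchecker, cchecker):
    # Take the best of three repeats to reduce noise from other processes.
    seconds = min(timeit.repeat(lambda: run(checker), number=5, repeat=3))
    print(f"{checker.__name__}: {seconds:.3f} s")
```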

To dig further, use the dis module, which shows the Python bytecode each function compiles to:

>>> import dis
>>> dis.dis(rchecker)
  4           0 LOAD_GLOBAL              0 (bool)
              3 LOAD_GLOBAL              1 (rgx)
              6 LOAD_ATTR                2 (match)
              9 LOAD_FAST                0 (login)
             12 CALL_FUNCTION            1
             15 CALL_FUNCTION            1
             18 RETURN_VALUE



>>> dis.dis(occhecker)
  8           0 LOAD_GLOBAL              0 (max_len)
              3 STORE_FAST               1 (length_counter)

  9           6 SETUP_LOOP             224 (to 233)
              9 LOAD_FAST                0 (login)
             12 GET_ITER
        >>   13 FOR_ITER               216 (to 232)
             16 STORE_FAST               2 (c)

   ....  OUTPUT TRUNCATED, BUT THERE ARE MANY OPCODES ....

 32     >>  343 LOAD_GLOBAL              2 (False)
            346 RETURN_VALUE
        >>  347 LOAD_CONST               0 (None)
            350 RETURN_VALUE
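The listings above can be summarized by counting instructions instead of reading them. The sketch below assumes Python 3.4 or later, where dis.get_instructions is available (the listings above come from an older interpreter, so exact opcodes and counts will differ), and it uses the in-operator checker from the question as the pure Python example. The helper name instruction_count is made up for illustration.

```python
import dis
import re
import string

rgx = re.compile(r"^[a-zA-Z][a-zA-Z0-9.-]{0,18}[a-zA-Z0-9]$")

def rchecker(login):
    return bool(rgx.match(login))

max_len = 20
correct_end = string.ascii_letters + string.digits
correct_symbols = correct_end + "-."

def cchecker(login):
    length_counter = max_len
    for c in login:
        if length_counter == max_len and c not in string.ascii_letters:
            return False
        if length_counter == 0:
            return False
        if c not in correct_symbols:
            return False
        length_counter -= 1
    return length_counter < max_len and c in correct_end

def instruction_count(func):
    """Number of bytecode instructions in the compiled function body."""
    return sum(1 for _ in dis.get_instructions(func))

for func in (rchecker, cchecker):
    print(func.__name__, instruction_count(func))
```

These counts are static. At run time the loop body in cchecker is interpreted once per character, while rchecker executes its handful of instructions once per login and hands the rest to the C matcher.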