I've been working on a project that manages big lists of words and passes them through a lot of tests to validate or reject each word in the list. The funny thing is that every time I've used "faster" tools like the itertools module, they seem to be slower.

I finally decided to ask because I may be doing something wrong. The following code tests the performance of the any() function against an explicit loop.

#!/usr/bin/python3
#

import time
from unicodedata import normalize


file_path='./tests'


start=time.time()
with open(file_path, encoding='utf-8', mode='rt') as f:
    tests_list=f.read()
print('File reading done in {} seconds'.format(time.time() - start))

start=time.time()
tests_list=[line.strip() for line in normalize('NFC',tests_list).splitlines()]
print('String formalization, and list strip done in {} seconds'.format(time.time()-start))
print('{} strings'.format(len(tests_list)))


unallowed_combinations=['ab','ac','ad','ae','af','ag','ah','ai','af','ax',
                        'ae','rt','rz','bt','du','iz','ip','uy','io','ik',
                        'il','iw','ww','wp']


def combination_is_valid(string):
    if any(combination in string for combination in unallowed_combinations):
        return False

    return True


def combination_is_valid2(string):
    for combination in unallowed_combinations:
        if combination in string:
            return False

    return True


print('Testing the performance of any()')

start=time.time()
for string in tests_list:
    combination_is_valid(string)
print('combination_is_valid ended in {} seconds'.format(time.time()-start))


start=time.time()
for string in tests_list:
    combination_is_valid2(string)
print('combination_is_valid2 ended in {} seconds'.format(time.time()-start))  

The previous code is fairly representative of the kind of tests I do. Here are the results of two runs:

File reading done in 0.22988605499267578 seconds
String formalization, and list strip done in 6.803032875061035 seconds
38709922 strings
Testing the performance of any()
combination_is_valid ended in 80.74802565574646 seconds
combination_is_valid2 ended in 41.69514226913452 seconds


File reading done in 0.24268722534179688 seconds
String formalization, and list strip done in 6.720442771911621 seconds
38709922 strings
Testing the performance of any()
combination_is_valid ended in 79.05265760421753 seconds
combination_is_valid2 ended in 42.24800777435303 seconds

I find it kind of amazing that the explicit loop takes about half the time of any(). What is the explanation for that? Am I doing something wrong?

(I used Python 3.4 under GNU/Linux)

2 Answers

Actually, the any() function is equivalent to the following function:

def any(iterable):
    for element in iterable:
        if element:
            return True
    return False

which is like your second function. But since any() already returns a boolean, you don't need to check its result and then return a new value. So part of the difference in performance is because you are using a redundant if condition and return, and also calling any() inside another function.

So the advantage of any() here is that you don't need to wrap it in extra logic, because it does all the work for you.
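
A minimal sketch of that idea, reusing the function name from the question: returning `not any(...)` directly drops the extra `if`/`return` pair. The list is shortened here for illustration only, and the generator expression is still present, so this does not remove the per-element overhead discussed below.

```python
# Shortened list, for illustration only (not the full list from the question).
unallowed_combinations = ['ab', 'rt', 'ww']


def combination_is_valid(string):
    # any() already produces a bool, so negate it and return it directly.
    return not any(combination in string
                   for combination in unallowed_combinations)


print(combination_is_valid('hello'))  # no forbidden pair in this string
print(combination_is_valid('cab'))    # contains 'ab'
```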

Also, as @interjay mentioned in a comment, the most important reason, which I missed, seems to be that you are passing a generator expression to any(). A generator expression doesn't produce all of its results at once; it produces them on demand, and that costs extra work for every element.

From PEP 0289 -- Generator Expressions:

The semantics of a generator expression are equivalent to creating an anonymous generator function and calling it. For example:

g = (x**2 for x in range(10))
print g.next()

is equivalent to:

def __gen(exp):
    for x in exp:
        yield x**2
g = __gen(iter(range(10)))
print g.next()

So, as you can see, each time any() needs the next item, Python has to resume the generator through its next method, which adds overhead for every element. The net result is that using any() in such cases is overkill.
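
One way to observe this overhead on your own machine is a small `timeit` comparison. This is only a sketch: the list is shortened, the sample string is made up, and the absolute timings will vary by interpreter and hardware.

```python
import timeit

# Shortened list and a made-up sample string, both for illustration only.
unallowed_combinations = ['ab', 'ac', 'rt', 'ww']
sample = 'hello python'  # contains none of the forbidden pairs


def with_any(string):
    # Every element is pulled out of a generator object by any().
    return not any(c in string for c in unallowed_combinations)


def with_loop(string):
    # Same logic, but the loop runs directly in this function's frame.
    for c in unallowed_combinations:
        if c in string:
            return False
    return True


t_any = timeit.timeit(lambda: with_any(sample), number=100000)
t_loop = timeit.timeit(lambda: with_loop(sample), number=100000)
print('any() + generator: {:.3f} s'.format(t_any))
print('explicit loop:     {:.3f} s'.format(t_loop))
```

Both functions return the same result for every input; only the way the elements are iterated differs.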

Mazdak
  • The effects of a single additional `if` would be negligible compared to the loop. A much bigger difference is that the `any` version uses a generator expression. – interjay Nov 10 '15 at 16:05
  • @interjay It's not because of the `if`; as I said, it's because of the extra function call and the condition. The generator expression would also be a reason, but not much of one. – Mazdak Nov 10 '15 at 16:10
  • The loop does a lot of work, looking for multiple substrings. It's very unlikely that an additional function call and `if` would double the time it takes. – interjay Nov 10 '15 at 16:12
  • @interjay Yep, I see. Thanks for your attention; I just updated the answer with your hint. – Mazdak Nov 10 '15 at 16:21
  • By using the generator expression, `any` has to use one additional function call _per element_. So I think @interjay is right here. – Tali Nov 10 '15 at 16:38
  • Switching to a list comprehension might save a little overhead, but it'll also stop the `any` evaluation from short-circuiting. It might not be a net positive. If the speed difference is important, the explicit loop will be fastest. – user2357112 Nov 11 '15 at 18:44
  • @user2357112 Yep, considering everything `any()` is meant for, it's overkill here. Thanks for mentioning that. – Mazdak Nov 11 '15 at 18:56
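
The short-circuiting point raised in the comments can be checked with a small sketch. The helper `count_checks` and the sample data are hypothetical, introduced only to count how many elements get tested in each case.

```python
def count_checks(use_list, data):
    """Return how many elements are tested before any() has its answer."""
    seen = []

    def check(x):
        seen.append(x)  # record every element that actually gets tested
        return x > 0

    if use_list:
        # The list comprehension tests every element before any() starts.
        any([check(x) for x in data])
    else:
        # The generator expression stops at the first truthy result.
        any(check(x) for x in data)
    return len(seen)


data = [1, 2, 3, 4, 5]  # made-up data; the first element is already truthy
gen_calls = count_checks(False, data)
list_calls = count_checks(True, data)
print(gen_calls, list_calls)  # prints "1 5"
```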

Since your true question is answered, I'll take a shot at the implied question:

You can get a free speed boost by just doing unallowed_combinations = sorted(set(unallowed_combinations)), since it contains duplicates.
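
A quick check of the duplicates, using the list exactly as it appears in the question:

```python
unallowed_combinations = ['ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'af', 'ax',
                          'ae', 'rt', 'rz', 'bt', 'du', 'iz', 'ip', 'uy', 'io', 'ik',
                          'il', 'iw', 'ww', 'wp']

# set() drops the repeated entries; sorted() gives back a list in a stable order.
deduplicated = sorted(set(unallowed_combinations))

print(len(unallowed_combinations), len(deduplicated))  # prints "24 22"
```

Two entries ('af' and 'ae') appear twice, so each string is tested against two fewer substrings.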

Given that, the fastest way I know of doing this is

import re

# One alternation pattern that matches any of the forbidden substrings.
valid3_re = re.compile("|".join(map(re.escape, unallowed_combinations)))

def combination_is_valid3(string):
    return not valid3_re.search(string)
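
As a sanity check, here is a self-contained sketch that compares the regex version against the explicit loop from the question. The sample strings are made up for illustration.

```python
import re

unallowed_combinations = sorted(set([
    'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'af', 'ax',
    'ae', 'rt', 'rz', 'bt', 'du', 'iz', 'ip', 'uy', 'io', 'ik',
    'il', 'iw', 'ww', 'wp']))

# re.escape keeps any special characters in the substrings literal.
valid3_re = re.compile("|".join(map(re.escape, unallowed_combinations)))


def combination_is_valid2(string):
    # Explicit loop, as in the question.
    for combination in unallowed_combinations:
        if combination in string:
            return False
    return True


def combination_is_valid3(string):
    # Regex version: a single search over the string.
    return not valid3_re.search(string)


samples = ['hello', 'cab', 'xyz', 'wwp', '']  # made-up test strings
results = [(s, combination_is_valid2(s), combination_is_valid3(s))
           for s in samples]
for s, loop_result, regex_result in results:
    print(repr(s), loop_result, regex_result)
```

The two functions should agree on every input; they differ only in how the search is carried out.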

With CPython 3.5 I get, for some test data with a line length of 60 characters,

combination_is_valid ended in 3.3051061630249023 seconds
combination_is_valid2 ended in 2.216959238052368 seconds
combination_is_valid3 ended in 1.4767844676971436 seconds

where the third is the regex version, and on PyPy3 I get

combination_is_valid ended in 2.2926249504089355 seconds
combination_is_valid2 ended in 2.0935239791870117 seconds
combination_is_valid3 ended in 0.14300894737243652 seconds

FWIW, this is competitive with Rust (a low-level language, like C++) and actually noticeably wins out on the regex side. Shorter strings favour PyPy over CPython a lot more (e.g. 4x CPython for a line length of 10), since overhead is more important then.

Since only about a third of CPython's regex runtime is loop overhead, we conclude that PyPy's regex implementation is better optimized for this use-case. I'd recommend looking to see if there is a CPython regex implementation that makes this competitive with PyPy.

Veedrac
  • The duplicated values in the unallowed_combinations list were a mistake I made when typing the test, but thanks so much for your answer! My benchmark was... `22.354313850402832 seconds` – Nov 13 '15 at 23:03