
I've been fooling around with Problem 7 from Project Euler and I noticed that two of my prime-finding methods are very similar but run at very different speeds.

#!/usr/bin/env python3

import timeit

def lazySieve(num_primes):
    if num_primes == 0: return []
    primes = [2]
    test = 3
    while len(primes) < num_primes:
        if all(test % p != 0 for p in primes[1:]):  # I figured this would be faster
            primes.append(test)
        test += 2
    return primes

def betterLazySieve(num_primes):
    if num_primes == 0: return []
    primes = [2]
    test = 3
    while len(primes) < num_primes:
        for p in primes[1:]: # and this would be slower
            if test % p == 0: break
        else:
            primes.append(test)
        test += 2
    return primes

if __name__ == "__main__":

    ls_time  = timeit.repeat("lazySieve(10001)",
                             setup="from __main__ import lazySieve",
                             repeat=10,
                             number=1)
    bls_time = timeit.repeat("betterLazySieve(10001)",
                             setup="from __main__ import betterLazySieve",
                             repeat=10,
                             number=1)

    print("lazySieve runtime:       {}".format(min(ls_time)))
    print("betterLazySieve runtime: {}".format(min(bls_time)))

This runs with the following output:

lazySieve runtime:       4.931611961917952
betterLazySieve runtime: 3.7906006319681183

And unlike this question, I don't simply want the returned value of `any`/`all`.

Is the overhead of `all()` really so great that it rules out its use in all but the most niche of cases? Is the `for`-`else` break somehow faster than the short-circuiting `all()`?

What do you think?

Edit: Added the square root loop termination check suggested by Reblochon Masque.

Update: ShadowRanger's answer was correct.

After changing

all(test % p != 0 for p in primes[1:])

to

all(map(test.__mod__, primes[1:]))

I recorded the following decrease in runtime:

lazySieve runtime:       3.5917471940629184
betterLazySieve runtime: 3.7998314710566774

Edit: Removed Reblochon's speed up to keep the question clear. Sorry man.

obivain222

  • I think that running performance testing on a small sample on one machine doesn't mean much – OneCricketeer Apr 05 '16 at 15:01
  • [The Sieve of Eratosthenes](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) would probably be faster. – Fred Larson Apr 05 '16 at 15:04
  • @FredLarson Not what I'm asking about but to reply to your comment. I have implemented it before and from my testing it's only faster if you can correctly guess a good upper bound for the sieve size. Otherwise it's more of a gamble. – obivain222 Apr 05 '16 at 15:24
  • @cricket_007 I've run this test several times and have gotten consistent results, within ~0.01 seconds. – obivain222 Apr 05 '16 at 15:25
  • plus one for a good use of `for-else` :-) – Reblochon Masque Apr 05 '16 at 15:33
  • Using the `timeit` module is probably appropriate here – Wayne Werner Apr 05 '16 at 15:48
  • BTW, some fun bits of weirdness to note on performance: Simply replacing the `test % p == 0` and `test % p != 0` with `not test % p` and `test % p` reduces run time (for largish numbers of primes) by about 12-15% when testing a number which turns out to be prime. Also, as for "needing to guess a good upper bound on the sieve size", [there are approximations for the `pi` function that give a reliable upper bound on the number of primes below a given value](https://primes.utm.edu/howmany.html), which you could use to size your flags array reliably. – ShadowRanger Apr 05 '16 at 16:48
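
To illustrate the last comment above: an upper bound on the nth prime lets you size a classic sieve once, with no guessing. Rosser's theorem gives one such bound: for n ≥ 6, the nth prime is below n(ln n + ln ln n). A minimal sketch (the function name and structure are mine, not from the thread):

from math import log

def sieveNPrimes(num_primes):
    # Sieve of Eratosthenes, sized up front via Rosser's bound:
    # for n >= 6, the nth prime is less than n * (ln n + ln ln n)
    if num_primes < 6:
        return [2, 3, 5, 7, 11][:num_primes]
    limit = int(num_primes * (log(num_primes) + log(log(num_primes)))) + 1
    is_prime = [True] * limit
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, limit, i):
                is_prime[j] = False
    return [i for i in range(limit) if is_prime[i]][:num_primes]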

3 Answers


I may be wrong, but I think that every time it evaluates `test % p != 0` in the generator expression, it does so in a new stack frame, so there's overhead similar to that of a function call. You can see evidence of the stack frame in tracebacks, for example:

>>> all(n/n for n in [0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
ZeroDivisionError: integer division or modulo by zero
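
In CPython you can also inspect the generator's frame object directly; generator objects expose it as `gi_frame`:

>>> gen = (n/n for n in [0])
>>> gen.gi_frame  # the genexpr's own frame
<frame object at 0x...>
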
Alex Hall
  • I hadn't thought about this but the answer makes sense. I'm going to investigate more and see if this is what's going on. – obivain222 Apr 05 '16 at 16:07
  • It's not a new stack frame every time, no Python level function call overhead is being paid. There *is* a new stack frame (a single frame, which all generator expressions involve), but it's not a new one on each iteration (descriptively, the stack frame is being saved and restored when each value is `yield`ed, but the cost of saving/restoring is much lower than the cost of calling a Python function normally, because the generator protocol is heavily optimized). – ShadowRanger Apr 05 '16 at 16:16
  • @ShadowRanger that's cool to know. Nevertheless shouldn't there be some overhead for the switches between frames? – Alex Hall Apr 05 '16 at 16:26
  • There is, but it's only part of the equation. There's also the one-time cost to load the code object, create the closure over it, and call the closure (which creates the initial frame), then load `all` (the `B` in LEGB is a killer) and call it, all of which increases the setup overhead to the point where it swamps the actual work in the loop for smallish loops, and to perform lookup of the non-local variables on every iteration (equivalent to a Python level `dict` lookup vs. a simple C level fixed array index lookup for accessing locals), which increases the "every loop" work. – ShadowRanger Apr 05 '16 at 16:36

It's a combination of a few issues:

  1. Calling builtin functions and loading and executing the generator's code object is semi-expensive to set up, so for small numbers of primes to test, the setup costs drown out the per-test costs
  2. Generator expressions establish an inner scope; variables not being iterated over need to go through normal LEGB lookup costs, so every iteration inside `all`'s generator expression needs to look up `test` to make sure it hasn't changed, and it does so via a dict lookup (where local variable lookup is a cheap lookup in a fixed-size array)
  3. Generators have a small amount of overhead, particularly when jumping in and out of Python bytecode (`all` is implemented at the C layer in CPython)

Things you can do to minimize the difference or eliminate it:

  1. Run on larger iterables for the test (to minimize the effect of setup costs)
  2. Explicitly pull `test` into the local scope of the generator, e.g. as a silly hack `all(test % p != 0 for test in (test,) for p in primes[1:])`
  3. Remove all Python bytecode execution from the loop by using `map` with C builtins, e.g. `all(map(test.__mod__, primes[1:]))` (which also happens to achieve #2, by looking up `test.__mod__` once up front, rather than once per loop)

With a large enough input, #3 can sometimes win over your original code, at least on Python 3.5 (where I microbenchmarked in IPython), depending on a host of factors. It doesn't always win because the bytecode interpreter has some optimizations for `BINARY_MODULO` with values that fit in a CPU register, which skipping straight to the `int.__mod__` code bypasses, but it usually performs quite similarly.
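
For anyone who wants to reproduce the comparison, here is a minimal timeit sketch of the three variants discussed above (the prime list and test value are arbitrary choices of mine; absolute numbers will vary by machine and Python version):

import timeit

setup = """
test = 1009  # a prime, so every variant scans the whole list
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
"""

variants = [
    ("genexpr",       "all(test % p != 0 for p in primes[1:])"),
    ("rebound test",  "all(test % p != 0 for test in (test,) for p in primes[1:])"),
    ("map + __mod__", "all(map(test.__mod__, primes[1:]))"),
]

for name, stmt in variants:
    best = min(timeit.repeat(stmt, setup=setup, repeat=5, number=100000))
    print("{:<14} {:.4f}s".format(name, best))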

ShadowRanger
  • Yup, this did it. I didn't notice a change in performance when using the `for test in (test,)` hack. However the `map(test.__mod__, primes[1:])` did the trick. – obivain222 Apr 05 '16 at 17:01

That is an interesting question about a puzzling result, for which I unfortunately don't have a definite answer... Maybe it is because of the sample size, or the particulars of this calculation? But like you, I found it surprising.

However, it is possible to make `lazySieve` faster than `betterLazySieve`:

def lazySieve(num_primes):
    if num_primes == 0:
        return []
    primes = [2]
    test = 3
    while len(primes) < num_primes:
        sqr_test = test ** 0.5  # only test divisors up to sqrt(test)
        if all(test % p for p in primes[1:] if p <= sqr_test):
            primes.append(test)
        test += 2
    return primes

It runs in about 65% of the time of your version, and is about 15% faster than `betterLazySieve` on my system.

Using `%%timeit` in a Jupyter notebook with Python 3.4.4 on an oldish MacBook Air:

%%timeit 
lazySieve(10001)
# 1 loop, best of 3: 8.19 s per loop

%%timeit
betterLazySieve(10001)
# 1 loop, best of 3: 10.2 s per loop
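
One caveat about the `if p <= sqr_test` filter: a generator expression's `if` clause skips items but still walks all of `primes[1:]`. Since the list is sorted ascending, `itertools.takewhile` can stop the scan at the square root instead. A sketch (the name `lazySieveTakewhile` is mine; the per-item lambda call adds its own overhead, so this is not guaranteed to be faster in practice):

from itertools import takewhile

def lazySieveTakewhile(num_primes):
    if num_primes == 0:
        return []
    primes = [2]
    test = 3
    while len(primes) < num_primes:
        sqr_test = test ** 0.5
        # stop consuming primes once p > sqrt(test), rather than
        # filtering (but still visiting) every prime in the list
        if all(test % p for p in takewhile(lambda p: p <= sqr_test, primes[1:])):
            primes.append(test)
        test += 2
    return primes
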
Reblochon Masque
  • Thanks for the input! Interestingly enough though, after I added in the `sqr_test` check it did run faster, but still not faster than `betterLazySieve`: 13% faster, at 4.23 s. – obivain222 Apr 05 '16 at 15:59
  • Sorry, first question on Stack and still figuring out how this all works! It logged the upvote but it can't display it because I'm too low level. Sorry about the citation, couldn't find a recommended way so I just sort of guessed. – obivain222 Apr 05 '16 at 16:19
  • No worries then, glad I could help a bit. – Reblochon Masque Apr 05 '16 at 16:22