
While I understand that tail recursion optimization is non-Pythonic, I came up with a quick hack for a question on here that was deleted as soon as I was ready to post.

With a 1000-frame stack limit, deeply recursive algorithms are not usable in Python. But recursion is sometimes great for thinking through a solution. Since functions are first-class in Python, I played with having the function return the next function and the next value, then calling those in a loop, one call at a time, until done. I'm sure this isn't new.

What I found interesting is that I expected the extra overhead of passing the function back and forth to make this slower than normal recursion. During my crude testing I found it to take 30-50% of the time of normal recursion. (With an added bonus of allowing LONG recursions.)

Here is the code I'm running:

from contextlib import contextmanager
import time

# Timing code from StackOverflow most likely.
@contextmanager
def time_block(label):
    # perf_counter() is a monotonic, high-resolution timer
    # (time.clock() was removed in Python 3.8)
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        print('{} : {}'.format(label, end - start))


# Purely Recursive Function
def find_zero(num):
    if num == 0:
        return num
    return find_zero(num - 1)


# Trampoline step: returns a tuple of (next function, next value)
def find_zero_tail(num):
    if num == 0:
        return None, num
    return find_zero_tail, num - 1


# Trampoline driver: keeps calling the returned function until it is None
def tail_optimize(method, val):
    while method:
        method, val = method(val)
    return val


with time_block('Pure recursion: 998'):
    find_zero(998)

with time_block('Tail Optimize Hack: 998'):
    tail_optimize(find_zero_tail, 998)

with time_block('Tail Optimize Hack: 10000000'):
    tail_optimize(find_zero_tail, 10000000)

# One Run Result:
# Pure recursion: 998 : 0.000372791020758
# Tail Optimize Hack: 998 : 0.000163852100569
# Tail Optimize Hack: 10000000 : 1.51006975627

Why is the second style faster?

My guess is the overhead of creating stack frames, but I'm not sure how to verify that.
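
One way to probe it (a rough sketch of mine, assuming the functions above are defined in the same module) is to time both styles at several depths with the timeit module:

import sys
import timeit

sys.setrecursionlimit(20000)  # headroom for the pure-recursive version

# Time both styles at several depths; number=1000 repeats each measurement.
# Assumes find_zero, find_zero_tail and tail_optimize from above are in scope.
for depth in (10, 100, 998, 10000):
    rec = timeit.timeit(lambda: find_zero(depth), number=1000)
    tail = timeit.timeit(lambda: tail_optimize(find_zero_tail, depth), number=1000)
    print('depth {:>5}: recursive {:.4f}s  trampoline {:.4f}s'.format(depth, rec, tail))

If the frame-allocation theory is right, the gap should shrink as repeated runs reuse already-allocated frames.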

Edit:

While playing with call counts, I made a loop to try both at various num values. Recursion was much closer to parity when I was looping and calling each multiple times.

So I added this before the timing; it is find_zero under a new name:

def unrelated_recursion(num):
    if num == 0:
        return num
    return unrelated_recursion(num - 1)

unrelated_recursion(998)

Now the tail-optimized call takes 85% of the time of the full recursion.

So my theory is that the 15% penalty is the overhead of the deep stack versus a single frame.

The reason I saw such a huge disparity in execution time when running each only once was the penalty for allocating the stack memory and structures. Once those are allocated, the cost of using them is drastically lower.

Because my algorithm is dead simple, the memory structure allocation is a large portion of the execution time.

When I cut my stack-priming call to unrelated_recursion(499), the find_zero(998) execution time lands about halfway between the fully primed and unprimed cases. That is consistent with the theory.
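
Here is a rough sketch of that priming experiment (my reconstruction; each case should really run in a fresh interpreter, since earlier calls prime the allocator for later ones):

import sys
import time

sys.setrecursionlimit(2000)  # headroom so find_zero(998) fits under extra frames

# Prime the frame allocator with a throwaway recursion of a given depth,
# then time a single find_zero(998) call. Assumes unrelated_recursion and
# find_zero from above are in scope; prime_depth=0 means no priming.
def timed_after_priming(prime_depth):
    if prime_depth:
        unrelated_recursion(prime_depth)
    start = time.perf_counter()
    find_zero(998)
    return time.perf_counter() - start

for prime_depth in (0, 499, 998):
    print('prime {:>3}: {:.6f}s'.format(prime_depth, timed_after_priming(prime_depth)))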

Joe
  • Probably it comes down to the fact that it has to allocate just one stack frame instead of multiple ones. It may even benefit from the fact that probably the allocator is returning the same block at each iteration for the new frame object, so it has better cache locality. – Matteo Italia May 12 '16 at 17:05
  • Switching call order affects it slightly. Normal recursion seems to win with n < 20. Between 20 and 40, depending on call order, they are equal-ish. Tail recursion wins with n > 40. So it does seem to be stack overhead related. – Joe May 12 '16 at 17:30
  • If I time these calls with the `timeit` module, the tail call optimized version wins for low repetition counts, but the other version wins for high repetition counts. – user2357112 May 12 '16 at 17:53
  • The stack frame allocation hypothesis doesn't seem to hold up under further scrutiny. While Python does seem to reuse a single frame object for `find_zero_tail` (see the `code->co_zombieframe` handling in [`Objects/frameobject.c`](https://hg.python.org/cpython/file/2.7/Objects/frameobject.c)), if that were the only effect at play, `find_zero_tail` would still be winning at high repetition counts. – user2357112 May 12 '16 at 18:02
  • Does timeit run multiple times to generate a value? I need to break these into separate files and mess around a little more. The setup and teardown, as well as the run count of the timing mechanism used, might have as much influence as the actual execution. I moved my answer into the main question, because it doesn't seem to be the answer yet. – Joe May 12 '16 at 18:04
  • @Joe: `timeit` runs the timed code with a configurable number of repetitions. It defaults to a million, but that'd take way too long for this code. – user2357112 May 12 '16 at 18:06
  • I tried to use that and thought it had issues with recursion. Didn't think about default values. The hanging I thought I had was 1 million executions, I guess. :) – Joe May 12 '16 at 18:08
  • When I use 30 runs, they are about equal. Less than 30 and tail wins. Over 30 and recursive wins. Using 1 execution gives me close values to my timing code. – Joe May 12 '16 at 18:17
  • What I find really strange is that if I run the timings N times in a row by hand, the tail call optimized version wins consistently, but if I tell `timeit` to run the code N times, the other version does better with larger N. – user2357112 May 12 '16 at 18:23

1 Answer


As a comment helpfully reminded me, I was not really answering the question, so here is my take:

In your optimization, you're allocating, unpacking and deallocating tuples, so I tried without them:

# Returns just the next value, or None when done (no tuple)
def find_zero_tail(num):
    if num == 0:
        return None
    return num - 1


# Iterative driver: loops on the value alone (stops when it's falsy)
def tail_optimize(method, val):
    while val:
        val = method(val)
    return val

For 1000 tries, each starting with value = 998:

  • this version took 0.16s
  • your "optimized" version took 0.22s
  • the "unoptimized" one took 0.29s

(Note that for me, your optimized version is faster than the unoptimized one ... but we're not running exactly the same test.)

But I don't think these stats are all that useful: the cost lies more on Python's side (method calls, tuple allocations, ...) than in your code doing real work. In a real application you won't end up measuring the cost of 1000 tuples, but the cost of your actual implementation.
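
To get a feel for just the tuple build-and-unpack step in isolation, here is a micro-benchmark sketch of mine (not part of the measurements above):

import timeit

# Cost of building a fresh 2-tuple and unpacking it, vs. a bare assignment.
# The intermediate name t forces a real tuple allocation; without it, CPython
# can optimize `a, b = (f, v)` into plain stack operations.
print(timeit.timeit('t = (f, v); a, b = t', setup='f = None; v = 41'))
print(timeit.timeit('a = v', setup='v = 41'))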

But simply don't do this: it's hard to read for almost no gain. You're writing for the reader, not for the machine:

# Trampoline step: returns a tuple of (next function, next value)
def find_zero_tail(num):
    if num == 0:
        return None, num
    return find_zero_tail, num - 1


# Trampoline driver: keeps calling the returned function until it is None
def tail_optimize(method, val):
    while method:
        method, val = method(val)
    return val

I won't try to implement a more readable version of it because I'll probably end up with:

def find_zero(val):
    return 0

But I think that in real cases there are nicer ways to deal with recursion limits (on both the memory-size and depth sides):

To help with memory (not depth), an lru_cache from functools can typically help a lot:

>>> from functools import lru_cache
>>> @lru_cache()
... def fib(x):
...     return fib(x - 1) + fib(x - 2) if x > 2 else 1
... 
>>> fib(100)
354224848179261915075

And for stack size, you can use a list or a deque, depending on your context and usage, instead of the language's call stack. Depending on the exact implementation (when you're in fact storing simple sub-computations on your stack to reuse them), this is called dynamic programming:

>>> def fib(x):
...     stack = [1, 1]
...     while len(stack) < x:
...         stack.append(stack[-1] + stack[-2])
...     return stack[-1]
... 
>>> fib(100)
354224848179261915075
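
The deque mentioned above shines when the recursion isn't linear. Here is a sketch of mine (not from the original answer) using an explicit work stack to flatten a nested structure without touching the call stack:

>>> from collections import deque
>>> def nested_sum(obj):
...     work = deque([obj])    # explicit stack replaces call-stack recursion
...     total = 0
...     while work:
...         item = work.pop()
...         if isinstance(item, list):
...             work.extend(item)    # push children instead of recursing
...         else:
...             total += item
...     return total
... 
>>> nested_sum([1, [2, [3, [4]]], 5])
15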

But, and here comes the nice part of using your own structure instead of the call stack: you don't always need to keep the whole stack to continue the computation:

>>> def fib(x):
...     stack = (1, 1)
...     for _ in range(x - 2):
...         stack = stack[1], stack[0] + stack[1]
...     return stack[1]
... 
>>> fib(100)
354224848179261915075

But to conclude with a nice touch of "know the problem before trying to implement it" (it's unreadable, hard to debug, hard to visually prove; it's bad code, but it's fun):

>>> def fib(n):
...     return (4 << n*(3+n)) // ((4 << 2*n) - (2 << n) - 1) & ((2 << n) - 1)
... 
>>> 
>>> fib(99)
354224848179261915075

If you ask me, the best implementation is the most readable one: for the Fibonacci example, probably the one with an LRU cache (but with the ... if ... else ... changed to a more readable if statement); for another example a deque may be more readable; and for yet others, dynamic programming may be better.

"You're writing for the human reading your code, not for the machine".

Julien Palard
  • This doesn't answer the question. The question isn't about how to write the code "right"; it's about why the performance difference is happening, and your post only has a few lines of vague speculation on that front. Also, memoization doesn't necessarily reduce the maximum amount of stack space a function uses, so it's not an effective way to prevent stack overflows, and `functools.lru_cache` defaults to a limit of 128 cached results. – user2357112 May 19 '16 at 22:25
  • "the tail recursion push less pressure on the memory (less allocations, probably less sbrk syscalls, so less context switches, ...)." – Julien Palard May 19 '16 at 22:48
  • You didn't actually test any of that. It's just speculation. It's not even substantially new speculation. – user2357112 May 19 '16 at 22:51
  • You're right, because it's completely meaningless to actually test the performance of 999 calls doing nothing. But I'll gladly +1 your response if you do and find something relevant :-) – Julien Palard May 19 '16 at 22:59
  • It may even be the construction / unpacking / destruction of the tuple needed to pack the method and the value... :) – Julien Palard May 19 '16 at 23:02
  • @user2357112 thanks to your comments I enhanced my response. Have a good day :) – Julien Palard May 19 '16 at 23:29
  • Well, you're at least adding something new on the performance testing front now, but your results don't really give any new insights on the tail recursion trampoline thing. You've made things faster, but not in a way that tells us anything about why the existing code's performance compares the way it does. – user2357112 May 19 '16 at 23:42