
Here's something I've always wondered about. I'll pose the question for Python, but I would also welcome answers which address the standard libraries in Java and C++.

Let's say you have a Python list called "my_list", and you would like to iterate over its unique elements. There are two natural approaches:

#iterate over set
for x in set(my_list):
    do_something(x)

or

#list to set to list
for x in list(set(my_list)):
    do_something(x)

The tension is that iterating over a list is faster than iterating over a set, but it takes time to convert a set to a list. My guess is that the answer to this question will depend on a number of factors, such as:

  • How many times will we need to iterate?
  • How big is the original list?
  • How many repetitions in the original list should we expect?

So I guess I'm looking for a rule of thumb of the form: "If the list has x elements, with each element repeated no more than y times, and you only need to iterate z times, then you should iterate over the set; otherwise you should convert it to a list."
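
To make this concrete, here's a rough timeit sketch of the kind of measurement I have in mind; the data and do_something below are just stand-ins for my real workload:

import timeit

my_list = list(range(1000)) * 10        # stand-in data: 10,000 items, 1,000 unique

def do_something(x):                    # stand-in for the real per-element work
    pass

def iterate_over_set(k):
    s = set(my_list)
    for _ in range(k):
        for x in s:
            do_something(x)

def iterate_over_converted_list(k):
    l = list(set(my_list))
    for _ in range(k):
        for x in l:
            do_something(x)

for k in (1, 10, 100):
    t_set = timeit.timeit(lambda: iterate_over_set(k), number=10)
    t_list = timeit.timeit(lambda: iterate_over_converted_list(k), number=10)
    print("k=%d  set: %.4f  converted list: %.4f" % (k, t_set, t_list))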

Paul Siegel
  • It shouldn't matter in Python, because of how iterators are implemented. In general it may depend: if a language doesn't have a concept of iterators that can preserve their position in the sequence they iterate over, it may be better to create a vector explicitly beforehand, but it's hard to think of a language where that would happen. Maybe something randomly illogical like JavaScript? –  Sep 19 '15 at 00:21
  • Paul, thanks for providing me with the opportunity to answer this question. I hope it helps; if any part is unclear, let me know and I will clarify. – Russia Must Remove Putin Sep 20 '15 at 13:51
  • @kdgregory A python set is a hash table under the hood, and iterating over hash tables is proportional to the number of elements plus the number of buckets; for lists it's just the number of elements. – Paul Siegel Sep 20 '15 at 14:23
  • Iterating over a hashtable may well be O(entries + buckets), but as bucket count is (typically) proportional to the number of entries, this is equivalent to O(entries), i.e. it has the same algorithmic complexity as iterating over a typical list structure. The relative speeds come down to specific implementations, and while I don't know about Python's implementation, I can show you systems where hash tables (with entries clustered in memory) outperform linked lists (with entries scattered around the heap). – Jules Sep 20 '15 at 14:47
  • Well, low algorithmic complexity is a necessary but not sufficient condition for high performance. And I wouldn't make claims about how algorithms perform on all platforms, but my experience indicates that linked lists outperform hash tables for the purposes of iteration in the Python, Java and C++ standard libraries. I suspect that it's not just the number of buckets; I think the constants are worse too (though I don't have hard evidence). – Paul Siegel Sep 20 '15 at 15:13
  • I've got an additional analysis based on your comments. Let me know what you think. – Russia Must Remove Putin Sep 20 '15 at 23:28
  • @AaronHall Your updated answer looks great, and it most likely answers my question. But it's difficult to re-ask a question once an answer has been accepted, so I just want to fool around a bit before I pull the trigger. Thanks for your help! – Paul Siegel Sep 21 '15 at 13:25

1 Answer


I'm looking for a rule of thumb...

Rule of thumb

Here's the best rule of thumb for writing optimal Python: use as few intermediary steps as possible and avoid materializing unnecessary data structures.

Applied to this question: sets are iterable. Don't convert them to another data structure just to iterate over them. Trust Python to know the fastest way to iterate over sets. If it were faster to convert them to lists, Python would do that.
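
Applied to your example, that just means the following (a minimal sketch; my_list and do_something are the placeholders from your question):

unique = set(my_list)   # build the set of unique elements once
for x in unique:        # iterate over the set directly; no intermediate list
    do_something(x)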

On optimization:

Don't attempt to prematurely optimize by adding complexity to your program. If your program takes too long, profile it, then optimize the bottlenecks. If you're using Python, you're probably more concerned about development time than exactly how long your program takes to run.
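
For example, a quick way to see where the time actually goes (a sketch, assuming your program's entry point is a function called main):

import cProfile
import pstats

cProfile.run('main()', 'profile.out')            # profile one full run of the program
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)   # show the ten most expensive call paths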

Demonstration

In Python 2.7:

import collections
import timeit

# A "black hole" consumer: extending a zero-length deque exhausts an
# iterable in C without storing anything, so we time pure iteration.
blackhole = collections.deque(maxlen=0).extend
s = set(xrange(10000))

We see that for larger n, simpler is better:

>>> timeit.timeit(lambda: blackhole(s))
108.87403416633606
>>> timeit.timeit(lambda: blackhole(list(s)))
189.0135440826416

And for smaller n the same relationship holds:

>>> s = set(xrange(10))
>>> timeit.timeit(lambda: blackhole(s))
0.2969839572906494
>>> timeit.timeit(lambda: blackhole(list(s)))
0.630713939666748

Yes, lists do iterate faster than sets (try it in your own Python interpreter):

l = list(s)
timeit.repeat(lambda: blackhole(l))   # compare against timeit.repeat(lambda: blackhole(s))

But that doesn't mean you should convert sets to lists just for iteration.

Break-even Analysis

OK, so you've profiled your code and found that you're iterating over a set a lot (and I presume the set is static, otherwise what we're doing here is very problematic). I hope you're familiar with the set methods and aren't replicating that functionality yourself. (I also think you should consider pairing frozensets with tuples, because substituting a list (mutable) for a canonical set (also mutable) seems like it could be error-prone.) So with that caveat, let's do an analysis.

It may be that by investing in complexity and accepting a greater risk of errors from more lines of code, you can get a good payoff. This analysis will show where the break-even point lies. I don't know how much extra performance you would need to justify the greater risk and development time, but it will tell you at what point you start to gain anything back at all:

import collections
import timeit
import pandas as pd

BLACKHOLE = collections.deque(maxlen=0).extend

SET = set(range(1000000))

def iterate(n, iterable):
    # iterate directly over the set, n times
    for _ in range(n):
        BLACKHOLE(iterable)

def list_iterate(n, iterable):
    # pay for one list conversion, then iterate over the list n times
    l = list(iterable)
    for _ in range(n):
        BLACKHOLE(l)

columns = ('ConvertList', 'n', 'seconds')

def main():
    results = {c: [] for c in columns}

    for n in range(21):
        for fn in (iterate, list_iterate):
            time = min(timeit.repeat((lambda: fn(n, SET)), number=10))
            results['ConvertList'].append(fn.__name__)
            results['n'].append(n)
            results['seconds'].append(time)

    df = pd.DataFrame(results)
    df2 = df.pivot(index='n', columns='ConvertList')  # one column of timings per function
    df2.plot()
    import pylab
    pylab.show()

main()  # run the benchmark and show the plot

And it looks like your break-even point is at about 5 complete iterations. With fewer than 5 on average, converting to a list cannot pay off. Only at 5 or more iterations do you begin to compensate for the additional development time, complexity (which increases maintenance costs), and risk that come with the extra lines of code.

I think you'd have to be doing it quite a lot for it to be worth adding the extra complexity and lines of code to your project.

[Plot: total seconds versus number of iterations (n) for iterate and list_iterate, with the two curves crossing near n = 5]

These results were created with Anaconda's Python 2.7, from an Ubuntu 14.04 terminal. You may get varying results with different implementations and versions.
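
If profiling does put you on the far side of that break-even point, the pattern is simply to materialize the sequence once, outside the loop, and reuse it. A minimal sketch (my_list, do_something and num_passes are placeholders; I use a frozenset and a tuple for the immutability reasons discussed below):

unique = frozenset(my_list)    # the canonical collection of unique elements
unique_seq = tuple(unique)     # materialized once, iterated many times

for _ in range(num_passes):    # num_passes: however many full passes you need
    for x in unique_seq:
        do_something(x)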


Concerns

What concerns me is that both sets and lists are mutable. A set will prevent you from modifying it while iterating over it, but a list created from that set will not:

>>> s = set('abc')
>>> for e in s:
...     s.add(e + e.upper())
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Set changed size during iteration

If you modify your set while iterating over the derived list, you won't get an error telling you that you did:

>>> for e in list(s):
...     s.add(e + e.upper())

That's why I also suggested using frozensets and tuples instead: they give you a built-in guard against semantically incorrect alteration of your data.

>>> s = frozenset('abc')
>>> t_s = tuple(s)
>>> for e in t_s:
...     s.add(e + e.upper())
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
AttributeError: 'frozenset' object has no attribute 'add'

In the end, you have to trust yourself to get your algorithm right. Frequently, when I warn newer Python users about these sorts of things, I'm later told it was good advice, usually because they didn't listen at first and found that ignoring it created unnecessary complexity, complications, and resulting problems. But there are also things like logical correctness, and you'll have only yourself to blame if you don't get that right. Minimizing the number of things that can go wrong is a benefit that is usually worth the performance trade-off. And again, if performance (rather than correctness or speed of development) were your prime concern while tackling this project, you wouldn't be using Python.

Russia Must Remove Putin
  • In fact this question arose because this issue genuinely was the bottleneck in some of my code: I was iterating many times over a set with millions of elements, and I was able to dramatically improve performance simply by converting it to a list. That's what got me wondering about the relative cost of converting a set to a list vs iterating over the set. – Paul Siegel Sep 20 '15 at 14:14
  • This is really nice, thanks - in my case I was generally going through at least a few hundred iterations, sometimes thousands, so this explains the sharp performance improvement. However, I'm a little confused by your warnings about development time, complexity and risk. The only difference between your "iterate" and "list_iterate" functions is the line "l = list(iterable)", which seems like fairly negligible development time and complexity. And unless Python behaves strangely under the hood, converting a set to a list should be exactly as dangerous as iterating over the set. – Paul Siegel Sep 24 '15 at 11:40
  • You should write a book! – Mr_Spock Sep 25 '15 at 14:15
  • Thanks, @Mr_Spock can I quote you on that? :) – Russia Must Remove Putin Sep 25 '15 at 14:16
  • Absolutely! @AaronHall – Mr_Spock Sep 25 '15 at 14:19