4

If I have a list that is already sorted and use the in keyword, for example:

a = [1,2,5,6,8,9,10]
print(8 in a)

I think this does a sequential search, but can I make it faster by using a binary search? Is there a Pythonic way to search a sorted list?

off99555
  • 'I think this should do a sequential search'. Why do you think that is what is happening? –  Apr 14 '16 at 07:25
    convert it to a set and then use "in" – Benjamin Apr 14 '16 at 07:25
    @Lutz Because the interpreter cannot magically figure out that the list is sorted? – Voo Apr 14 '16 at 07:25
  • @Voo Is this a question or a statement? –  Apr 14 '16 at 07:26
    @Lutz `def is_in(some_arr, val): return val in some_arr` - how do you think the interpreter should figure out whether `some_arr` is sorted or not. Clearly this is impossible so it can't do that. Well it could have an extra check in there to figure out if the list is sorted and then use a binary search - but since that requires to go through the whole list, that rather defeats the purpose. – Voo Apr 14 '16 at 07:27
    @Benjamin: conversion to a set is only helpful if you want to do multiple `in` tests. If the list is sorted, bisection (O(logN)) is going to beat conversion to a set (O(N)). – Martijn Pieters Apr 14 '16 at 07:35
  • @AnttiHaapala: `bisect` is C-accelerated. – Martijn Pieters Apr 14 '16 at 07:57
  • @MartijnPieters I stand corrected :d. Seems to have been there since Python 2.4 – Antti Haapala -- Слава Україні Apr 14 '16 at 07:58
  • I was expecting that python could have a class invariant like SortedList so that when use along with "in" operator it will use optimized binary search instead of sequential search. That's why this question occurred. And I think that this is a pretty general question but I couldn't find anyone asking it so I asked it myself. – off99555 Apr 14 '16 at 08:40
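The kind of self-sorting container wished for in the last comment can be sketched in pure Python on top of `bisect` (a hypothetical class for illustration, not part of the standard library; the third-party `sortedcontainers` package provides a production-quality `SortedList` along these lines):

```python
from bisect import bisect_left, insort

class SortedList(list):
    """Hypothetical list that keeps itself sorted and answers
    the `in` operator with a binary search instead of a scan."""

    def __init__(self, iterable=()):
        super().__init__(sorted(iterable))

    def add(self, value):
        # O(N) because of the shift, but the order invariant is preserved
        insort(self, value)

    def __contains__(self, value):
        # O(log N) membership test via binary search
        i = bisect_left(self, value)
        return i != len(self) and self[i] == value
```

With this, `8 in SortedList([1, 2, 5, 6, 8, 9, 10])` uses the binary search automatically, because `in` dispatches to `__contains__`.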

3 Answers

7

The standard library has the bisect module which supports searching in sorted sequences.

However, for small lists, I would bet that the C implementation behind the in operator would beat out bisect. You'd have to measure with a bunch of common cases to determine the real break-even point on your target hardware...
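A rough way to run that measurement (a sketch only; the exact break-even point depends on your hardware, list size, and hit/miss ratio):

```python
import timeit
from bisect import bisect_left

def bisect_contains(a, x):
    """Membership test on a sorted sequence via binary search."""
    i = bisect_left(a, x)
    return i != len(a) and a[i] == x

for n in (10, 100, 1000):
    a = list(range(n))
    x = n - 1  # worst case for the sequential `in` scan
    t_in = timeit.timeit(lambda: x in a, number=100_000)
    t_bi = timeit.timeit(lambda: bisect_contains(a, x), number=100_000)
    print(f"n={n:4d}: in={t_in:.3f}s  bisect={t_bi:.3f}s")
```

For small `n` the C loop behind `in` tends to win; as `n` grows, the O(log N) search pulls ahead.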


It's worth noting that if you can get away with an unordered collection (i.e. a set), then the in operator does the lookup in O(1) time on average, compared to O(logN) for bisection on a sorted sequence and O(N) for the in operator on a sequence. With a set you also avoid the cost of sorting in the first place :-).

mgilson
    I did some tests, the break-even point is quite small actually; about 30 integers in range 0-60, if half of the lookups would be misses. – Antti Haapala -- Слава Україні Apr 14 '16 at 08:32
  • @AnttiHaapala -- That sounds pretty reasonable. Thanks for doing that :-). It gets really interesting doing these kinds of tests in compiled languages like C or Fortran. Then [cache locality and branch prediction](http://stackoverflow.com/q/10524032/748858) can start to really influence your runtime. – mgilson Apr 14 '16 at 15:33
5

There is a binary search for Python in the standard library, in the bisect module. It does not support the in operator (__contains__) directly, but you can write a small function to handle it:

from bisect import bisect_left
def contains(a, x):
    """returns true if sorted sequence `a` contains `x`"""
    i = bisect_left(a, x)
    return i != len(a) and a[i] == x

Then

>>> contains([1,2,3], 3)
True
>>> contains([1,2,3], 4)
False

This does not automatically beat the sequential in, though: bisect has had an optional C acceleration in CPython since Python 2.4, so the binary search itself runs at C speed, but for short lists the plain sequential scan can still be faster.

It is hard to time the exact break-even point in CPython, because the competing code runs in C; if you check for a value that is greater than or less than every value in the sequence, the CPU's branch prediction will play tricks on you, and you get:

In [2]: a = list(range(100))
In [3]: %timeit contains(a, 101)
The slowest run took 8.09 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 370 ns per loop

Here, the best of 3 is not representative of the true running time of the algorithm.

But tweaking tests, I've reached the conclusion that bisecting might be faster than in for lists having as few as 30 elements.


However, if you're doing many in operations you ought to use a set; you can convert the list into a set once (it does not even need to be sorted) and the in operation will be asymptotically faster than any binary search ever would be:

>>> a = [10, 6, 8, 1, 2, 5, 9]
>>> a_set = set(a)
>>> 10 in a_set
True

On the other hand, sorting a list has greater time-complexity than building a set, so most of the time a set would be the way to go.

1

I would go with this pure one-liner (providing bisect is imported):

a and a[bisect.bisect_right(a, x) - 1] == x

Stress test:

from bisect import bisect_right
from random import randrange

def contains(a, x):
    return a and a[bisect_right(a, x) - 1] == x

for _ in range(10000):
    a = sorted(randrange(10) for _ in range(10))
    x = randrange(-5, 15)
    assert (x in a) == contains(a, x), f"Error for {x} in {a}"    

... doesn't print anything.
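One edge case worth noting (a minor variation, not part of the original answer): with an empty list the expression returns the list itself, which is falsy but not the value `False`. Wrapping it in `bool()` normalizes the result:

```python
from bisect import bisect_right

def contains(a, x):
    """Membership test on sorted `a`; bool() normalizes the
    empty-list case, where `a and ...` would return [] instead of False."""
    return bool(a and a[bisect_right(a, x) - 1] == x)
```

Without the `bool()`, a strict comparison like `contains([], 5) == (5 in [])` evaluates to `[] == False`, which is False even though both sides are falsy.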

Aristide