
I'm currently working on a high-performance Python 2.7 project utilizing lists tens of thousands of elements in size. Obviously, every operation must be performed as fast as possible.

So, I have two lists: one of them is a list of unique, arbitrary numbers, let's call it A; the other one, named B, is a linear list starting with 1, with the same length as the first list, and represents indices in A (starting with 1).

Something like enumerate, starting with 1.

For example:

A = [500, 300, 400, 200, 100] # The order here is arbitrary, they can be any integers, but every integer can only exist once
B = [  1,   2,   3,   4,   5] # This is fixed, starting from 1, with exactly as many elements as A

If I have an element of B (called e_B) and want the corresponding element in A, I can simply do correspond_e_A = A[e_B - 1]. No problem.
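For illustration, a tiny example with the sample lists above (the value of e_B here is just made up):

e_B = 4
correspond_e_A = A[e_B - 1]  # A[3] == 200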

But now I have a huge list of random, non-unique integers, and I want to know the indices of the integers that are in A, and what the corresponding elements in B are.

I think I have a reasonable solution for the first question:

indices_of_existing = numpy.nonzero(numpy.in1d(random_list, A))[0]

What is great about this approach is that there is no need to map() single operations; numpy's in1d just returns an array like [True, True, False, True, ...]. Using nonzero() I can get the indices of the elements in random_list that exist in A. Perfect, I think.
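A small sketch of what I mean, using the example A from above and a made-up random_list:

import numpy

A = [500, 300, 400, 200, 100]
random_list = [200, 100, 50]                   # made-up sample data, non-unique in general
mask = numpy.in1d(random_list, A)              # array([ True,  True, False])
indices_of_existing = numpy.nonzero(mask)[0]   # array([0, 1])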

But for the second question, I'm stumped. I tried something like:

corresponding_e_B = map(lambda x: numpy.where(A == x)[0][0] + 1, random_list)

This correctly gives me the indices (assuming A is a numpy array here), but it's not optimal: firstly I need a map(), secondly I need a lambda, and finally numpy.where() does not stop after the item has been found once (remember, A contains only unique elements), meaning that it scales horribly with huge datasets like mine.

I took a look at bisect, but it seems bisect only works with single requests, not with lists, meaning that I'd still have to use map() and build my list elementwise, roughly as in the sketch below (that's slow, isn't it?).
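Roughly what I had in mind with bisect (an untested sketch, assuming I keep a sorted copy of A together with the original 1-based positions):

import bisect

pairs = sorted((value, pos) for pos, value in enumerate(A, 1))
keys = [value for value, _ in pairs]       # sorted copy of A
positions = [pos for _, pos in pairs]      # matching 1-based positions (entries of B)

def lookup(x):
    i = bisect.bisect_left(keys, x)
    if i < len(keys) and keys[i] == x:
        return positions[i]
    return None

corresponding_e_B = map(lookup, random_list)  # still one Python-level call per element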

Since I'm quite new to Python, I was hoping anyone here might have an idea? Maybe a library I don't know yet?

JiaYow

2 Answers


I think you would be well advised to use a hashtable for your lookups instead of numpy.in1d, which uses an O(n log n) merge sort as a preprocessing step.

>>> A = [500, 300, 400, 200, 100]
>>> index = { k:i for i,k in enumerate(A, 1) }
>>> random_list = [200, 100, 50]
>>> [i for i,x in enumerate(random_list) if x in index]
[0, 1]
>>> map(index.get, random_list)
[4, 5, None]
>>> filter(None, map(index.get, random_list))
[4, 5]

This is Python 2; in Python 3, map and filter return generator-like objects, so you would need to wrap the filter call in list() if you want the result as a list.
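For example, the last line of the session above written for Python 3:

>>> list(filter(None, map(index.get, random_list)))
[4, 5]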

I have tried to use builtin functions as much as possible to push the computational burden to the C side (assuming you use CPython). All the names are resolved upfront, so it should be pretty fast.

In general, for maximum performance, you might want to consider using PyPy, a great alternative Python implementation with JIT compilation.

A benchmark to compare multiple approaches is never a bad idea:

import sys
is_pypy = '__pypy__' in sys.builtin_module_names

import timeit
import random
if not is_pypy:
  import numpy
import bisect

n = 10000
m = 10000
q = 100

A = set()
while len(A) < n: A.add(random.randint(0,2*n))
A = list(A)

queries = set()
while len(queries) < m: queries.add(random.randint(0,2*n))
queries = list(queries)

# these two solve question one (find indices of queries that exist in A)
if not is_pypy:
  def fun11():
    for _ in range(q):
      numpy.nonzero(numpy.in1d(queries, A))[0]

def fun12():
  index = set(A)
  for _ in range(q):
    [i for i,x in enumerate(queries) if x in index]

# these three solve question two (find the corresponding entries of B)
def fun21():
  index = { k:i for i,k in enumerate(A, 1) }
  for _ in range(q):
    [index[i] for i in queries if i in index]

def fun22():
  index = { k:i for i,k in enumerate(A, 1) }
  for _ in range(q):
    list(filter(None, map(index.get, queries)))

def findit(keys, values, key):
  # binary-search lookup: return the stored 1-based position of key, or None if absent
  i = bisect.bisect_left(keys, key)
  if i == len(keys) or keys[i] != key:
    return None
  return values[i]

def fun23():
  keys, values = zip(*sorted((k,i) for i,k in enumerate(A,1)))
  for _ in range(q):
    list(filter(None, [findit(keys, values, x) for x in queries]))

if not is_pypy:
  # note this does not filter out nonexisting elements
  def fun24():
    I = numpy.argsort(A)
    ss = numpy.searchsorted
    maxi = len(I)
    for _ in range(q):   
      a = ss(A, queries, sorter=I)
      I[a[a<maxi]]

tests = ("fun12", "fun21", "fun22", "fun23")
if not is_pypy: tests = ("fun11",) + tests + ("fun24",)

if is_pypy:
  # warmup
  for f in tests:
    timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=20)

# actual timing
for f in tests:
  print("%s: %.3f" % (f, timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=3)))

Results:

$ python2 -V
Python 2.7.6
$ python3 -V
Python 3.3.3
$ pypy -V
Python 2.7.3 (87aa9de10f9ca71da9ab4a3d53e0ba176b67d086, Dec 04 2013, 12:50:47)
[PyPy 2.2.1 with GCC 4.8.2]
$ python2 test.py
fun11: 1.016
fun12: 0.349
fun21: 0.302
fun22: 0.276
fun23: 2.432
fun24: 0.897
$ python3 test.py
fun11: 0.973
fun12: 0.382
fun21: 0.423
fun22: 0.341
fun23: 3.650
fun24: 0.894
$ pypy ~/tmp/test.py
fun12: 0.087
fun21: 0.073
fun22: 0.128
fun23: 1.131

You can tweak n (size of A), m (size of random_list) and q (number of queries) to match your scenario. To my surprise, my attempt to be clever and use built-in functions instead of list comprehensions has not paid off, since fun22 is not a lot faster than fun21 (only ~10% in Python 2 and ~25% in Python 3, but almost 75% slower in PyPy). A case of premature optimization here. I guess the difference is due to the fact that fun22 builds up an unnecessary temporary list per query in Python 2. We also see that binary search fares pretty badly (look at fun23).

Niklas B.
  • Aaalright, that's brilliant! Worked like a charm. And 10% improvement *is* a lot faster, at least for my use case. So I just wanted to comment that your solution does not solve question 1, but apparently you edited the solution right in :-D Thanks so much! That's a great answer btw, including benchmarks and so on :-) – JiaYow Feb 01 '14 at 20:58
  • @JiaYow: Well you solved question 1 yourself. I initially thought that `in1d` was implemented in C, but it's not (and it also uses a somewhat suboptimal algorithm) so we can do better even with plain Python – Niklas B. Feb 01 '14 at 20:59
  • Yes but I was thinking/hoping that both questions could be answered at once. Since we already loop through the list to convert the entries, it should be possible to get the Not-None indices in the same go. filter() only gets the Not-None *values*, while np.nonzero() would get the Not-None *indices*. I was hoping those could be combined (maybe without np. According to my timers, np.array() is now starting to become a bottleneck) – JiaYow Feb 01 '14 at 21:03
  • @JiaYow: Sure, you can do both in one single pass: `[(i,index[x]) for i,x in enumerate(random_list) if x in index]` gives you a list of pairs of `(index in random_list, index in A)` for those elements of `random_list` that exist in `A` as well. – Niklas B. Feb 01 '14 at 21:14
  • @JiaYow: By the way, I added PyPy to the benchmark. It's about 2-3 times faster than CPython 2.7 in all the benchmarks and 4 times faster for the algorithm to solve question two. It obviously doesn't profit from using builtin functions at all, which is to be expected since it JIT-compiles the list comprehensions. You should definitely consider using it, but unfortunately it only partially supports numpy as of yet :( – Niklas B. Feb 01 '14 at 21:17
  • Yes, unfortunately numpy is a requirement in my project :-( But this "both in one single pass" thing does not do the mapping, does it? I don't get it? But nevermind, just by doing both operations in succession (first get those existent in A, then only map this filtered list - I don't even need filter(None,), because I know the elements exist in A), I got an enormous performance gain: My first settings took about 17 seconds. With your style, this time reduced to friggin 1.099 seconds. And that's for a *tiny* subset of my data. Thanks so much! :-) – JiaYow Feb 01 '14 at 21:38
  • @JiaYow: For `A = [500, 300, 400, 200, 100]; random_list = [200, 100, 50]; result = [(i,index[x]) for i,x in enumerate(random_list) if x in index]` we get `result == [(0, 4), (1, 5)]`. This gives you the indices of `random_list` that exist in `A` (`0`, `1`), along with the indices of the corresponding values of `A` (`4`, `5`). You can splice the list of tuples if you need two separate lists: `a, b = zip(*result)`. – Niklas B. Feb 01 '14 at 21:43
  • The 17-fold performance difference is due to the log-factor in the sorting step. The larger your input set is, the larger will that factor be (the "constant" factor of hash tables will probably also grow a bit with larger data, but not nearly as much) – Niklas B. Feb 01 '14 at 21:44
  • Yup, that last solution subtracted another 0.5 seconds from a 15-second operation. Considering the scale of my data set, that's not bad at all :-) Thank you again. You're pretty awesome^.^ – JiaYow Feb 01 '14 at 22:16
  • @JiaYow: Thanks bro, good luck with your project. I'm just procrastinating while I should really start to write some code for my bachelor's thesis so thanks for the opportunity ;) – Niklas B. Feb 01 '14 at 22:18
  • Not sure how a numpy solution would compare to PyPy, but I highly doubt that the n log n character of the sort has much to do with it. Swapping an element in a C array is, what, 100x faster than creating and hashing a Python object? exp(100) is a big number. I'll cook up a numpy solution for comparison – Eelco Hoogendoorn Feb 01 '14 at 22:24
  • @EelcoHoogendoorn: As you can see from the benchmark, even a simple list comprehension outperforms `in1d` for this particular problem. "Creating and hashing" Python objects is not necessary here. Integers are represented in CPython as 2 words, hashing is a simple `O(1)` procedure implemented in C and hash tables are implemented in C as well. In fact, as I showed, you actually can avoid using *any Python at all* as shown, so we are comparing a sort-based solution to a hash-based solution, both implemented in C. Obviously hash-based wins, as is to be expected. – Niklas B. Feb 01 '14 at 22:29
  • I'd be happy to include another numpy-based solution to the benchmark for comparison, though :) – Niklas B. Feb 01 '14 at 22:31
  • @EelcoHoogendoorn: Also, performance-wise, I think swapping two random elements in a large enough array is not really a lot faster than hashing an integer and inserting it into a well implemented hash table, since the dominating cost is the number of cache misses, which is small in both cases. – Niklas B. Feb 01 '14 at 22:39
  • Dunno; if the time constant of the Python solution is indeed that small, it may be faster. But that is more likely to be the deciding factor than the log(N), I'd say. – Eelco Hoogendoorn Feb 01 '14 at 22:42
  • @EelcoHoogendoorn: Basic complexity analysis says that sort has `O(n log n)` asymptotic complexity while inserting `n` elements in a hash table has expected complexity of `O(n)`. So how is it unreasonable to interpret an increasing ratio of the runtimes of two algorithms with increasing `n` as a manifestation of the very fact that one algorithm has polylinear runtime while the other has expected linear runtime? – Niklas B. Feb 01 '14 at 22:51
  • It's not unreasonable; but log(10000) ~ 10, and the speed difference of doing things in Python versus C is typically much larger than that. Even for large N, the extra log N need not be your primary concern. However, you may be right that Python manages to map this problem to C pretty effectively, without needing the more heavyweight Python constructs. – Eelco Hoogendoorn Feb 01 '14 at 22:58
  • @Eelco: Sure. What I'm referring to is that in my tests, `fun11` (`numpy.in1d`) is about 3 times slower than `fun12` (Python hash table). In OP's tests, it's about 17 times slower. I conclude that the reason for it being relatively even slower is due to the input size and the logarithmic factor of `fun11` that `fun12` does not have and that his input must be around `2^(17/3) = 50` times larger than mine. Maybe we had a misunderstanding here :) I'd really like to see a fast numpy solution for this, my guts say it's possible, but I don't know numpy at all... – Niklas B. Feb 01 '14 at 23:01
  • Just to clarify, the 17 times gain was all things considered: I compared your solution with mine using in1d *and* map/lambda/numpy. – JiaYow Feb 01 '14 at 23:07
  • I missed that, sorry; yeah you are of course correct, for larger N we may expect the O(N) solution to pull ahead. That said, there is still the question as to how random A is in practice. – Eelco Hoogendoorn Feb 01 '14 at 23:20
  • @Eelco: Actually the `O(n)` version is faster even for small inputs... But yeah, I guess we should move on and put this argument behind us – Niklas B. Feb 01 '14 at 23:23
import numpy as np

def numpy_optimized(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    return I[Q]

This calculates the same thing as the OP's code, but with the indices in matching order to the values queried, which I imagine is an improvement in functionality. It is up to twice as fast as the OP's solution on my machine, which puts it slightly ahead of the non-PyPy solutions, if I interpret your benchmarks correctly.

Or, in case we cannot assume all entries of index are present in values and would like missing queries to be silently dropped:

def numpy_optimized_filtered(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    Z = I[Q]
    # keep only hits where the looked-up value actually equals the query
    return Z[values[Z]==index]
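For illustration, a quick usage sketch with the sample data from the question, assuming numpy_optimized_filtered as defined above (the variable names are mine; note that the returned entries are 0-based positions into values, so adding 1 gives the corresponding entries of B):

import numpy as np

values = np.array([500, 300, 400, 200, 100])   # the question's A
queries = np.array([200, 100, 50])             # the question's random_list
pos = numpy_optimized_filtered(queries, values)
print(pos)       # [3 4] -> 0-based positions in A
print(pos + 1)   # [4 5] -> corresponding entries of B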
Eelco Hoogendoorn
  • Thanks for suggesting this. I will add it to the benchmarks – Niklas B. Feb 01 '14 at 23:05
  • So is there also a way to filter out nonexisting entries? – Niklas B. Feb 01 '14 at 23:09
  • you are right; these solutions are not identical in that regard. it worked in my code where the queries were sampled from A, but if that isn't the case this code produces different results. made an edit – Eelco Hoogendoorn Feb 01 '14 at 23:17
  • Well now it doesn't work any longer if `index` is an array (it was nice that the first solution did). – Niklas B. Feb 01 '14 at 23:20
  • By the way, I added it to the benchmark in its first version. Please have a look to make sure I didn't make a mistake while adapting it. It's barely faster than the OP's attempt to solve question one, across the entire range of input sizes it seems, but it solves question two in almost pure numpy, which is cool :) – Niklas B. Feb 01 '14 at 23:21
  • you are correct, I broke something. bedtime though, ill try to patch it up in the morning ;) – Eelco Hoogendoorn Feb 01 '14 at 23:23