There is list.index, which is written in C under the hood, so one would superficially assume it cannot be beaten. But as shown below, a speed-up of 10-20x is possible by utilizing numpy and cython.
Let's establish the baseline:
# create data set:
N=10**6
data = [format(i,'016d').encode() for i in range(N)]
key = data[N-1]
# measure running time
%timeit data.index(key) # 16.7 ms ± 789 µs
But there is room for improvement. First, we know that all elements are bytes (or at least the element we are searching for is), so we don't need to invoke dynamic dispatch - the costly machinery that would be able to compare instances of arbitrary classes. It is quite easy to do this with Cython - actually your version of index_cython is already doing it:
%timeit index_cython(data, key) # 8.35 ms ± 216 µs
Twice as fast! The reason is that the variable key is typed as bytes, and thus u == q can use the specialized __Pyx_PyBytes_Equals instead of the generic and slower PyObject_RichCompare.
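For reference, a minimal sketch of what such a typed index_cython might look like (a hypothetical reconstruction - the question's version is not reproduced here):

%%cython
def index_cython(list uuids, bytes q):
    # q is declared as bytes, so u == q compiles to __Pyx_PyBytes_Equals
    cdef Py_ssize_t i
    for i in range(len(uuids)):
        u = uuids[i]
        if u == q:
            return i
    raise ValueError("not in list")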
Now one can see in the C-file that the yellow line in the for-loop is due to the check that the object isn't None:
if (unlikely(__pyx_v_uuids == Py_None)) {
    PyErr_SetString(PyExc_TypeError, "'NoneType' object is not subscriptable");
    __PYX_ERR(0, 11, __pyx_L1_error)
}
This can be avoided by performing the check only once instead of in every iteration, by typing the function declaration as:
def index_cython_2(list uuids not None, bytes q):
    ...
By adding not None one shifts the check to the beginning of the function, where it happens only once. This version is somewhat faster:
%timeit index_cython_2(data, key) # 7.88 ms ± 194 µs
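For completeness, a possible full version (the body is a hypothetical completion of the "..." above) would be:

%%cython
def index_cython_2(list uuids not None, bytes q):
    # identical to index_cython, the only change is the "not None" declaration
    cdef Py_ssize_t i
    for i in range(len(uuids)):
        u = uuids[i]
        if u == q:
            return i
    raise ValueError("not in list")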
Another (common) issue with Python objects is their memory layout: they have some memory overhead - like PyObject_VAR_HEAD - and even when they are next to each other in the list, the addresses where the actual data is stored can be far apart, which leads to many cache misses.
The way our list was created actually results in the bytes objects lying next to each other in memory (this is an implementation detail of CPython); to really see the effect we can shuffle the list first:
import random
sh_data = list(data)
random.shuffle(sh_data)
sh_key=sh_data[N-1]
%timeit index_cython_2(sh_data, sh_key) # 50.4 ms ± 2.05 ms
That is the impact of cache misses: about 6 times slower!
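A quick way to see the scattering (illustrative, not part of the original measurement) is to compare the addresses of neighbouring elements - in CPython, id() returns the object's address:

# neighbouring list entries no longer point to neighbouring memory after shuffling
print(id(data[1]) - id(data[0]))        # small: the bytes objects were allocated back to back
print(id(sh_data[1]) - id(sh_data[0]))  # typically large and irregular after shuffling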
As we know all elements are 16 bytes long, we can use numpy arrays to fix the issue of fragmented memory:
import numpy as np
sh_data_as_np = np.array(sh_data)
%timeit np.where(sh_data_as_np==sh_key)[0][0] # 8.4 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
sh_data_as_np is of dtype |S16, which means contiguous memory where every 16 bytes (without a trailing \0) form one element. We are back to the old performance, independent of the order of the elements.
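For illustration (a small check, not part of the original answer), one can verify the dtype and that the data lives in a single contiguous buffer:

print(sh_data_as_np.dtype)                   # dtype('S16')
print(sh_data_as_np.flags['C_CONTIGUOUS'])   # True
print(sh_data_as_np.nbytes)                  # 16 * N bytes in one block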
However, np.where doesn't short-circuit and runs through the whole array even if the first element is the one we are looking for. Let's once again use Cython to improve on it (this approach is used for accessing the underlying data):
%%cython
cimport cython
cimport numpy as np
from libc.string cimport memcmp

@cython.boundscheck(False)
@cython.wraparound(False)
def search_cython_c(np.uint8_t[::1] data, np.uint8_t[::1] key):
    cdef int size = len(key)
    cdef int n = len(data) // size
    cdef int i
    for i in range(n):
        # compare the raw bytes of the key with the i-th element
        if memcmp(<void*>&key[0], <void*>&data[i*size], size) == 0:
            return i
    raise ValueError
and now:
%timeit search_cython_c(sh_data_as_np.view(np.uint8), np.array([sh_data[N-1]]).view(np.uint8)) # 4.1 ms ± 148 µs
%timeit search_cython_c(sh_data_as_np.view(np.uint8), np.array([sh_data[N//2]]).view(np.uint8)) # 2.15 ms ± 118 µs
As one can see, the element from the middle is found twice as fast as the element from the end (which is good), and even the worst case is about 2 times faster than the numpy version (even better!). It is hard to tell why Cython outperforms numpy here, probably because the resulting C code can be optimized better by the compiler.
This is still not the end: one could use parallelization or try to improve on memcmp by utilizing the fact that there are always 16 bytes to compare.
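A minimal sketch of the latter idea (an assumption on my part, not benchmarked here): hard-coding the 16-byte element size lets the C compiler replace the memcmp call with a couple of word comparisons:

%%cython
cimport cython
cimport numpy as np
from libc.string cimport memcmp

@cython.boundscheck(False)
@cython.wraparound(False)
def search_cython_16(np.uint8_t[::1] data, np.uint8_t[::1] key):
    # the element size is a compile-time constant here, so the compiler
    # can inline/vectorize the comparison instead of calling memcmp
    cdef Py_ssize_t n = len(data) // 16
    cdef Py_ssize_t i
    for i in range(n):
        if memcmp(<void*>&key[0], <void*>&data[i*16], 16) == 0:
            return i
    raise ValueError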