There is list.index, which is written in C under the hood, so one would superficially assume it cannot be beaten. But as shown below, a speed-up of 10-20x is possible by utilizing numpy and cython.
Let's establish the baseline:
# create data set:
N=10**6
data = [format(i,'016d').encode() for i in range(N)]
key = data[N-1]
# measure running time
%timeit data.index(key) # 16.7 ms ± 789 µs
But there is room for improvement. First, we know that all elements are bytes (or at least the element we are searching for is), so we don't need to invoke dynamic dispatch - the costly machinery that would be able to compare instances of arbitrary classes. It is quite easy to do this with Cython - actually your version of index_cython is already doing it:
%timeit index_cython(data, key) # 8.35 ms ± 216 µs
Twice as fast! The reason is that the variable key is typed as bytes, and thus u == q can use the specialized __Pyx_PyBytes_Equals instead of the generic and slower PyObject_RichCompare.
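For reference, a minimal sketch of what such a typed index_cython might look like (a hypothetical reconstruction - the question's version is not reproduced here):

%%cython
def index_cython(list uuids, bytes q):
    # q is declared as bytes, so u == q compiles to __Pyx_PyBytes_Equals
    cdef Py_ssize_t i
    for i in range(len(uuids)):
        u = uuids[i]
        if u == q:
            return i
    raise ValueError("not in list")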
Now one can see in the C-file that the yellow line in the for-loop is due to the check that the object isn't None:
if (unlikely(__pyx_v_uuids == Py_None)) {
    PyErr_SetString(PyExc_TypeError, "'NoneType' object is not subscriptable");
    __PYX_ERR(0, 11, __pyx_L1_error)
}
This can be avoided by performing the check only once instead of in every iteration, by typing the function declaration as:
def index_cython_2(list uuids not None, bytes q):
    ...
By adding not None one shifts the check to the beginning of the function, where it happens only once. This version is somewhat faster:
%timeit index_cython_2(data, key) # 7.88 ms ± 194 µs
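For completeness, a possible full version (the body is a hypothetical completion of the "..." above) would be:

%%cython
def index_cython_2(list uuids not None, bytes q):
    # identical to index_cython, the only change is the "not None" declaration
    cdef Py_ssize_t i
    for i in range(len(uuids)):
        u = uuids[i]
        if u == q:
            return i
    raise ValueError("not in list")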
Another (common) issue with Python objects is their memory layout: they have some memory overhead - like PyObject_VAR_HEAD - and even when they are next to each other in the list, the addresses where the actual data is stored can be far apart, which leads to many cache misses.
The way our list was created actually results in the bytes objects lying next to each other in memory (this is an implementation detail of CPython); to really see the effect we can shuffle the list first:
import random
sh_data = list(data)
random.shuffle(sh_data)
sh_key=sh_data[N-1]
%timeit index_cython_2(sh_data, sh_key) # 50.4 ms ± 2.05 ms
That is the impact of cache misses: about 6 times slower!
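A quick way to see the scattering (illustrative, not part of the original measurement) is to compare the addresses of neighbouring elements - in CPython, id() returns the object's address:

# neighbouring list entries no longer point to neighbouring memory after shuffling
print(id(data[1]) - id(data[0]))        # small: the bytes objects were allocated back to back
print(id(sh_data[1]) - id(sh_data[0]))  # typically large and irregular after shuffling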
As we know all elements are 16 bytes long, we can use numpy arrays to fix the issue of fragmented memory:
import numpy as np
sh_data_as_np = np.array(sh_data)
%timeit np.where(sh_data_as_np==sh_key)[0][0] # 8.4 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
sh_data_as_np is of dtype |S16, which means contiguous memory where every 16 bytes (without a trailing \0) form one element. We are back to the old performance, independent of the order of the elements.
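For illustration (a small check, not part of the original answer), one can verify the dtype and that the data lives in a single contiguous buffer:

print(sh_data_as_np.dtype)                   # dtype('S16')
print(sh_data_as_np.flags['C_CONTIGUOUS'])   # True
print(sh_data_as_np.nbytes)                  # 16 * N bytes in one block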
However, np.where doesn't short-circuit and runs through the whole array even if the first element is the one we are looking for. Let's once again use Cython to improve on it (this approach is used for accessing the underlying data):
%%cython
cimport cython
cimport numpy as np
from libc.string cimport memcmp

@cython.boundscheck(False)
@cython.wraparound(False)
def search_cython_c(np.uint8_t[::1] data, np.uint8_t[::1] key):
    cdef int size = len(key)
    cdef int n = len(data) // size
    cdef int i
    for i in range(n):
        # compare the raw bytes of the key with the i-th element
        if memcmp(<void*>&key[0], <void*>&data[i*size], size) == 0:
            return i
    raise ValueError
and now:
%timeit search_cython_c(sh_data_as_np.view(np.uint8), np.array([sh_data[N-1]]).view(np.uint8)) # 4.1 ms ± 148 µs
%timeit search_cython_c(sh_data_as_np.view(np.uint8), np.array([sh_data[N//2]]).view(np.uint8)) # 2.15 ms ± 118 µs
As one can see, the element from the middle is found twice as fast as the element from the end (which is good), and even the worst case is about 2 times faster than the numpy version (even better!). It is hard to tell why Cython outperforms numpy here, probably because the resulting C code can be optimized better by the compiler.
This is still not the end: one could use parallelization or try to improve on memcmp by utilizing the fact that there are always 16 bytes to compare.
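A minimal sketch of the latter idea (an assumption on my part, not benchmarked here): hard-coding the 16-byte element size lets the C compiler replace the memcmp call with a couple of word comparisons:

%%cython
cimport cython
cimport numpy as np
from libc.string cimport memcmp

@cython.boundscheck(False)
@cython.wraparound(False)
def search_cython_16(np.uint8_t[::1] data, np.uint8_t[::1] key):
    # the element size is a compile-time constant here, so the compiler
    # can inline/vectorize the comparison instead of calling memcmp
    cdef Py_ssize_t n = len(data) // 16
    cdef Py_ssize_t i
    for i in range(n):
        if memcmp(<void*>&key[0], <void*>&data[i*16], 16) == 0:
            return i
    raise ValueError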