
In this example, I show two different methods for creating a list of strings using Cython. One fills a C array of character strings (using the strcpy C function); the other simply appends elements to a Python list.

I then pass each of these lists to the set constructor and find that their performance is drastically different.

Question - what can I do so that the list created using character pointers has performance equal to the plain Python list?

A simple function to create lists in Cython

from libc.string cimport strcpy

def make_lists():
    cdef:
        char c_list[100000][3]
        Py_ssize_t i
        list py_list = []

    for i in range(100000):
        strcpy(c_list[i], b'AB')   # copy the bytes into the C buffer
        c_list[i][2] = c'\0'       # explicitly terminate the string
        py_list.append(b'AB')      # append the literal to the Python list

    return c_list, py_list

Here, c_list is a C array of 100000 three-character buffers, which Cython converts to a Python list when it is returned. py_list is an ordinary Python list. Both lists are filled with the same single sequence of bytes, b'AB'.

Create the lists

c_list, py_list = make_lists()

Print out some of the contents

>>> c_list[:10]
[b'AB', b'AB', b'AB', b'AB', b'AB', b'AB', b'AB', b'AB', b'AB', b'AB']

Show both lists are equal

>>> c_list == py_list
True

Time the set operations - there is a 3x difference, which is insane to me!

%timeit set(c_list)
2.85 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit set(py_list)
1.02 ms ± 26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Unicode and pure Python

Interestingly, the performance difference vanishes if I decode each value to unicode, though both decoded lists are slower than the original set(py_list). If I create a unicode list in pure Python, then I am back to the original performance.

c_list_unicode = [v.decode() for v in c_list]
py_list_unicode = [v.decode() for v in py_list]
py_list_py = ['AB' for _ in range(len(py_list))]

%timeit set(c_list_unicode)
1.63 ms ± 56.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit set(py_list_unicode)
1.7 ms ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit set(py_list_py)
987 µs ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
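
One thing I notice, which may explain the remaining gap: each .decode() call produces a distinct str object, while the 'AB' literal in the pure-Python comprehension is a single shared constant.

>>> c_list_unicode[0] is c_list_unicode[1]
False
>>> py_list_py[0] is py_list_py[1]
True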

Even simpler example

def make_lists2():
    cdef:
        char *c_list[100000]
        Py_ssize_t i
        list py_list_slow = []
        list py_list_fast = []

    for i in range(100000):
        c_list[i] = b'AB'               # point at the literal's buffer
        py_list_slow.append(c_list[i])  # coerces the char* to a bytes object
        py_list_fast.append(b'AB')      # appends the literal directly

    return c_list, py_list_slow, py_list_fast

Timings

c_list2, py_list_slow, py_list_fast = make_lists2()

%timeit set(c_list2)
3.01 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit set(py_list_slow)
3.05 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit set(py_list_fast)
1.08 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Edit

Possible solution

I found the function PyUnicode_InternFromString in the unicode Python C API, and with it I get performance on par with regular Python lists. This 'interns' the string, though I am not sure exactly what that means.
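
A minimal sketch of how this can be called from Cython (the extern declaration and the name make_lists_interned are mine; interning makes CPython keep one canonical str object per distinct contents, so repeated values share identity):

from libc.string cimport strcpy

cdef extern from "Python.h":
    unicode PyUnicode_InternFromString(const char *v)

def make_lists_interned():
    cdef:
        char c_list[100000][3]
        Py_ssize_t i
        list py_list = []

    for i in range(100000):
        strcpy(c_list[i], b'AB')
        # The interned str for these contents is created once and then
        # reused, so every entry of py_list is the same object.
        py_list.append(PyUnicode_InternFromString(c_list[i]))

    return py_list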

Ted Petrou
  • This has little to do with `cython`. You are comparing the performance of making a set from a list versus a general buffer (coming from the C world). I would actually expect that performance drop. With your unicode example you are converting everything to Python lists, hence the performance gap vanishes. – romeric Mar 19 '18 at 16:48
  • But, both objects are Python lists, since I am taking the set in pure python. I'm guessing their underlying memory allocation is vastly different. Also, after conversion to unicode, the lists are still 2x as slow as the pure python unicode list. – Ted Petrou Mar 19 '18 at 16:54
  • Could be. Converting from python list to python set should be more optimised than a list created externally. There might also be some safe-guarding happening in the latter case. – romeric Mar 19 '18 at 17:02
  • This sounds almost certainly like it's because `c_list` is a list of different bytestrings and `py_list` is a list of 100000 references to the same bytestring. I don't have Cython installed, so I can't confirm. – user2357112 Mar 19 '18 at 18:36
  • Also, doesn't `strcpy` copy the null terminator for you? – user2357112 Mar 19 '18 at 18:39

1 Answer


Your c_list is a list of 100000 distinct bytestrings with the same contents. Cython has to convert each char[3] to a bytestring separately, and it doesn't bother to do any object deduplication.

Your py_list is a list of the same bytestring object 100000 times. Every py_list.append(b'AB') appends the same object to py_list; without the trip through a C array, Cython never needs to copy the bytestring.

set(c_list) is slower than set(py_list) because set(c_list) has to actually perform string comparison, while set(py_list) gets to skip that with an object identity check.
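
The effect is reproducible in pure Python (a sketch; bytes(bytearray(...)) is just one way to force distinct-but-equal objects):

from timeit import timeit

# 100000 brand-new, equal bytestrings: the duplicate check in set()
# fails the identity test, so it falls back to comparing contents.
distinct = [bytes(bytearray(b'AB')) for _ in range(100000)]

# 100000 references to one object: the duplicate check short-circuits
# on pointer identity and never compares contents.
shared = [b'AB'] * 100000

assert distinct == shared              # equal by value
assert shared[0] is shared[1]          # one identity
assert distinct[0] is not distinct[1]  # all distinct

print(timeit(lambda: set(distinct), number=100))  # noticeably slower
print(timeit(lambda: set(shared), number=100))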

user2357112
  • Thank you for the explanation. Could you repost my 'even simpler example' with the function `PyUnicode_InternFromString` and the new timings to show that they are now identical? – Ted Petrou Mar 19 '18 at 19:04