
I'm rather new to Python and absolutely ignorant of C (unfortunately), so I am struggling to properly understand some aspects of working with Cython.

After profiling a Python program and discovering that it was just a couple of loops that were hogging most of the time, I decided to look into dumping them into Cython. Initially, I just let Cython compile the Python as it was, and the result was a (remarkable!) ~2x speed boost. Cool!

From the Python main, I pass the function two 2-D arrays ("a" and "b") and a float, "d", and it returns a list, "newlist". As examples:

a = [[12.7, 13.5, 1.0], [23.4, 43.1, 1.0], ...]
b = [[0.46, 0.95, 0], [4.56, 0.92, 0], ...]
d = 0.1

Here is the original code, with just the cdefs added for Cython:

def loop(a, b, d):

    cdef int i, j
    cdef double x, y

    newlist = []

    for i in range(len(a)):
        if b[i][2] != 1:
            for j in range(i+1,len(a)):
                if a[i] == a[j] and b[j][2] != 1:
                    x = b[i][0]+b[j][0]
                    y = b[i][1]+b[j][1]
                    b[i][2],b[j][2] = 1,1

                    if abs(y)/abs(x) > d:
                        if y > 0: newlist.append([a[i][0],a[i][1],y])

    return newlist

In "pure Python", this ran (with several ten-thousand loops) in ~12.5s. In Cython it ran in ~6.3s. Great progress with near-zero work done!

However, with a little reading it was clear that much, much more could be done, so I set about applying some type changes to get things going even faster, following the Cython docs on working with NumPy (also referenced in the comments).

Here are the collected modifications, meant to mimic the Cython docs:

import numpy as np
cimport numpy as np

DtypeA = np.float
DtypeB = np.int

ctypedef np.float_t DtypeA_t
ctypedef np.int_t DtypeB_t

def loop(np.ndarray[DtypeA_t, ndim=2] A,
         np.ndarray[DtypeA_t, ndim=2] B,
         np.ndarray[DtypeB_t, ndim=1] C,
         float D):

    cdef Py_ssize_t i, j
    cdef float x, y

    cdef np.ndarray[DtypeA_t, ndim=2] NEW_ARRAY = np.zeros((len(C),3), dtype=DtypeA)

    for i in range(len(C)):
        if C[i] != 1:
            for j in range(i+1,len(C)):
                if A[i][0]==A[j][0] and A[i][1]==A[j][1] and C[j]!= 1:
                    x = B[i][0]+B[j][0]
                    y = B[i][1]+B[j][1]
                    C[i],C[j] = 1,1

                    if abs(y)/abs(x) > D:
                        if y > 0: NEW_ARRAY[i]=([A[i][0],A[i][1],y])

    return NEW_ARRAY

Among other things, I split the previous array "b" into two separate input arrays, "B" and "C", because each row of "b" contained two float elements plus an integer that just acted as a flag; I pulled the flag integers out into the separate 1-D array "C". The inputs now looked like this:

A = [[12.7, 13.5, 1.0], [23.4, 43.1, 1.0], ...]
B = [[0.46, 0.95], [4.56, 0.92], ...]
C = [0, 0, ...]
D = 0.1
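(For reference, the split from the old "b" into "B" and "C" is plain NumPy slicing; a minimal sketch, assuming "b" starts out as the list of rows shown earlier:)

import numpy as np

b = [[0.46, 0.95, 0], [4.56, 0.92, 0]]   # old combined layout
b_arr = np.asarray(b, dtype=np.float64)

B = np.ascontiguousarray(b_arr[:, :2])   # the two float columns
C = b_arr[:, 2].astype(int)              # the flag column; platform int matches np.int_t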

Ideally, this should go much faster with all the variables now being typed(?)...but obviously I'm doing something very wrong, because the function now comes in at 35.3s...way WORSE than the "pure Python"!!

What am I botching so badly? Thanks for reading!

Nordlendingen
  • have you read this: http://docs.cython.org/src/tutorial/numpy.html ? – P.R. May 25 '15 at 15:43
  • yes, and I emulated the suggestions there as best I could, but I'm not sure how to deal with the appending of a list... – Nordlendingen May 25 '15 at 16:38
  • I will try to figure out what the code does exactly, but I am guessing at the moment that using vectorized numpy operations might help – P.R. May 25 '15 at 16:56
  • I will explain the operation of the program and list some of the other things I've tried in an edit now. – Nordlendingen May 25 '15 at 17:01
  • can you post your data somewhere? – Moritz May 25 '15 at 22:07
  • I'd recommend going for some algorithmic improvements. For example, lexicographically sorting `a` will show you which rows of `a` are equal, allowing you to avoid the costly double loop and possibly letting you get rid of the column of `b` you seem to be using to mark a row as processed. – user2357112 May 25 '15 at 23:25
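A rough sketch of that sorting idea, assuming "a" as in the question (this only shows how sorting exposes the equal rows, not the full rewrite of the loop):

import numpy as np

a = [[12.7, 13.5, 1.0], [23.4, 43.1, 1.0], [12.7, 13.5, 2.0]]  # toy data shaped like the question's
a_arr = np.asarray(a)

# np.lexsort sorts by the last key first, so this orders rows by (col 0, then col 1)
order = np.lexsort((a_arr[:, 1], a_arr[:, 0]))
sorted_a = a_arr[order]

# equal (col 0, col 1) pairs are now adjacent; zero row-differences mark duplicates
dups = (np.diff(sorted_a[:, :2], axis=0) == 0).all(axis=1)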

3 Answers


I believe the use of the indexing notation b[j][0] may be throwing Cython off, making it impossible for it to use fast indexing operations behind the scenes. Incidentally, even in pure Python code this style is not idiomatic and may lead to slower code.

Try instead to use the notation b[j,0] throughout and see if it improves your performance.
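Applied to the typed version in the question, the inner block would become something like:

if A[i, 0] == A[j, 0] and A[i, 1] == A[j, 1] and C[j] != 1:
    x = B[i, 0] + B[j, 0]
    y = B[i, 1] + B[j, 1]
    C[i], C[j] = 1, 1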

cfh
  • Wow, yes, that yielded a vastly improved performance...the latter version of the code now ran in 0.2 seconds! But if I modify the former code in the same way, I get the error "TypeError: list indices must be integers, not tuple" on the line "if b[i][2] != 1:" – Nordlendingen May 25 '15 at 23:31
  • @Nordlendingen: This style of indexing won't work on lists of lists, as they are not inherently 2D arrays. For this type of code it's good to work with numpy `ndarrays` anyway. – cfh May 25 '15 at 23:34

Have you compiled manually with cython -a and inspected the line-by-line annotation? (If so, posting an image of it, or writing up what it tells you, would help us.) Lines highlighted in yellow mark places where the generated C code makes heavy use of the CPython API.

For example, you do cdef Py_ssize_t i, j, but there's no reason these can't be plain C integers. Treating them as Py_ssize_t carries some overhead, and if they are merely used as indices in a simple loop with bounds you can easily guarantee, there's no need for it. I don't bring up Py_ssize_t to say you should avoid it in general; if your case involves supporting 64-bit architectures or higher-precision indices, then of course use it. I mention it because little things like this can sometimes have an unexpectedly big impact on when and why a bunch of CPython API calls get roped into Cython code that you thought would be free of the CPython API. Perhaps a better example in your code is the construction of Python bool values and the use of `and` inside the loop, instead of vectorized or C-level versions of these.

Such highlighted locations generally mark places where you won't get a speedup and will often get a slowdown, especially if they are mixed in with NumPy code that, thanks to its tightly optimized extension wrappers, wouldn't otherwise have had to deal with that CPython API overhead. You can let the cython -a output guide you in adding C-level type declarations to replace Python types, and in using C-level functions, such as those from the C math library, instead of Python operations that may have to treat their arguments, even when typed, as potentially being any sort of Python object, with all the attribute lookups and dispatching calls that involves.
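To make those two points concrete, here is a small sketch (count_exceeding is a hypothetical function, not from the question): the loop index is a plain C int, and abs comes from the C math library through Cython's libc wrappers:

from libc.math cimport fabs          # C-level fabs, no Python call involved

def count_exceeding(double[:] xs, double[:] ys, double d):
    # a plain C int index plus C fabs keeps the loop body entirely at C level
    cdef int i, n = 0
    for i in range(xs.shape[0]):
        if fabs(ys[i]) / fabs(xs[i]) > d:
            n += 1
    return n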

ely
  • Hmm, while your suggestion to use `cython -a` is good, isn't `Py_ssize_t` just a typedef for a built-in C integral type that fits in a single general-purpose register? It doesn't seem likely to be a huge performance problem... – SamB Aug 11 '15 at 04:35
  • I tried to be clear that the optimization from `Py_ssize_t` to a C int is utterly minor (it's nothing but reducing precision that may not be needed). My point was just that little things like that can have unexpectedly big impacts. I also pointed out the use of `and` inside the loop, requiring `PyObject` evaluations, which again might not be a huge deal but is the sort of thing that seems innocuous when reading the code but can be expanded into heavily C-API-reliant code. – ely Aug 12 '15 at 02:30

Showing annotations with cython -a is indeed very useful for optimizing Cython code. Here is a version that should be much faster and uses a cleaner syntax with memoryviews:

# cython: boundscheck=False
# cython: cdivision=True
# cython: wraparound=False

import numpy as np
cimport numpy as np


def loop(double [:,::1] A, double [:,::1] B, long [::1] C, float D):

    cdef Py_ssize_t i, j, N
    cdef float x, y

    N = len(C)

    cdef double[:,::1] NEW_ARRAY = np.zeros((N,3), dtype='float64')

    for i in range(N):
        if C[i] != 1:
            for j in range(i+1, N):
                if (A[i,0]==A[j,0]) & (A[i,1]==A[j,1]) & (C[j] != 1):
                    x = B[i,0] + B[j,0]
                    y = B[i,1] + B[j,1]
                    C[i] = 1
                    C[j] = 1

                    if abs(y/x) > D and y > 0:
                        NEW_ARRAY[i,0] = A[i,0]
                        NEW_ARRAY[i,1] = A[i,1]
                        NEW_ARRAY[i,2] = y

    return NEW_ARRAY.base  # .base returns the underlying ndarray rather than the memoryview
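The memoryview signature above is strict about dtype and contiguity, so the caller must pass C-contiguous arrays of the matching types. A hypothetical calling sketch (the data is made up and the compiled module name "loops" is a placeholder):

import numpy as np
from loops import loop

A = np.array([[12.7, 13.5, 1.0], [23.4, 43.1, 1.0]], dtype=np.float64)
B = np.array([[0.46, 0.95], [4.56, 0.92]], dtype=np.float64)
C = np.zeros(len(A), dtype=int)   # platform C long, matching the `long [::1]` view
result = loop(A, B, C, 0.1)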
rth
  • rth and @Mr. F thanks for the very helpful suggestions and code cleanup. Much appreciated! – Nordlendingen May 25 '15 at 23:39
  • @Nordlendingen Btw, could you time the solution above with your data? It's about more than code cleanup: memoryviews should be faster than the array interface you use in the question, and the cython directives in the header remove unnecessary overhead. – rth May 25 '15 at 23:42
  • Yes, indeed, your solution halved the run time once more. Fixing the indexing notation alone brought it down to 0.28 seconds...but your code brought it down further to 0.13 seconds. Thanks again for the really helpful insights. – Nordlendingen May 26 '15 at 00:07
  • I expect you could get a lot more speedup by eliminating the use of `and` and using bitwise operations instead, on C values. With `and` (as well as the *Python* expressions that evaluate to `bool` in those logical operations), you're causing a lot of the CPython API to need to be involved (e.g. to actually produce a temporary `bool` object to hold the value of `A[i,0] == A[j, 0]` and then dispatching on its `__and__` special method) when you don't need to. – ely May 26 '15 at 14:13
  • @Mr.F You are right about that; the produced C code does look cleaner, without temporary variables: annotated version [with `and`](http://r0h.eu/it/static/d/test.html) and the [one with `&`](http://r0h.eu/it/static/d/test_opt.html). However, in the version with `and`, if the first condition `A[i,0]==A[j,0]` is false, the rest is not evaluated, while in the other one all of them have to be evaluated either way, so I'm not sure about the relative speed-up, unless this is handled at the compiler level? – rth May 26 '15 at 15:24
  • I expect that the extra fast lookup for obtaining the value for each expression is much cheaper than invoking the whole `__and__` overhead on *just* the first operand (in the case when short circuiting would help). That may not be right, but it's my hunch, so evaluating the whole thing every time would still be faster than paying the CPython API costs of `__and__` but getting the short circuit benefit. In the end, you could re-write a lot of this using a pure `cdef` C function and get rid of all CPython overhead, so you could still get both if that much optimization was necessary. – ely May 26 '15 at 15:35
  • Hmm, I don't think that `__and__` is actually called. If you click to expand line 21 on the [annotated version with an `and`](http://r0h.eu/it/static/d/test.html), all of it is pretty much pure C and a bunch of pointers, no CPython API calls there. Maybe there is better C code generation with the later Cython version? – rth May 26 '15 at 15:41
  • @rth: You might want to keep in mind that with native code, sometimes you waste more time on the conditional branches than you'd save by not having to evaluate the rest of the condition, what with the pipeline stalls you can run into and all... – SamB Aug 11 '15 at 04:32
  • @SamB I'm aware that conditional branches have a performance cost, but how would one avoid it here? I mean, it's up to the compiler whether all the conditions in an if statement are checked if, say, the first one is false, no? – rth Aug 11 '15 at 08:36