I have a Cython class which looks like this:

cdef class Cls:

    cdef func1(self):
        pass

If I use this class in another library, will I be able to inline func1, which is a class method? Or should I find a way around it (by creating a function that takes a Cls pointer as an argument, for example)?

The Unfun Cat
  • Possible duplicate of [Using self-defined Cython code from other Cython code](https://stackoverflow.com/questions/5331016/using-self-defined-cython-code-from-other-cython-code) – DavidW May 09 '18 at 21:27
  • I know the linked question is a little more involved but the answer is the same so I think it's an appropriate duplicate – DavidW May 09 '18 at 21:28
  • @DavidW If this question is about "how can I make it work?" then this is a duplicate. If this question is about "why can I call `func1` and not `func2`?" or "Is cython able to inline `func2` in another module?" then this is a different (and even quite interesting) question. The question might need a little bit of polishing though, so it is immediately clear what is asked. – ead May 10 '18 at 06:28
  • I agree with you both! Mine is a duplicate, but I was not able to find it on google. I'll change to the inlining question. – The Unfun Cat May 10 '18 at 06:33

1 Answer

There is bad news and good news: inlining isn't possible from another module, but you don't have to pay the full price of a Python function call.

What is inlining? It is done by the C-compiler: when the C-compiler knows the definition of a function it can decide to inline it. This has two advantages:

  1. You don't have to pay the overhead of calling a function
  2. It makes further optimizations possible.

See for example:

%%cython -a
ctypedef unsigned long long ull
cdef ull doit(ull a):
    return a

def calc_sum_fun():
    cdef ull res=0
    cdef ull i
    for i in range(1000000000):#10**9
        res+=doit(i)
    return res

>>> %timeit calc_sum_fun()
53.4 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

How was it possible to do 10^9 additions in 53 nanoseconds? Because it was not done: the C compiler inlined the cdef doit() and was able to calculate the result of the loop at compile time. So at run time the program simply returns the precomputed result.
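As a rough sanity check on what "precomputed" means here: the loop adds up 0, 1, ..., n-1, and that sum has the closed form n(n-1)/2. The plain-Python sketch below is not part of the benchmark, and the names loop_sum and closed_form_sum are made up for illustration. Whether the compiler derives exactly this formula or folds the loop some other way is an assumption; the point is only that the result is a constant that fits in an unsigned long long.

```python
def loop_sum(n):
    """Add 0 + 1 + ... + (n - 1) one term at a time, like the Cython loop."""
    res = 0
    for i in range(n):
        res += i
    return res


def closed_form_sum(n):
    """Same sum without a loop: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# The two agree wherever the loop is cheap enough to run.
for n in (0, 1, 10, 1000):
    assert loop_sum(n) == closed_form_sum(n)

result = closed_form_sum(10**9)
print(result)          # 499999999500000000
print(result < 2**64)  # True: the value fits in an unsigned long long
```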

It follows that the C compiler cannot inline a function from another module, because its definition is hidden in a different C file (translation unit). For example:

#simple.pxd:
ctypedef unsigned long long ull
cdef ull doit(ull a)

#simple.pyx:
cdef ull doit(ull a):
    return a
def doit_slow(a):
    return a

and now accessing it from another cython module:

%%cython
cimport simple
ctypedef unsigned long long ull
def calc_sum_fun():
    cdef ull res=0
    cdef ull i
    for i in range(10000000):#10**7
        res+=simple.doit(i)
    return res

leads to the following timings:

>>> %timeit calc_sum_fun()
17.8 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Because inlining was not possible, the function really has to run the loop. However, it is still much faster than a normal Python call, which we can measure by replacing the cdef doit() with the def doit_slow():

%%cython
import simple              #import, not cimport

ctypedef unsigned long long ull
def calc_sum_fun_slow():
    cdef ull res=0
    cdef ull i
    for i in range(10000000):#10**7
        res+=simple.doit_slow(i)      #slow
    return res

The Python call is about 60 times slower:

>>> %timeit calc_sum_fun_slow()
1.07 s ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
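Dividing the two measured times by the 10**7 iterations gives a rough cost per iteration. The short Python sketch below only reuses the timings quoted above (the variable names are made up); the absolute numbers depend on the machine and compiler, and each figure includes the loop and the addition, not just the call.

```python
ITERATIONS = 10**7

cdef_call_total_s = 17.8e-3  # calc_sum_fun: cdef doit() from another module
python_call_total_s = 1.07   # calc_sum_fun_slow: def doit_slow()

# Convert total seconds into nanoseconds per loop iteration.
cdef_ns = cdef_call_total_s / ITERATIONS * 1e9
python_ns = python_call_total_s / ITERATIONS * 1e9
slowdown = python_call_total_s / cdef_call_total_s

print(f"cdef call:   {cdef_ns:.2f} ns per iteration")   # 1.78 ns
print(f"Python call: {python_ns:.0f} ns per iteration")  # 107 ns
print(f"slowdown:    {slowdown:.0f}x")                   # 60x
```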

But you asked about methods of a cdef class, not global functions. For such methods, inlining is not possible even in the same module:

%%cython

ctypedef unsigned long long ull

cdef class A:
    cdef ull doit(self, ull a):
        return a

def calc_sum_class():
    cdef ull res=0
    cdef ull i
    cdef A a=A()
    for i in range(10000000):#10**7
        res+=a.doit(i)      
    return res

Leads to:

>>> %timeit calc_sum_class()
18.2 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

which is basically the same as in the case where the cdef class is defined in another module.

The reason for this behavior is the way a cdef class is built. It is a lot like virtual classes in C++: the class definition has something similar to a virtual table, called __pyx_vtab:

struct __pyx_obj_12simple_class_A {
  PyObject_HEAD
  struct __pyx_vtabstruct_12simple_class_A *__pyx_vtab;
};

where the pointer to cdef doit() is saved:

struct __pyx_vtabstruct_12simple_class_A {
   __pyx_t_12simple_class_ull (*doit)(struct __pyx_obj_12simple_class_A *, __pyx_t_12simple_class_ull);
};

When we call a.doit() we don't call the function directly but via this pointer:

((struct __pyx_vtabstruct_12simple_class_A *)__pyx_v_a->__pyx_vtab)->doit(__pyx_v_a, __pyx_v_i);

which explains why the C-compiler cannot inline the function doit().
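To see why such an indirect call blocks inlining, here is a small plain-Python model of the dispatch. It is only a sketch: VTAB_A, VTAB_B, Obj, a_doit, b_doit and call_doit are made-up names for illustration, not what Cython generates. The call site only knows "call whatever the doit slot of this object's table points to", and a subclass may have put a different function into that slot, so the target is not fixed when the calling code is compiled.

```python
def a_doit(self, x):
    """Implementation that class A stores in its table."""
    return x


def b_doit(self, x):
    """Override that a hypothetical subclass B stores in the same slot."""
    return 2 * x


# One table of function references per class, playing the role of __pyx_vtab.
VTAB_A = {"doit": a_doit}
VTAB_B = {"doit": b_doit}


class Obj:
    """Stand-in for the object struct: it only carries a reference to a table."""

    def __init__(self, vtab):
        self.vtab = vtab


def call_doit(obj, x):
    # Mirrors obj->__pyx_vtab->doit(obj, x): look up the slot, then call it.
    return obj.vtab["doit"](obj, x)


print(call_doit(Obj(VTAB_A), 21))  # 21
print(call_doit(Obj(VTAB_B), 21))  # 42
```

The same call_doit line produces different results depending on which table the object carries, which is exactly the information the C compiler lacks at the call site.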

ead