
I'm trying to understand how to integrate C routines into my Python scripts. As a test case, I'm adding two numpy arrays.

I've got a C file called test.c:

void add(int count, float* array_a, float* array_b, float* array_c)
{
    /* element-wise addition: array_c[i] = array_a[i] + array_b[i] */
    int ii;
    for (ii = 0; ii < count; ii++){
        array_c[ii] = array_a[ii] + array_b[ii];
    }
}

One can compile this into a .so (shared object) with:

gcc -c -fPIC test.c -o test.o
gcc test.o -shared -o test.so
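
Note that without an -O flag gcc compiles at its default -O0, so the loop above is not vectorized. For a fairer comparison against numpy one might enable optimizations; the flags below are only a sketch and assume a reasonably recent gcc on x86:

gcc -c -O3 -march=native -fPIC test.c -o test.o
gcc test.o -shared -o test.so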

This lets me call the function "add" from Python and execute the C code:

import numpy as np
import ctypes
from numpy.ctypeslib import ndpointer

size=100
a=np.ones(size).astype(np.float32)
b=np.ones(size).astype(np.float32)
c=np.zeros(size).astype(np.float32)

lib = ctypes.cdll.LoadLibrary('./test.so')
fun = lib.add  # handle to the compiled C function
fun.restype = None
fun.argtypes = [ctypes.c_int,
                ndpointer(ctypes.c_float),
                ndpointer(ctypes.c_float),
                ndpointer(ctypes.c_float)]
# The three ndpointer arguments pass the a, b and c arrays to the C function as raw float pointers.

%timeit fun(size,a,b,c)
%timeit c=a+b
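
As an aside (not part of the timing question itself), ndpointer also accepts a flags argument; that is standard numpy.ctypeslib and makes ctypes additionally require C-contiguous memory, guarding against accidentally passing a strided view. A minimal sketch:

# stricter argtypes: only C-contiguous float32 arrays are accepted
fun.argtypes = [ctypes.c_int,
                ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),
                ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),
                ndpointer(ctypes.c_float, flags="C_CONTIGUOUS")]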

The C function requires 11 µs, while the numpy addition requires 442 ns. Where does this difference in timing come from? Where is the hidden cost here?

Mathusalem
  • Most numpy distributions are highly optimized; many of its algorithms make use of parallelism. Your algorithm is 'naive' — that's not an insult :-) but means it is as simple and straight-forward as possible. You can make it much faster e.g. by using SSE instructions, multiple threads, or loop unrolling. – Norman Feb 05 '17 at 19:33
  • How do the times change when `size` is increased? I suspect the calling overhead exceeds the computational cost when size is small, like 100. When I increase size to 1000, my `timeit` only increases 3x. Another 10x increase increases time 6x. Plus `timeit` still gives a caching warning. – hpaulj Feb 05 '17 at 22:15
  • Thanks Norman, could you give a bit more detail about the SSE instructions (are these done through compiler flags?) and the loop unrolling? – Mathusalem Feb 06 '17 at 09:30
  • Hey Hpaulj, yes the timing difference becomes almost irrelevant for very large arrays. – Mathusalem Feb 06 '17 at 09:31
  • I overlooked that `size` is very small, so I think hpaulj has your answer. Now that you say the difference vanishes with larger arrays, maybe gcc vectorizes your loop [automatically](http://stackoverflow.com/a/409302). So you don't even need to know [how to use SSE instructions](http://stackoverflow.com/q/1389712) manually. [Loop unrolling](http://stackoverflow.com/q/2349211) probably won't benefit your example much. An easy method for parallelizing loops is using [OpenMP](https://gcc.gnu.org/wiki/openmp), which can even be combined with vectorization (SSE) for maximum effect. – Norman Feb 07 '17 at 23:04
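
Following up on the vectorization and OpenMP suggestions in the comments above, here is a minimal sketch of what test.c could look like with an OpenMP-parallelized loop. The pragma and the -fopenmp flag are standard gcc/OpenMP; whether this actually helps at size=100 is doubtful, since the per-call overhead dominates there.

/* test.c with the loop annotated for OpenMP */
void add(int count, float* array_a, float* array_b, float* array_c)
{
    int ii;
    #pragma omp parallel for
    for (ii = 0; ii < count; ii++){
        array_c[ii] = array_a[ii] + array_b[ii];
    }
}

Built with e.g. gcc -O3 -march=native -fopenmp -fPIC -shared test.c -o test.so, the loop can be auto-vectorized and split across threads.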
