Introduction
I am trying to implement some number crunching on a one-dimensional array both in plain C (hereafter: standalone) and as a Numpy module written in C (hereafter: module). Since all I need to do with the array is compare selected elements, I can put the array access behind an abstraction layer and use the same code for the standalone and the module.
Now, I expect the module to be somewhat slower, since comparing elements of a Numpy array of unknown type using descr->f->compare requires extra function calls and similar, and is thus more costly than the analogous operation for a C array of known type. However, when looking at the output of a profiler (Valgrind), I found runtime increases in the module for lines which have no obvious connection to the Python methods. I want to understand and avoid this, if possible.
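The overhead that is expected here (as opposed to the unexplained one) can be illustrated with a minimal sketch. It does not use the Numpy API: GenericArray, compare_fn, compare_double, generic_diff_sign and typed_diff_sign are hypothetical names that only mimic the shape of a comparison through a function pointer such as descr->f->compare next to a comparison on a C array of known type.

```c
#include <stddef.h>

/* Hypothetical stand-in for a type-agnostic comparison callback:
   it only sees untyped pointers to the two elements. */
typedef int (*compare_fn)(const void *, const void *);

/* Comparison for doubles, returning -1, 0 or 1. */
int compare_double(const void *p, const void *q)
{
    double x = *(const double *)p;
    double y = *(const double *)q;
    return (x > y) - (x < y);
}

/* Hypothetical generic array: element stride and comparison routine
   are only known at run time. */
typedef struct {
    char *data;
    size_t stride;
    compare_fn compare;
} GenericArray;

/* Generic variant: pointer arithmetic plus an indirect call that the
   compiler usually cannot inline. */
int generic_diff_sign(const GenericArray *T, int i, int j)
{
    return T->compare(T->data + (size_t)i * T->stride,
                      T->data + (size_t)j * T->stride);
}

/* Typed variant: two loads and two comparisons, easily inlined. */
int typed_diff_sign(const double *T, int i, int j)
{
    return (T[i] > T[j]) - (T[i] < T[j]);
}
```

Both functions return the same sign for the same data; any extra cost of the generic variant is confined to the call itself, which is why a slowdown in unrelated lines is surprising.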
Minimal example
Unfortunately, my minimal example is quite lengthy. Note that the Python variant is no longer a real module, because I reduced the example as far as possible.
#include <stdlib.h>
#include <stdio.h>

#ifdef PYTHON
#include <Python.h>
#include <numpy/arrayobject.h>

// Define Array creation and access routines for Python.
typedef PyArrayObject * Array;

static inline char diff_sign (Array T, int i, int j)
{
    return T->descr->f->compare ( PyArray_GETPTR1(T,i), PyArray_GETPTR1(T,j), T );
}

Array create_array (int n)
{
    npy_intp dims[1] = {n};
    Array T = (Array) PyArray_SimpleNew (1, dims, NPY_DOUBLE);
    for (int i=0; i<n; i++)
        {* (double *) PyArray_GETPTR1(T,i) = i;} // Line A
    return T;
}
#endif

#ifdef STANDALONE
// Define Array creation and access routines for standalone C.
typedef double * Array;

static inline char diff_sign (Array T, int i, int j)
{
    return (T[i]>T[j]) - (T[i]<T[j]);
}

Array create_array (int n)
{
    Array T = malloc (n*sizeof(double));
    for (int i=0; i<n; i++) {T[i] = i;} // Line B
    return T;
}
#endif

int main()
{
    #ifdef PYTHON
    Py_Initialize();
    import_array();
    #endif

    // Prevents the compiler from knowing the values of certain variables at compile time.
    int volatile blur = 0;

    int n = 1000;
    Array T = create_array (n);

    #ifdef PYTHON
    for (int i=0; i<n; i++)
        {* (double *) PyArray_GETPTR1(T,i) = i;} // Line C
    #endif
    #ifdef STANDALONE
    for (int i=0; i<n; i++) {T[i] = i;} // Line D
    #endif

    int a = 333 + blur;
    int b = 444 + blur;
    int c = 555 + blur;
    int e = 666 + blur;
    int f = 777 + blur;
    int g = 1 + blur;
    int h = n + blur;

    // Line E                           standa.  module
    for (int i=h; i>0; i--)          // 4000     8998
    {
        int d = c;
        do c = (c+a)%b;              // 4000     5000
        while (c>n-1);               // 2000     2000
        if (c==e) f*=2;              // 3000     3000
        if ( diff_sign(T,c,d)==g ) f++; // 5000  5000
    }

    printf("%i\n", f);
}
I compiled this with the following two commands:
gcc source.c -o standalone -O3 -g -std=c11 -D STANDALONE
gcc source.c -o module -O3 -g -std=c11 -D PYTHON -lpython2.7 -I/usr/include/python2.7
Changing to -O2 does not change the following; changing the compiler to Clang does change the minimal example but not the phenomenon with my actual code.
Profiling results
The interesting things happen after Line E. I have given the total runtime spent in those lines, as reported by the profiler, as comments in the source code. Although these lines have no direct relation to whether I compile as standalone or module, their runtimes differ strongly. In particular, in my actual application, the additional time spent in those lines in the module accounts for one fourth of the module's total runtime.
What is even weirder is that if I remove Line C (and D), which is redundant in the example because the array's values are already set (to the same values) in Line A (and B), the runtime spent in the loop header drops from 8998 to 6002 (the other reported runtimes do not change). The same thing happens if I change int n = 1000; to int n = 1000 + blur;, i.e., if I make n unknown at compile time.
This does not make much sense to me and since it has a relevant impact on the runtime, I would like to avoid it.
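For reference, the two ways of declaring n can be put side by side in a small sketch. The helper names next_index, step_known_n and step_unknown_n are hypothetical and not part of my code; the body of next_index is the do-while from the loop after Line E. The only difference between the two variants is whether the compiler can see the value of n.

```c
/* Hypothetical helper: one index update, as in the loop after Line E. */
static int next_index(int c, int a, int b, int n)
{
    do c = (c + a) % b;
    while (c > n - 1);
    return c;
}

/* Variant 1: n is a literal constant, so the compiler may fold n - 1
   and use the known value when optimising the surrounding code. */
int step_known_n(int c, int a, int b)
{
    int n = 1000;
    return next_index(c, a, b, n);
}

/* Variant 2: n depends on a volatile read, so its value is only
   available at run time. */
int step_unknown_n(int c, int a, int b)
{
    int volatile blur = 0;
    int n = 1000 + blur;
    return next_index(c, a, b, n);
}
```

Both variants compute the same indices, so any difference between them can only come from how the compiler optimises the code around n.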
Questions
- Where do these runtime increases come from? I am aware that compilers are not perfect and sometimes work in seemingly mysterious ways, but I would like to understand.
- How can I avoid these runtime increases?