
Introduction

I am trying to implement some number crunching on a one-dimensional array both in plain C (hereafter: standalone) and as a Numpy module written in C (hereafter: module). Since all I need to do with the array is to compare selected elements, I can use an abstraction layer for the array access and thus share the same code between the standalone and the module.

Now, I expect the module to be somewhat slower, since comparing elements of a Numpy array of unknown type via descr->f->compare requires extra function calls and the like, and is thus more costly than the analogous operation on a C array of known type. However, when looking at the output of a profiler (Valgrind), I found runtime increases in the module for lines which have no obvious connection to the Python methods. I want to understand and avoid this, if possible.

Minimal example

Unfortunately, my minimal example is quite lengthy. Note that, due to the reduction, the Python variant is no longer a real module.

# include <stdlib.h>
# include <stdio.h>

# ifdef PYTHON
    # include <Python.h>
    # include <numpy/arrayobject.h>

    // Define Array creation and access routines for Python.

    typedef PyArrayObject * Array;

    static inline char diff_sign (Array T, int i, int j)
    {
        return T->descr->f->compare ( PyArray_GETPTR1(T,i), PyArray_GETPTR1(T,j), T );
    }

    Array create_array (int n)
    {
        npy_intp dims[1] = {n};
        Array T = (Array) PyArray_SimpleNew (1, dims, NPY_DOUBLE);
        for (int i=0; i<n; i++)
            {* (double *) PyArray_GETPTR1(T,i) = i;} // Line A
        return T;
    }
#endif

# ifdef STANDALONE
    // Define Array creation and access routines for standalone C.

    typedef double * Array;

    static inline char diff_sign (Array T, int i, int j)
    {
        return (T[i]>T[j]) - (T[i]<T[j]);
    }

    Array create_array (int n)
    {
        Array T = malloc (n*sizeof(double));
        for (int i=0; i<n; i++) {T[i] = i;} // Line B
        return T;
    }
# endif

int main()
{
    # ifdef PYTHON
        Py_Initialize();
        import_array();
    # endif

    // Prevent the compiler from knowing the values of certain variables at compile time.
    int volatile blur = 0;

    int n = 1000;
    Array T = create_array (n);

    # ifdef PYTHON
        for (int i=0; i<n; i++)
            {* (double *) PyArray_GETPTR1(T,i) = i;} // Line C
    # endif
    # ifdef STANDALONE
        for (int i=0; i<n; i++) {T[i] = i;} // Line D
    #endif

    int a = 333 + blur;
    int b = 444 + blur;
    int c = 555 + blur;
    int e = 666 + blur;
    int f = 777 + blur;
    int g =   1 + blur;
    int h =   n + blur;

    // Line E                           standa.  module
    for (int i=h; i>0; i--)             // 4000  8998
    {
        int d = c;
        do c = (c+a)%b;                 // 4000  5000
        while (c>n-1);                  // 2000  2000

        if (c==e) f*=2;                 // 3000  3000
        if ( diff_sign(T,c,d)==g ) f++; // 5000  5000
    }

    printf("%i\n", f);
}

I compiled this with the following two commands:

gcc source.c -o standalone -O3 -g -std=c11 -D STANDALONE
gcc source.c -o module -O3 -g -std=c11 -D PYTHON -lpython2.7 -I/usr/include/python2.7

Changing to -O2 does not change the following results; changing the compiler to Clang changes the minimal example but not the phenomenon in my actual code.
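For reference, the analogous Clang invocations (these simply mirror the GCC commands above and assume Clang is installed) would be:

clang source.c -o standalone -O3 -g -std=c11 -D STANDALONE
clang source.c -o module -O3 -g -std=c11 -D PYTHON -lpython2.7 -I/usr/include/python2.7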

Profiling results

The interesting things happen after Line E; I give the total runtime spent in each of those lines, as reported by the profiler, as comments in the source code. Despite having no direct relation to whether I compile as standalone or as module, the runtimes for these lines differ strongly. In particular, in my actual application, the additional time spent in those lines in the module accounts for one fourth of the module’s total runtime.

What is even stranger: if I remove line C (and D) – which is redundant in the example, as the array’s values are already set (to the same values) in line A (and B) – the runtime spent in the loop header drops from 8998 to 6002 (the other reported runtimes do not change). The same thing happens if I change int n = 1000; to int n = 1000 + blur;, i.e., if I make n unknown at compile time.

This does not make much sense to me and since it has a relevant impact on the runtime, I would like to avoid it.

Questions

  • Where do these runtime increases come from? I am aware that compilers are not perfect and sometimes work in seemingly mysterious ways, but I would like to understand.
  • How can I avoid these runtime increases?
Wrzlprmft
  • If those timing differences are repeatable, and if you have correctly associated them with source lines, then I'd be inclined to guess that something declared in `Python.h` or `numpy/arrayobject.h` is causing that code to be compiled differently. It might be as simple as certain optimizations being disabled via pragma or custom preprocessor directive. – John Bollinger Feb 19 '15 at 18:36
  • @JohnBollinger: That would not explain the phenomenon I reported in the second paragraph of *Profiling results.* – Wrzlprmft Feb 19 '15 at 18:48
  • On the contrary, the fact that performance below line E is affected by the presence or absence of line C makes it *even more* likely that differing optimization choices by the compiler are responsible for at least part of the timing differences you observe. Still, upon reflection, I think it's more likely the inherent nature of the header contents that affects the optimizations chosen, rather than explicit switches disabling specific optimizations. – John Bollinger Feb 19 '15 at 19:05
  • @JohnBollinger: I fail to follow you. How can including a line that is identical to a previous line cause additional pragmas or preprocessor directives to become active? I also found another way to trigger this change, that has nothing to do with any library functions at all (see my edit). – Wrzlprmft Feb 19 '15 at 20:16
  • I did not say that including line C or not might cause different pragmas or preprocessing directives to be active. I said it very well could lead the compiler to make different optimization decisions. The compiler chooses among a great many specific optimization approaches, and the standard -O*n* options say only which ones it *may* use (in a compiler-specific manner). – John Bollinger Feb 19 '15 at 20:22
  • As for the updates to your question, the more you find distinct modifications to the code that alter the performance characteristics of other parts, the more it sounds like the performance differences arise from the compiler making different optimization choices. If you want to try to rule that out then compile with `-O0` and compare. – John Bollinger Feb 19 '15 at 20:30
  • @JohnBollinger: With `-O0`, the distinctions unsurprisingly vanish. As I wrote, I am well aware that compilers make imperfect choices. My question is what triggers these choices here (since I have never seen such bizarre effects) and how I can avoid them. – Wrzlprmft Feb 19 '15 at 21:04
  • To the extent that differing optimization choices explain the performance difference, not much can be said at a general level about why the compiler chooses the approach it does. That depends on exactly which compiler you use, and on the details of the code. If your compiler affords the opportunity to enable / disable specific optimizations (as GCC does, for example) then you could experiment with that. You could also consider looking at the assembly listings to see how they differ. – John Bollinger Feb 19 '15 at 21:50

1 Answer


You have to be very careful when interpreting Callgrind profiles. Callgrind gives you the instruction fetch count, that is, the number of instructions executed. This is not directly connected to actual performance on modern CPUs, as instructions can have different latencies and throughputs and can be reordered by suitably capable CPUs.

Also, you are matching the instruction fetches to the lines that the debug symbols associate them with. Those do not correspond exactly: e.g., the module code associates a register copy and a nop instruction (which are essentially free in terms of runtime compared to the following division) with the loop line of the source code, while the standalone code associates them with the line above. You can see this in the machine-code tab of kcachegrind when the profile was collected with --dump-instr=yes. This will have something to do with different registers being available in the two variants, due to the different number of function calls that imply spilling values onto the stack.
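For reference, collecting such an instruction-level profile could look roughly as follows (the output file name contains the process ID by default):

valgrind --tool=callgrind --dump-instr=yes ./module
kcachegrind callgrind.out.<pid>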

Let’s look at the modulo loops to see whether there is a significant runtime difference:

module:

  400b58:       42 8d 04 3b             lea    (%rbx,%r15,1),%eax
  400b5c:       99                      cltd   
  400b5d:       41 f7 fe                idiv   %r14d
  400b60:       81 fa e7 03 00 00       cmp    $0x3e7,%edx
  400b66:       89 d3                   mov    %edx,%ebx
  400b68:       7f ee                   jg     400b58 <main+0x1b8>

standalone:

  4005f0:       8d 04 32                lea    (%rdx,%rsi,1),%eax
  4005f3:       99                      cltd   
  4005f4:       f7 f9                   idiv   %ecx
  4005f6:       81 fa e7 03 00 00       cmp    $0x3e7,%edx
  4005fc:       7f f2                   jg     4005f0 <main+0x140>

The difference is one register-to-register copy, mov %edx,%ebx (again likely caused by different register pressure due to the earlier function calls). This is one of the cheapest operations available on a CPU, probably taking around 1-2 cycles with good throughput, so it should have no measurable effect on the actual wall time. The idiv instruction is the expensive part; it should take around 20 cycles with poor throughput. So the instruction fetch count here is grossly misleading.

A better tool for such detailed profiling is a sampling profiler like perf record/report. When you run the program long enough, you will be able to single out the instructions that cost a lot of time, though even then the high sample counts will not match up exactly with the slow instructions, as the CPU may execute later, independent instructions in parallel with the slow ones.
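For example (assuming perf is available and the binaries are built with -g as above), a minimal session could look like this:

perf record ./module
perf report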

jtaylor