
How do you benchmark a function? Looking at results from callgrind, I have found that my program spends a lot of time in pow. Since I do not need full working precision, I thought that I could create a look-up table and use linear interpolation between the points in the table (a rough sketch of that idea is at the end of this question). To be able to evaluate the look-up-table approach, I need to measure time. So I did this:

#ifdef __WAND__
target[name[test2.exe] type[application] platform[;Windows]]
target[name[test2] type[application]]
#endif

#include <herbs/main/main.h>
#include <herbs/tictoc/tictoc.h>
#include <herbs/array_fixedsize/array_fixedsize.h>
#include <random>
#include <cstdio>
#include <cmath>

class GetRand
    {
    public:
        GetRand(double min,double max):U(min,max){}

        bool operator()(double* val,size_t n,size_t N)
            {
            *val=U(randsource);
            return 1;
            }

    private:
        std::mt19937 randsource;
        std::uniform_real_distribution<double> U;
    };

int MAIN(int argc,charsys_t* argv[])
    {
    Herbs::ArrayFixedsize<double> vals(1024*1024*128,GetRand(-4,4));

    const size_t N=16;
    auto n=N;
    while(n)
        {
        double start=0;
        auto ptr=vals.begin();
            {
            Herbs::TicToc timestamp(start);
            while(ptr!=vals.end())
                {
                pow(2,*ptr);
                ++ptr;
                }
            }
    //  I have set cpu-freq to 1.6 GHz using cpufreq-set
        printf("%.15g\t",1.6e9*start/vals.length());
        --n;
        }
    return 0;
    }

When running this program, the output is about 2.25 cycles per iteration. This seems very low, since the implementation of pow does not look trivial (callgrind gave me __ieee754_pow).

The benchmark loop in assembly looks like this when compiling for GNU/Linux on x86-64:

    call    _ZN5Herbs6TicTocC1ERd@PLT
    movq    %r14, %rbx
    .p2align 4,,10
    .p2align 3
.L28:
    vmovsd  (%rbx), %xmm1
    vucomisd    .LC6(%rip), %xmm1
    jb  .L25
    vmovsd  .LC7(%rip), %xmm0
    call    pow@PLT
.L25:
    addq    $8, %rbx
    cmpq    %r12, %rbx
    jne .L28
    movq    %rbp, %rdi
    call    _ZN5Herbs6TicTocD1Ev@PLT

At least pow is called. Can I trust the output, or is there some black magic that eliminates things?
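For reference, the look-up-table idea I want to evaluate is roughly the sketch below (the class name Pow2Table, the table size, and the fallback to std::pow outside the tabulated range are just placeholders, not final code):

#include <cmath>
#include <cstddef>
#include <vector>

// Tabulates 2^x at evenly spaced points in [x_min,x_max] and interpolates
// linearly between neighbouring points; values outside the range fall back
// to std::pow so the approximation never reads outside the table.
class Pow2Table
    {
    public:
        Pow2Table(double x_min,double x_max,size_t n_points):
            x_min(x_min),step((x_max-x_min)/(n_points-1)),table(n_points)
            {
            for(size_t k=0;k<n_points;++k)
                {table[k]=std::pow(2.0,x_min+k*step);}
            }

        double operator()(double x) const
            {
            double pos=(x-x_min)/step;
            if(pos<0 || pos>=double(table.size()-1))
                {return std::pow(2.0,x);}         // outside the table: exact fallback
            size_t k=size_t(pos);
            double t=pos-k;                       // fractional part in [0,1)
            return table[k]+t*(table[k+1]-table[k]);
            }

    private:
        double x_min;
        double step;
        std::vector<double> table;
    };

Whether something like that actually beats the library pow is exactly what the benchmark above is supposed to tell me.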

user877329
  • Questions with code need to have a language tag. Otherwise how does anyone know what this is? – Thomas Aug 09 '14 at 13:34
  • @Thomas Benchmarking has the same issues in all languages: a non-deterministic environment, and compiler and linker optimizations. The first may be eliminated by running the test more than once; the second is much trickier. – user877329 Aug 09 '14 at 13:39
  • Note that there is a function `exp2` if you only need powers of 2. – Marc Glisse Aug 09 '14 at 13:45
  • @MarcGlisse I tried this too, but it was 10 times slower on x86-MinGW, if the benchmark algorithm works. But before deciding which function works best, I need to be sure that the benchmark works. – user877329 Aug 09 '14 at 13:48
  • Try `-fno-builtin-pow`. If you use a debugger or a tracer or anything, you will notice that pow is not actually called in your program. `jb` makes you skip the call. Or at least store the result of `pow` in a volatile variable. – Marc Glisse Aug 09 '14 at 14:05
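Applied to the timed block in the question, the volatile-sink variant from the last comment would look roughly like this (same surrounding code as above; sink is just an illustrative name):

volatile double sink=0;
    {
    Herbs::TicToc timestamp(start);
    while(ptr!=vals.end())
        {
        // Accumulating into the volatile makes the result of pow observable,
        // so the compiler can no longer skip or remove the call.
        sink+=pow(2,*ptr);
        ++ptr;
        }
    }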

1 Answer


There are a few things you need to consider when benchmarking functions.

1) Make sure that cache misses don't influence the results significantly. In your case you iterate through a large array of data, which gives you tons of cache misses. Use a smaller array instead, one that easily fits in L1 cache, and loop through it several times.

2) Make sure that the function calls you are profiling have side effects, so that the compiler can't optimize the calls out. In your case the compiler doesn't do a good job of optimizing, since the pow() calls are not optimized out even though they have no side effects. Prefer integer side effects to avoid anomalies in floating-point performance (e.g. raw-cast each float to uint32 and add those up instead of doing the addition with floats).

3) Unroll your loops a few times to reduce the overhead of looping. Currently you perform only a single pow per iteration, so the loop adds relatively large overhead for this simple function call (a combined sketch of points 1) to 3) follows this list).

4) Profile with full optimizations and inlining enabled.

5) Run the profiling multiple times to ensure that other processes don't influence your results. Pick the best result for comparison (i.e. the one with the least interference from other processes).
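A minimal self-contained sketch that pulls points 1) to 3) together could look like the following (plain std::chrono timing instead of your Herbs::TicToc wrapper; the array size, repeat count and unroll factor are arbitrary choices):

#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <random>
#include <vector>

int main()
    {
    // 1) Small working set that fits easily in L1 cache (1024 doubles = 8 KiB)
    const size_t N=1024;
    std::mt19937 rng;
    std::uniform_real_distribution<double> U(-4,4);
    std::vector<double> vals(N);
    for(auto& v:vals)
        {v=U(rng);}

    const size_t reps=10000;
    std::uint64_t sink=0;   // 2) integer side effect so the calls cannot be elided

    auto t0=std::chrono::steady_clock::now();
    for(size_t r=0;r<reps;++r)
        {
        // 3) Unrolled by four to reduce the loop overhead per pow call
        for(size_t i=0;i<N;i+=4)
            {
            double y0=std::pow(2,vals[i]);
            double y1=std::pow(2,vals[i+1]);
            double y2=std::pow(2,vals[i+2]);
            double y3=std::pow(2,vals[i+3]);
            // Same idea as the raw-cast above: fold the bits into an integer sum
            std::uint64_t b;
            std::memcpy(&b,&y0,sizeof b);sink+=b;
            std::memcpy(&b,&y1,sizeof b);sink+=b;
            std::memcpy(&b,&y2,sizeof b);sink+=b;
            std::memcpy(&b,&y3,sizeof b);sink+=b;
            }
        }
    auto t1=std::chrono::steady_clock::now();

    double seconds=std::chrono::duration<double>(t1-t0).count();
    // Assumes the CPU is still pinned at 1.6 GHz as in the question
    printf("%.15g cycles/call\t(sink=%llu)\n",
        1.6e9*seconds/(reps*N),(unsigned long long)sink);
    return 0;
    }

Compile it with full optimizations (e.g. g++ -O2), run it several times, and keep the best result, as per points 4) and 5).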

JarkkoL