How do you benchmark a function?
Looking at results from callgrind, I have found that my program spends a lot of time in pow. Since I do not need full working precision, I thought that I could create a look-up table and use linear interpolation between the points in the table. To evaluate the look-up-table approach, I need to measure time. So I did this:
#ifdef __WAND__
target[name[test2.exe] type[application] platform[;Windows]]
target[name[test2] type[application]]
#endif
#include <herbs/main/main.h>
#include <herbs/tictoc/tictoc.h>
#include <herbs/array_fixedsize/array_fixedsize.h>
#include <random>
#include <cstdio>
#include <cmath>
class GetRand
{
public:
    GetRand(double min,double max):U(min,max){}
    bool operator()(double* val,size_t n,size_t N)
    {
        *val=U(randsource);
        return 1;
    }
private:
    std::mt19937 randsource;
    std::uniform_real_distribution<double> U;
};
int MAIN(int argc,charsys_t* argv[])
{
    Herbs::ArrayFixedsize<double> vals(1024*1024*128,GetRand(-4,4));
    const size_t N=16;
    auto n=N;
    while(n)
    {
        double start=0;
        auto ptr=vals.begin();
        {
            Herbs::TicToc timestamp(start);
            while(ptr!=vals.end())
            {
                pow(2,*ptr);
                ++ptr;
            }
        }
        // I have set cpu-freq to 1.6 GHz using cpufreq-set
        printf("%.15g\t",1.6e9*start/vals.length());
        --n;
    }
    return 0;
}
Running this program prints about 2.25 cycles per iteration. This seems very low for pow (callgrind reported its implementation as __ieee754_pow).
The benchmark loop in assembly looks like this when compiling for GNU/Linux on x86-64:
    call _ZN5Herbs6TicTocC1ERd@PLT
    movq %r14, %rbx
    .p2align 4,,10
    .p2align 3
.L28:
    vmovsd (%rbx), %xmm1
    vucomisd .LC6(%rip), %xmm1
    jb .L25
    vmovsd .LC7(%rip), %xmm0
    call pow@PLT
.L25:
    addq $8, %rbx
    cmpq %r12, %rbx
    jne .L28
    movq %rbp, %rdi
    call _ZN5Herbs6TicTocD1Ev@PLT
At least pow is called. Can I trust the output, or is there some black magic that eliminates things?
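Since the result of pow is never used in the loop above, and the assembly contains a conditional jump (jb .L25) around the call, one possible cross-check is to make every result observable and time that instead. The following is only a minimal sketch under some assumptions: it uses std::chrono::steady_clock and std::vector in place of Herbs::TicToc and Herbs::ArrayFixedsize, and the names Exp2Table, Timing, time_calls and make_inputs are hypothetical helpers invented for illustration. It also includes a rough version of the look-up table with linear interpolation mentioned at the top, so both variants can be timed the same way.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical look-up table for 2^x on [lo, hi], linearly interpolated
// between n_intervals + 1 equally spaced sample points.
class Exp2Table
{
public:
    Exp2Table(double lo, double hi, std::size_t n_intervals)
        : m_lo(lo), m_scale(n_intervals / (hi - lo)), m_table(n_intervals + 1)
    {
        assert(hi > lo && n_intervals >= 1);
        for (std::size_t k = 0; k <= n_intervals; ++k)
            m_table[k] = std::exp2(lo + k / m_scale);
    }

    // Precondition: lo <= x <= hi (not checked, to keep the hot path short).
    double operator()(double x) const
    {
        const double pos = (x - m_lo) * m_scale;
        std::size_t k = static_cast<std::size_t>(pos);
        if (k + 1 >= m_table.size())   // x == hi: use the last interval
            k = m_table.size() - 2;
        const double frac = pos - static_cast<double>(k);
        return m_table[k] + frac * (m_table[k + 1] - m_table[k]);
    }

private:
    double m_lo;
    double m_scale;                    // intervals per unit of x
    std::vector<double> m_table;
};

struct Timing
{
    double seconds_per_call;
    double checksum;                   // sum of all results
};

// Times f over all inputs (xs must not be empty). Every result is added to
// a checksum that is returned, so the results are actually needed, provided
// the caller prints or otherwise uses the checksum.
template<class F>
Timing time_calls(const std::vector<double>& xs, const F& f)
{
    double sum = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (double x : xs)
        sum += f(x);
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double> dt = t1 - t0;
    return {dt.count() / static_cast<double>(xs.size()), sum};
}

// Uniform random inputs in [lo, hi) with a fixed seed for repeatability.
std::vector<double> make_inputs(std::size_t n, double lo, double hi)
{
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(lo, hi);
    std::vector<double> xs(n);
    for (double& x : xs)
        x = dist(gen);
    return xs;
}

// Example use inside main():
//   auto xs = make_inputs(1024 * 1024, -4.0, 4.0);
//   auto a = time_calls(xs, [](double x) { return std::pow(2.0, x); });
//   auto b = time_calls(xs, Exp2Table(-4.0, 4.0, 1024));
//   std::printf("%g %g\n", 1.6e9 * a.seconds_per_call, a.checksum);
//   std::printf("%g %g\n", 1.6e9 * b.seconds_per_call, b.checksum);
```

Printing the checksum matters: if it is ignored, the compiler is again free to drop the work. The added summation adds a little cost per iteration, and this sketch does not rule out every possible compiler transformation, so inspecting the generated assembly is still worthwhile.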