Valgrind and time give opposite results

Question

I have some (Fortran) code which accumulates data into an array, essentially doing this:

complex,dimension(4000)::a,b
complex :: c
[...]
a=0.
do i=1,20000
    b=foo(...)
    c=bar(...)
    a=a+b*c
end do

Using callgrind, I learn that most of my program's effort is in executing the line

a=a+b*c

so I'm interested in whether I can do anything to accelerate this. As a starting point, I have tried using the BLAS libraries optimized for my system, and replacing this line with

call caxpy(4000,c,b,1,a,1)

Callgrind reports this reduces the 'Ir count' for the entire program by about 40%. However, the execution time as measured by 'time' increases, by around 20%.

I expected that run time should be roughly proportional to the number of instructions executed, and so the two measures ought to give comparable results (time reports 99% CPU usage). What am I missing here?

Also, which BLAS implementation are you using? And are your a, b and c arrays or scalars? A clearer piece of code would be preferable, especially a compilable one that we could try. How expensive are `foo` and `bar`? Try looking at the overhead of calling `caxpy`, and try increasing the array size. — Vladimir F Героям слава, May 28 '15 at 19:30
@VladimirF - I'm using the BLAS packaged within the AMD Core Math Library. As declared, a and b are arrays; c is scalar. Unfortunately my real code is several thousand lines long, so posting it here is impossible. foo() essentially loads data from disk; bar() is some cheap arithmetic. I will look into the caxpy overhead - but I'm still puzzled by why this would be reported by time as user (not system) costs, yet not be reflected in callgrind's instruction count. — avid, May 28 '15 at 19:56
The instruction count is not the only thing that matters. Don't forget memory accesses (stack modifications for the subroutine call) and CPU pipeline inefficiencies when jumping far. — Vladimir F Героям слава, May 28 '15 at 19:58
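
Following the request in the comments for something compilable, here is a minimal sketch that times the two variants in isolation. It is not the original program: the program name, the random stand-in data for `b`, the fixed value of `c`, and the use of `system_clock` for timing are all assumptions made for illustration. It also keeps `b` and `c` constant across iterations, whereas the real code reloads `b` from disk each time, so cache behaviour may differ from the real run.

```fortran
! Minimal timing sketch; an illustration, not the original program.
! Requires linking against a BLAS library that provides caxpy.
program axpy_bench
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 4000, niter = 20000
  complex :: a1(n), a2(n), b(n)
  complex :: c
  real :: re(n), im(n)
  integer :: i
  integer(int64) :: t0, t1, t2, rate
  external :: caxpy

  ! Stand-in data: the real foo() and bar() are not shown in the question.
  call random_number(re)
  call random_number(im)
  b = cmplx(re, im)
  c = cmplx(1.0e-4, -2.0e-4)

  a1 = 0.
  a2 = 0.

  call system_clock(t0, rate)
  do i = 1, niter
     a1 = a1 + b*c                   ! array-expression version
  end do
  call system_clock(t1)
  do i = 1, niter
     call caxpy(n, c, b, 1, a2, 1)   ! BLAS version: a2 = c*b + a2
  end do
  call system_clock(t2)

  print *, 'array expression (s):', real(t1 - t0) / real(rate)
  print *, 'caxpy            (s):', real(t2 - t1) / real(rate)
  ! Printing a value derived from both results keeps the compiler
  ! from discarding either loop.
  print *, 'max |a1 - a2|       :', maxval(abs(a1 - a2))
end program axpy_bench
```

Something like `gfortran -O2 axpy_bench.f90 -lblas` should build it, with the link flag adjusted for whichever BLAS is installed. Running the same binary under `valgrind --tool=callgrind --cache-sim=yes` adds simulated cache-miss counts next to the Ir counts, which may help show whether memory traffic rather than instruction count accounts for the difference.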
