I have some (Fortran) code which accumulates data into an array, essentially doing this:
    complex, dimension(4000) :: a, b
    complex :: c
    [...]
    a = 0.
    do i = 1, 20000
      b = foo(...)
      c = bar(...)
      a = a + b*c
    end do
Using callgrind, I learned that most of my program's effort goes into executing the line

    a = a + b*c
so I'm interested in whether I can do anything to speed it up. As a starting point, I tried the BLAS library optimized for my system, replacing this line with

    call caxpy(4000,c,b,1,a,1)
Callgrind reports that this reduces the 'Ir count' (instructions executed) for the entire program by about 40%. However, the execution time as measured by 'time' increases by around 20%.
I expected run time to be roughly proportional to the number of instructions executed, so the two measures ought to give comparable results ('time' reports 99% CPU usage). What am I missing here?
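For reference, here is a minimal sketch of how the two variants could be timed in isolation with wall-clock measurements instead of instruction counts. It is not my actual program. The constant values for b and c are placeholders standing in for foo(...) and bar(...), the program name is made up, and it assumes a BLAS library is available at link time (for example `gfortran -O2 axpy_timing.f90 -lblas`).

```fortran
program axpy_timing
  implicit none
  integer, parameter :: n = 4000, niter = 20000
  complex, dimension(n) :: a, b
  complex :: c
  integer :: i
  integer(kind=8) :: t0, t1, rate
  external :: caxpy   ! single-precision complex AXPY from BLAS

  ! Placeholder data: the real foo(...) and bar(...) are not shown here.
  b = (1.0, 0.5)
  c = (0.25, -0.125)

  ! Variant 1: array expression, as in the original loop body.
  a = 0.
  call system_clock(t0, rate)
  do i = 1, niter
    a = a + b*c
  end do
  call system_clock(t1)
  print *, 'array expression:', real(t1 - t0)/real(rate), 's'
  print *, 'checksum:', sum(a)   ! keeps the loop from being optimized away

  ! Variant 2: the same update through BLAS, a = a + c*b.
  a = 0.
  call system_clock(t0)
  do i = 1, niter
    call caxpy(n, c, b, 1, a, 1)
  end do
  call system_clock(t1)
  print *, 'caxpy:           ', real(t1 - t0)/real(rate), 's'
  print *, 'checksum:', sum(a)
end program axpy_timing
```

Both checksums should agree, which confirms the two variants compute the same update. The timings only cover the accumulation itself, so they say nothing about the cost of foo and bar.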