4

Spoiler: The test program did nothing with the results, so the bodies of the loops were removed by the optimizing compiler, and looping over nothing takes about the same time no matter what was supposedly being computed... Anyway, I'll let the question and answers remain in case someone (me?) makes the same mistake (again...).

Original post: I wanted to test how much slower it is to compute a square root compared to a simple addition, and wrote the little program below. The result I got is that both take about the same amount of time, 0.3 seconds in this case. What is going on here?

program sqtest
implicit none
real r, s
integer i,j,n, sq, t

sq=11          ! set sq=1 to time the square-root branch instead of the addition
n=100000000    ! number of loop iterations
r=1.11

if (sq==1) then
 do i = 1,n
  s = sqrt(float(i)*r)
 enddo
 write(*,*) "squareroot"
else
 do j = 1,n
  t = j+4
 enddo
 write(*,*) "plus"
endif


end program

Set sq=1 to use the square root. The square-root loop also performs a multiplication and a conversion from int to float.

Jonatan Öström
  • Yes, that can be the case. If I write to a file it takes a lot longer, but then I/O could be the limiting factor. I guess I'll have to make a more carefully thought-out test. I use gfortran; my laptop and OS are 64-bit with 8 GB of RAM. I don't know the exact answer to your question, though. – Jonatan Öström Jun 24 '16 at 11:58
  • When you do such tests, you should check the assembly (using `objdump -d` for example) to be sure that the compiler produced the code you expect. In this case, I believe the compiler removed the do loops because the results are not used. You could, for instance, do `s = s+sqrt(float(i)*r)` and print the value after the loop to avoid the code removal. Also, the conversion from int to float is expensive, so you should probably do `i_float = 0. ; do i=1,n ; i_float = i_float + 1. ; ...` to avoid the conversion. – Anthony Scemama Jun 24 '16 at 13:52
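A minimal sketch of the fix suggested in the comment above (variable names here are hypothetical): accumulate the result and print it after the loop so the compiler cannot discard the work.

program sqtest_fixed
implicit none
real r, s, x
integer i, n

n = 100000000
r = 1.11
s = 0.
x = 0.

do i = 1, n
   x = x + 1.          ! keep a float counter to avoid the int-to-float conversion
   s = s + sqrt(x*r)   ! accumulate so the loop has an observable result
enddo

write(*,*) "sum of square roots:", s   ! using s prevents dead-code elimination

end program

One can then check with objdump -d, as suggested, that a sqrt instruction is actually emitted for the loop.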

3 Answers

3

There are many things to consider when doing such tests. You have to clearly define what you are comparing in the first place. For such a simple test, you should also deactivate optimization; most major compilers accept the option -O0 to do so. Otherwise, the compiler will notice that you are not doing anything with the computed value and will not even run your loop, because it is useless.

To cut it short, I modified your program a little to get this:

program sqtest
implicit none
real r0, r1, r2, s
integer i,n
real :: start, finish


    n=10**9
    call random_number(r0)
    call random_number(r1)
    call random_number(r2)


    call cpu_time(start)
    do i = 1,n
        s = sqrt(r0)
    enddo
    call cpu_time(finish)
    print '("SQRT:      Time = ",f6.3," seconds.")',finish-start

    call cpu_time(start)
    do i = 1,n
        s = r1+r2
    enddo
    call cpu_time(finish)
    print '("Addtition: Time = ",f6.3," seconds.")',finish-start

end program

And it gives me the following results on my system:

ifort 13, n = 10^8
SQRT:      Time =  0.378 seconds
Addition:  Time =  0.202 seconds

ifort 13, n = 10^9
SQRT:      Time =  3.460 seconds
Addition:  Time =  1.857 seconds

gfortran (GCC) 4.9, n = 10^8
SQRT:      Time =  0.385 seconds
Addition:  Time =  0.191 seconds

gfortran (GCC) 4.9, n = 10^9
SQRT:      Time =  3.529 seconds
Addition:  Time =  1.733 seconds

pgf90 14, n = 10^8
SQRT:      Time =  0.380 seconds
Addition:  Time =  0.058 seconds

pgf90 14, n = 10^9
SQRT:      Time =  3.438 seconds
Addition:  Time =  0.520 seconds

You will note that I call cpu_time inside the code. For the numbers to be meaningful, you should run each case many times and compute the average time, or take the minimum; the minimum is closest to what your system can achieve under optimal conditions. You will also see that the result is compiler dependent: pgf90 clearly gives better results on the addition. I removed float(i)* from the square root; with it included, gfortran and pgf90 are very fast (~2.6 seconds for n = 10^9) while ifort is very slow (~7.3 seconds for n = 10^9). This means that gfortran and pgf90 somehow choose a different (faster) path there; maybe they do some optimization even though I disabled it?
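For illustration, a sketch of the repeat-and-take-the-minimum idea (the structure and names below are a suggestion, not part of the original program):

program time_min
implicit none
integer, parameter :: nrep = 10
integer :: i, k, n
real :: r0, s, start, finish, best

n = 10**8
call random_number(r0)
best = huge(best)

do k = 1, nrep                        ! repeat the measurement several times
    call cpu_time(start)
    s = 0.
    do i = 1, n
        s = s + sqrt(r0)              ! accumulate so the loop is not trivially removable
    enddo
    call cpu_time(finish)
    best = min(best, finish - start)  ! keep the fastest run
enddo

print '("SQRT: best of ",i0," runs = ",f6.3," seconds (s=",es10.3,")")', nrep, best, s

end program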

innoSPG
  • I find this very helpful and interesting! Thanks a lot! My initial query was about code design and what the penalty is for calculating square roots. It seems it is indeed very small, though not zero, as my own puzzling test suggested. I have seen figures saying the operation is orders of magnitude more expensive than addition or multiplication. – Jonatan Öström Jun 24 '16 at 18:47
  • You are welcome! The figures saying the operation is orders of magnitude more expensive than addition or multiplication are quite right. What makes the difference on modern architectures is the pipeline, and here we are taking full advantage of it. Since all of the iterations can run in parallel, the difference comes down to the reciprocal throughput. If that value is for example 1 for the `add` and 2 for the `sqrt`, `sqrt` will be only 2 times slower even if it takes 100 times more cycles than `add`, because of the huge number of iterations that we have. – innoSPG Jun 24 '16 at 20:55
  • I'd posit that the relative performance is likely very processor-dependent as well. For example, the Intel Fortran Compiler (ifort) could be expected to be a lot better for Intel-created chips. – jvriesem May 01 '18 at 11:32
2

You will find the cost of a hardware square root in this document: http://www.agner.org/optimize/instruction_tables.pdf

The sqrt can be computed in different ways. In general, it is an iterative process involving only add and multiply operations. Usually sqrt is computed as sqrt(x) = x * (1/sqrt(x)), because 1/sqrt(x) can be computed faster than sqrt(x) itself.
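To make the "iterative process with only add and multiply operations" concrete, here is a small sketch of Newton-Raphson refinement of y ≈ 1/sqrt(x); actual hardware and library implementations differ in their initial guess and number of steps:

program rsqrt_demo
implicit none
real :: x, y
integer :: k

x = 2.0
y = 0.7                        ! rough initial guess for 1/sqrt(2.0)

do k = 1, 3                    ! a few Newton-Raphson steps: only adds and multiplies
    y = y * (1.5 - 0.5*x*y*y)
enddo

print *, "approx sqrt(x) =", x*y    ! sqrt(x) = x * (1/sqrt(x))
print *, "intrinsic sqrt =", sqrt(x)

end program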

If you take a Haswell CPU, the latency of a single SQRTSS instruction is 11 cycles in single precision and 16 cycles in double precision (SQRTSD); in single precision, fewer iterations are needed to converge to the desired accuracy than in double precision. On the same CPU, there is an approximate version of the sqrt (RSQRTSS) with a latency of 1 cycle, so if you ask for aggressive optimizations, your compiler may choose to generate this instruction.

If you need multiple independent square roots, as in your example, the code can be automatically vectorized by the compiler. There is a vectorized variant, VSQRTPS, with a reciprocal throughput of 14; in that case, you will get roughly an average of 14/8 = 1.75 cycles per sqrt.
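As an illustration (a sketch, not a claim about what any particular compiler emits), a loop over independent array elements like the one below is the typical pattern an auto-vectorizer can turn into packed instructions such as VSQRTPS when optimization is enabled:

program vec_sqrt
implicit none
integer, parameter :: n = 10**7
real, allocatable :: a(:), b(:)
integer :: i

allocate(a(n), b(n))
call random_number(a)

do i = 1, n              ! independent iterations: a candidate for SIMD vectorization
    b(i) = sqrt(a(i))
enddo

print *, "checksum:", sum(b)    ! use the result so the loop is kept

end program

With gfortran, for example, a vectorization report can usually be requested with -fopt-info-vec (availability depends on the compiler version).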


Anthony Scemama
  • Nice stuff, thanks! Do you mean it is vectorized for SIMD in a single core? Because as far as I know there is no automatic multithreading across cores. – Jonatan Öström Jun 24 '16 at 18:51
1

Maybe your compiler is optimizing away the code. You can test this by measuring with different orders of magnitude of n (e.g. 1e6, 1e7, 1e8, ..., 1e10) and seeing how the time scales. By the way, what is the allowed range for integer on your machine/compiler?
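A sketch of that scaling test (the structure below is a suggestion): time the same loop for several magnitudes of n and check whether the time grows proportionally; if it stays flat, the loop was almost certainly optimized away. Note that 1e10 exceeds the range of a default 32-bit integer, which is why the question about the allowed integer range matters.

program scaling_test
implicit none
integer :: i, p, n
real :: r, s, start, finish

call random_number(r)

do p = 6, 9                    ! n = 10**6 ... 10**9; 10**10 would overflow a default integer
    n = 10**p
    s = 0.
    call cpu_time(start)
    do i = 1, n
        s = s + sqrt(r)        ! accumulate so the work is observable
    enddo
    call cpu_time(finish)
    print '("n = 10**",i0,": ",f8.3," seconds (s=",es10.3,")")', p, finish-start, s
enddo

end program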

lalala
  • this is an answer, but it needs a bit of elaboration – agentp Jun 24 '16 at 11:41
  • Yes, that can be the case. If I write to a file it takes a lot longer, but then I/O could be the limiting factor. I guess I'll have to make a more carefully thought-out test. I use gfortran; my laptop and OS are 64-bit with 8 GB of RAM. I don't know the exact answer to your question, though. – Jonatan Öström Jun 24 '16 at 11:59
  • 2
    Any halfway-decent optimizing compiler would look at that program, realize it didn't do any work and remove all the code except the print statements. I think it is a waste of effort trying to time individual operations, especially if you are unfamiliar with how to write benchmarks that actually test what you are looking for. – Steve Lionel Jun 24 '16 at 13:01