I'm working with algorithms that use a large number of maths functions, and we recently ported the code from a Solaris platform to g++ 4.8.2 on an Ubuntu system.

Surprisingly, some of the algorithms were taking much more time than before. The reason is that std::tan() takes roughly twice as long as computing std::sin()/std::cos().

Replacing the tan with sin/cos considerably reduced the computing time, for the same results. I wonder why there is such a difference. Is it because of the standard library implementation? Shouldn't the tan function be more efficient?

I wrote a program to time the two approaches:

#include <cmath>
#include <iostream>
#include <chrono>

int main(int argc, char * argv[])
{
    using namespace std::chrono;

    auto start_tan = system_clock::now();

    for (int i = 0; i < 50000; ++i)
    {
        const double & a = static_cast<double>(i);
        const double & b = std::tan(a);
    }

    auto end_tan = system_clock::now();
    auto elapsed_time_tan = end_tan - start_tan;
    std::cout << "tan : ";
    std::cout << elapsed_time_tan.count() << std::endl;

    auto start_sincos = system_clock::now();

    for (int i =  0; i < 50000; ++i)
    {
        const double & a = static_cast<double>(i);
        const double & b = std::sin(a) / std::cos(a);
    }

    auto end_sincos = system_clock::now();
    auto elapsed_time_sincos = end_sincos - start_sincos;
    std::cout << "sincos : " << elapsed_time_sincos.count() << std::endl;

}

And indeed, the output shows the following times without optimisation:

tan : 8319960
sincos : 4736988

And with optimisation (-O2) :

tan : 294
sincos : 120

Does anyone have an idea about this behaviour?

EDIT

I modified the program according to @Basile Starynkevitch's answer:

#include <cmath>
#include <cstdlib>
#include <iostream>
#include <chrono>

int main(int argc, char * argv[])
{
    using namespace std::chrono;

    if (argc != 2)
    {
        std::cout << "Need exactly one argument: the number of iterations." << std::endl;
        return 1;
    }

    const int nb_iter = std::atoi(argv[1]);
    std::cout << "Number of iterations programmed: " << nb_iter << std::endl;

    double tan_sum = 0.0;
    auto start_tan = system_clock::now();
    for (int i = 0; i < nb_iter; ++i)
    {
        const double a = static_cast<double>(i);
        const double b = std::tan(a);
        tan_sum += b;
    }
    auto end_tan = system_clock::now();
    auto elapsed_time_tan = end_tan - start_tan;
    std::cout << "tan : " << elapsed_time_tan.count() << std::endl;
    std::cout << "tan sum : " << tan_sum << std::endl;

    double sincos_sum = 0.0;
    auto start_sincos = system_clock::now();
    for (int i = 0; i < nb_iter; ++i)
    {
        const double a = static_cast<double>(i);
        const double b = std::sin(a) / std::cos(a);
        sincos_sum += b;
    }
    auto end_sincos = system_clock::now();
    auto elapsed_time_sincos = end_sincos - start_sincos;
    std::cout << "sincos : " << elapsed_time_sincos.count() << std::endl;
    std::cout << "sincos sum : " << sincos_sum << std::endl;
}

And now I get similar times with -O2 only:

tan : 8345021
sincos : 7838740

But the difference is still there with -O2 -mtune=native, although both are faster:

tan : 5426201
sincos : 3721938

I won't use -ffast-math because I need to keep IEEE compliance.

dkg
  • In the optimized version, the whole loop is probably being optimized out because it has no effect on the behavior of the program. – interjay Jan 06 '15 at 12:56
  • For what it's worth, [this improved version of the benchmark](http://coliru.stacked-crooked.com/a/0fe7b6e44ec703eb) gives me the same result as the OP (sin/cos is faster), even locally with the Intel Compiler. – rubenvb Jan 06 '15 at 13:22
  • Should I conclude that the processor has optimisations for specific functions? – dkg Jan 06 '15 at 13:32

2 Answers


You cannot trust non-optimized code for this.

Regarding optimization, the GCC compiler is probably throwing out the loop, since you don't do anything with the result. BTW b should not be a const double& reference but a const double.

If you want a meaningful benchmark, try storing b (or summing it). And make the number of iterations (50000) a runtime parameter (e.g. int nbiter = (argc>1)?atoi(argv[1]):1000;)

You might want to pass -O2 -ffast-math -mtune=native as optimization flags to g++ (beware that -ffast-math is not standard-compliant in the details of its optimizations).

With those flags and my changes:

double sumtan=0.0, sumsincos=0.0;
int nbiter = argc>1?atoi(argv[1]):10000;
for (int i = 0; i < nbiter; ++i)
{
    const double & a = static_cast<double>(i);
    const double  b = std::tan(a);
    sumtan += b;
}
for (int i =  0; i < nbiter; ++i)
{
    const double & a = static_cast<double>(i);
    const double  b = std::sin(a) / std::cos(a);
    sumsincos += b;
}
std::cout << "tan : "  << elapsed_time_tan.count() 
          << " sumtan=" << sumtan << std::endl;

std::cout << "sincos : " << elapsed_time_sincos.count() 
          << " sumsincos=" << sumsincos << std::endl;

compiled with GCC 4.9.2 using

 g++ -std=c++11 -O2 -Wall -ffast-math -mtune=native b.cc -o b.bin

I'm getting quite similar timings:

  % ./b.bin 1000000
  tan :    77158579 sumtan=    -3.42432e+06
  sincos : 70219657 sumsincos= -3.42432e+06

This is on a four-year-old desktop (Intel(R) Xeon(R) CPU X3430 @ 2.40GHz).

Compiling with clang++ 3.5.0 instead, I get:

tan :     78098229 sumtan=    -3.42432e+06
sincos : 106817614 sumsincos= -3.42432e+06

PS. Timing (and relative performance) is different with -O3. Some processors also have machine instructions for sin, cos and tan, but they might not be used (because the compiler or the libm knows that they are slower than a software routine). GCC has builtins for these functions.

Basile Starynkevitch
  • Or, make `b` `volatile`. – rubenvb Jan 06 '15 at 12:57
  • @rubenvb Couldn't this mess with the results somewhat? `volatile` has rather strict semantics with regards to reading & ordering. – Angew is no longer proud of SO Jan 06 '15 at 13:15
  • @Angew well, it forces the result to be written to RAM, but that overhead will be present for both cases. But yes, this might inhibit some vectorization of operations most probably. – rubenvb Jan 06 '15 at 13:18
  • _"BTW `b` should not be a `const double&` reference but a `double`."_ No, it should be a `const double`. You did get that right in your code examples, which is nice. – Lightness Races in Orbit Jan 06 '15 at 13:23
  • Easiest solution is to sum results to a temporary, and copy the temporary once to a `volatile`. This grants the optimizer a reasonable degree of freedom to reorder and vectorize things, but since the result is needed it can't discard the actual calls. – MSalters Jan 06 '15 at 14:12

Read the Intel developer's manual: the trig functions are not as accurate as the other maths functions on x86, so sin/cos will not give the same result as tan, which is something you should bear in mind if IEEE compliance is your reason for asking this.

As for the speed-up, sin and cos can be obtained from the same instruction, as long as the compiler is not brain-dead. Computing tan to the same accuracy is more work. The compiler therefore cannot substitute sin/cos for tan without breaking the standard.

Depending on whether those last decimal places matter to you, you may want to look at this: What is the error of trigonometric instructions on x86?

camelccc
  • Thanks for the precisions, I will definitely look at it. But as I am on a 64-bit architecture, does it matter? – dkg Jan 06 '15 at 16:07
  • Address bits of your architecture are irrelevant to the size of your FPU operands. There is another rabbit hole of 80-bit and 64-bit operands here, since the trig functions use x87 instructions. – camelccc Jan 06 '15 at 20:06