17

I am currently writing some GLSL-like vector math classes in C++, and I just implemented an abs() function like this:

template<class T>
static inline T abs(T _a)
{
    return _a < 0 ? -_a : _a;
}

I compared its speed to the default C++ abs from math.h like this:

clock_t begin = clock();
for(int i=0; i<10000000; ++i)
{
    float a = abs(-1.25);
};

clock_t end = clock();
unsigned long time1 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));

begin = clock();
for(int i=0; i<10000000; ++i)
{
    float a  = myMath::abs(-1.25);
};
end = clock();
unsigned long time2 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));

std::cout<<time1<<std::endl;
std::cout<<time2<<std::endl;

Now the default abs takes about 25 ms while mine takes 60 ms. I guess there is some low-level optimisation going on. Does anybody know how the math.h abs works internally? The performance difference is nothing dramatic, but I am just curious!

moka
  • Perhaps you could show the assembly source generated by the compiler for the two loops and compare that. Otherwise, we're just guessing. – Greg Hewgill May 20 '10 at 21:47
  • `abs` takes an integer and returns an integer, so using `abs` is just wrong in your comparison. Use `fabs` from `<cmath>` instead. – kennytm May 20 '10 at 21:50
  • And please, tell us what is your compiler/environment. – jpalecek May 20 '10 at 21:51
  • I am using GCC 4.2 in Xcode on an Intel Mac i7 – moka May 20 '10 at 21:57
  • Okay, I made sure that I used std::abs now, which should be overloaded. To my surprise, that one is much slower than mine and took 146 ms – moka May 20 '10 at 22:04
  • @moka: now that's good! Your code is better than the implementation, hooray :) – jpalecek May 20 '10 at 22:34
  • Ah, I am an idiot, I was in debug mode; now that I am in release mode, both functions are identical in terms of speed :) – moka May 20 '10 at 22:51

8 Answers

20

Since they are the implementation, they are free to make as many assumptions as they want. They know the format of the double and can play tricks with that instead.

Likely (as in almost not even a question), your double is the binary64 format. This means the sign has its own bit, and an absolute value is merely clearing that bit. For example, as a specialization, a compiler implementer may do the following:

template <>
double abs<double>(const double x)
{
    // breaks strict aliasing, but the compiler writer knows this behavior for the platform
    std::uint64_t i = reinterpret_cast<const std::uint64_t&>(x); // std::uint64_t from <cstdint>
    i &= 0x7FFFFFFFFFFFFFFFULL; // clear the sign bit

    return reinterpret_cast<const double&>(i);
}

This removes branching and may run faster.
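
For comparison, ordinary user code that wants the same branch-free effect without the aliasing violation has to move the bytes explicitly. A minimal sketch (mine, not part of the answer above), still assuming IEEE-754 binary64 doubles; the name `abs_bits` is made up:

#include <cstdint>
#include <cstring>

// Sketch only: clears the sign bit of an IEEE-754 binary64 double.
// memcpy keeps the type punning within defined behavior.
double abs_bits(double x)
{
    std::uint64_t i;
    std::memcpy(&i, &x, sizeof i);  // copy the bit pattern of the double
    i &= 0x7FFFFFFFFFFFFFFFULL;     // clear the sign bit
    std::memcpy(&x, &i, sizeof x);  // copy the masked bits back
    return x;
}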

GManNickG
  • Wouldn't that be a blatant example of undefined behaviour? – jpalecek May 20 '10 at 21:54
  • @jpalecek: I don't see any UB in there. `reinterpret_cast` has implementation-defined behavior. (Hence my point of, "they can make whatever implementation-specific assumptions they want, since they are the implementation.") Non-library coders cannot. Once you write this code, you are assuming certain things about your implementation. – GManNickG May 20 '10 at 21:56
  • If the double is not IEEE-754, yes. It is at the very least less portable – Yann Ramin May 20 '10 at 21:56
  • Couldn't you just return `a`? – fredoverflow May 20 '10 at 22:05
  • The point is that the implementation can write this code, but you can't. – rlbond May 20 '10 at 22:12
  • @Fred: I don't see what you mean. – GManNickG May 20 '10 at 22:19
  • @GMan: I'm not talking about `reinterpret_cast`; what I had in mind is 3.10.15 (strict aliasing violation): If a program attempts to access the stored value of an object through an lvalue of other than one of the following types, the behavior is undefined: (here comes an enumeration of cases). Accessing a `double` through an lvalue of type `uint64_t` is not mentioned in the enumeration. – jpalecek May 20 '10 at 22:32
  • @jpalecek: It may be undefined to you. But to the implementers of the library it is well defined (as the maths library is tightly coupled to the compiler). They know what the implementation is doing and as such can perform certain optimizations based on this knowledge. You, on the other hand, cannot make any assumptions, as your code is not coupled to the compiler. – Martin York May 20 '10 at 23:55
  • @jpalecek: As @Martin points out, there's a reason my entire answer starts with "Since they are the implementation, they are free to make as many assumptions as they want." – GManNickG May 21 '10 at 00:09
  • @Martin York: You seem to be missing the point - this behaviour is not "implementation-defined", it is undefined, and real-world compilers actually treat it as such. That means they can and do regularly compile code like this into a no-op, depending on optimisation level and zillion other things. Even eg. when inlined twice in the same program, it may work in one case and not the other. And there is nothing a library writer may do about it, except for writing the code correctly (which could be `((unsigned char*)&d)[7] &= 0x7f` on Intel machines). – jpalecek May 21 '10 at 00:26
  • @GMan: Maybe they can make assumptions about behaviour of the compiler above the C standard (which may or may not be wise), but you haven't made enough assumptions for your code to work. – jpalecek May 21 '10 at 00:30
  • @jpalecek: I edited my answer, is it clearer now? A compiler vendor can assume whatever they want because they are the compiler. When you compile for a specific platform, you *know* what assumptions you can and cannot make. – GManNickG May 21 '10 at 00:44
  • @GMan: No. This question is tagged c++, and if you want to make any assumptions beyond the standard, you have to write them, _specifically_. With your "compiler writer may assume anything doctrine", you may as well say that `d/0` returns the absolute value of `d`. I hope you agree this would be just BS, not a useful solution. – jpalecek May 21 '10 at 01:05
  • @jpalecek: I don't see how you fail to see the point of the question and answer. The question is "why is this implementation faster?" I'm answering with "because they can make whatever assumptions they want, here's an example." It doesn't get much simpler than that. Sure it's tagged C++ because the question is about C++. But it's also about implementing C++ compiler-wise. And yes, if `d/0` returns the absolute value on some platform, the compiler vendor can and should take advantage of that. I really don't get how hard it is to wrap your head around "implementors can assume whatever they want". – GManNickG May 21 '10 at 08:02
  • @jpalecek: Actually, I think it is you that seems to be missing the point. – Martin York May 21 '10 at 08:16
  • @moka I had a [similar question](http://stackoverflow.com/questions/39216210/why-cant-the-vs-2015-compiler-optimise-a-branch-in-an-abs-implementation-on-f) the other day. To quote IEEE 754-2008: `Comparisons shall ignore the sign of zero (so +0 = −0)`. It turns out that your suggestion and the OP's code are not equivalent: for -0.0 your version returns 0.0, while the OP's returns -0.0. I don't know which implementation is correct, but from what I saw in other libraries, I'd assume the OP's is invalid. – FirefoxMetzger Aug 30 '16 at 09:05
11

There are well-known tricks for computing the absolute value of a two's complement signed number. If the number is negative, flip all the bits and add 1, that is, xor with -1 and subtract -1. If it is positive, do nothing, that is, xor with 0 and subtract 0.

int my_abs(int x)
{
    int s = x >> 31;    // arithmetic shift: all ones if x is negative, all zeros otherwise
    return (x ^ s) - s; // negative: flip the bits and add 1; non-negative: unchanged
}
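
A quick sanity check of the trick (mine, not part of the original answer; it uses `my_abs` from above and assumes a 32-bit `int` with an arithmetic right shift, which the shift by 31 relies on):

#include <cassert>

int main()
{
    assert(my_abs(-5) == 5); // negative input: bits flipped, 1 added
    assert(my_abs(7) == 7);  // non-negative input: unchanged
    assert(my_abs(0) == 0);
    return 0;
}
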
fredoverflow
  • floating point values are not stored in two's complement, but this is a clever trick for absolute value. (For determining sign, I do the same first step but then return `s | 1`; -1 = negative, +1 = positive) – dash-tom-bang May 20 '10 at 22:13
  • @jpal Fair enough, but where exactly does moka state he is only interested in floating point types? Maybe `template` should be replaced with `template` then, just to be clear on this? – fredoverflow May 21 '10 at 06:17
9

What are your compiler and settings? I'm sure MS and GCC implement "intrinsic functions" for many math and string operations.

The following line:

printf("%.3f", abs(1.25));

falls into the following "fabs" code path (in msvcr90d.dll):

004113DE  sub         esp,8 
004113E1  fld         qword ptr [__real@3ff4000000000000 (415748h)] 
004113E7  fstp        qword ptr [esp] 
004113EA  call        abs (4110FFh) 

abs calls the C runtime 'fabs' implementation in MSVCR90D (rather large):

102F5730  mov         edi,edi 
102F5732  push        ebp  
102F5733  mov         ebp,esp 
102F5735  sub         esp,14h 
102F5738  fldz             
102F573A  fstp        qword ptr [result] 
102F573D  push        0FFFFh 
102F5742  push        133Fh 
102F5747  call        _ctrlfp (102F6140h) 
102F574C  add         esp,8 
102F574F  mov         dword ptr [savedcw],eax 
102F5752  movzx       eax,word ptr [ebp+0Eh] 
102F5756  and         eax,7FF0h 
102F575B  cmp         eax,7FF0h 
102F5760  jne         fabs+0D2h (102F5802h) 
102F5766  sub         esp,8 
102F5769  fld         qword ptr [x] 
102F576C  fstp        qword ptr [esp] 
102F576F  call        _sptype (102F9710h) 
102F5774  add         esp,8 
102F5777  mov         dword ptr [ebp-14h],eax 
102F577A  cmp         dword ptr [ebp-14h],1 
102F577E  je          fabs+5Eh (102F578Eh) 
102F5780  cmp         dword ptr [ebp-14h],2 
102F5784  je          fabs+77h (102F57A7h) 
102F5786  cmp         dword ptr [ebp-14h],3 
102F578A  je          fabs+8Fh (102F57BFh) 
102F578C  jmp         fabs+0A8h (102F57D8h) 
102F578E  push        0FFFFh 
102F5793  mov         ecx,dword ptr [savedcw] 
102F5796  push        ecx  
102F5797  call        _ctrlfp (102F6140h) 
102F579C  add         esp,8 
102F579F  fld         qword ptr [x] 
102F57A2  jmp         fabs+0F8h (102F5828h) 
102F57A7  push        0FFFFh 
102F57AC  mov         edx,dword ptr [savedcw] 
102F57AF  push        edx  
102F57B0  call        _ctrlfp (102F6140h) 
102F57B5  add         esp,8 
102F57B8  fld         qword ptr [x] 
102F57BB  fchs             
102F57BD  jmp         fabs+0F8h (102F5828h) 
102F57BF  mov         eax,dword ptr [savedcw] 
102F57C2  push        eax  
102F57C3  sub         esp,8 
102F57C6  fld         qword ptr [x] 
102F57C9  fstp        qword ptr [esp] 
102F57CC  push        15h  
102F57CE  call        _handle_qnan1 (102F98C0h) 
102F57D3  add         esp,10h 
102F57D6  jmp         fabs+0F8h (102F5828h) 
102F57D8  mov         ecx,dword ptr [savedcw] 
102F57DB  push        ecx  
102F57DC  fld         qword ptr [x] 
102F57DF  fadd        qword ptr [__real@3ff0000000000000 (1022CF68h)] 
102F57E5  sub         esp,8 
102F57E8  fstp        qword ptr [esp] 
102F57EB  sub         esp,8 
102F57EE  fld         qword ptr [x] 
102F57F1  fstp        qword ptr [esp] 
102F57F4  push        15h  
102F57F6  push        8    
102F57F8  call        _except1 (102F99B0h) 
102F57FD  add         esp,1Ch 
102F5800  jmp         fabs+0F8h (102F5828h) 
102F5802  mov         edx,dword ptr [ebp+0Ch] 
102F5805  and         edx,7FFFFFFFh 
102F580B  mov         dword ptr [ebp-0Ch],edx 
102F580E  mov         eax,dword ptr [x] 
102F5811  mov         dword ptr [result],eax 
102F5814  push        0FFFFh 
102F5819  mov         ecx,dword ptr [savedcw] 
102F581C  push        ecx  
102F581D  call        _ctrlfp (102F6140h) 
102F5822  add         esp,8 
102F5825  fld         qword ptr [result] 
102F5828  mov         esp,ebp 
102F582A  pop         ebp  
102F582B  ret   

In release mode, the FPU FABS instruction is used instead (it takes only 1 clock cycle on any FPU from the Pentium onwards); the disassembly output is:

00401006  fld         qword ptr [__real@3ff4000000000000 (402100h)] 
0040100C  sub         esp,8 
0040100F  fabs             
00401011  fstp        qword ptr [esp] 
00401014  push        offset string "%.3f" (4020F4h) 
00401019  call        dword ptr [__imp__printf (4020A0h)] 
Hernán
4

It probably just uses a bitmask to set the sign bit to 0.

sepp2k
3

There can be several things:

  • are you sure the first call uses std::abs? It could just as well use the integer abs from C (either call std::abs explicitly, or have using std::abs;)

  • the compiler might have an intrinsic implementation of some float functions (e.g. compiling them directly into FPU instructions)

However, I'm surprised the compiler doesn't eliminate the loop altogether, since you don't do anything with an observable effect inside the loop, and at least in the case of abs, the compiler should know there are no side effects.
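
As a sketch of that last point (mine, not from the original answer): if the OP's loop is changed so that its result is actually used, the compiler can no longer drop the loop entirely, e.g. reusing the OP's myMath::abs:

float sum = 0.0f;
for (int i = 0; i < 10000000; ++i)
{
    sum += myMath::abs(-1.25f); // accumulate so the result is used
}
std::cout << sum << std::endl;  // printing sum keeps the loop from being eliminated

(The constant argument can of course still be folded at compile time; that part is covered by the other answers.)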

jpalecek
3

Your version of abs is inlined and can be computed once, and the compiler can trivially see that the returned value isn't going to change, so it doesn't even need to call the function.

You really need to look at the generated assembly code (set a breakpoint and open the "large" debugger view; the disassembly will be on the bottom left, if memory serves), and then you can see what's going on.

You can find documentation on your processor online without too much trouble, it'll tell you what all of the instructions are so you can figure out what's happening. Alternatively, paste it here and we'll tell you. ;)

dash-tom-bang
1

Probably the library version of abs is an intrinsic function whose behavior is exactly known by the compiler, which can even compute the value at compile time (since in your case it's known) and optimize the call away. You should try your benchmark with a value known only at runtime (provided by the user or obtained with rand() before the two loops).
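
For example, a minimal sketch of that suggestion (mine, not from the original answer; the variable names are made up, and the result is printed so the loop itself is not discarded either):

#include <cmath>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    std::srand(static_cast<unsigned>(std::time(0)));
    float value = -1.25f * static_cast<float>(std::rand() % 100 + 1); // known only at run time

    std::clock_t begin = std::clock();
    float sink = 0.0f;
    for (int i = 0; i < 10000000; ++i)
    {
        sink += std::fabs(value); // the argument is no longer a compile-time constant
    }
    std::clock_t end = std::clock();

    std::cout << sink << " " << (end - begin) * 1000.0 / CLOCKS_PER_SEC << " ms" << std::endl;
    return 0;
}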

If there's still a difference, it may be because the library abs is written directly in hand-forged assembly with magic tricks, so it could be a little faster than the generated one.

Matteo Italia
0

The library abs function operates on integers, while you are obviously testing floats. This means that a call to abs with a float argument involves a conversion from float to int (which may be a no-op, since you are using a constant and the compiler may do it at compile time), then an INTEGER abs operation, and then a conversion from int back to float. Your templated function works on floats throughout, and this is probably what makes the difference.
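
To make that conversion chain concrete, here is a tiny sketch (mine, not part of the original answer) that spells out the implicit conversions with explicit casts:

#include <cstdio>
#include <cstdlib> // std::abs(int)
#include <cmath>   // std::fabs

int main()
{
    float f = -1.25f;

    // what the integer abs effectively does: float -> int, integer abs, int -> float
    float via_int = static_cast<float>(std::abs(static_cast<int>(f)));
    std::printf("%f\n", via_int);      // prints 1.000000: the fraction is lost

    // a floating-point absolute value keeps the fraction
    std::printf("%f\n", std::fabs(f)); // prints 1.250000
    return 0;
}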

Tomek
  • `std::abs` exists also for floats. See https://en.cppreference.com/w/cpp/numeric/math/fabs – VLL Jan 10 '20 at 13:19