3

I'm doing some benchmarking and found that fabsf() is often about 10x slower than fabs(). So I disassembled both, and it turns out the double version uses the fabs instruction while the float version does not. Can this be improved? The following is faster, but not by much, and I'm afraid it may not work; it's a little too low-level:

float mabs(float i)
{
    (*reinterpret_cast<MUINT32*>(&i)) &= 0x7fffffff;
    return i;
}

Edit: Sorry, forgot about the compiler - I still use the good old VS2005, no special libs.

Vojtěch Melda Meluzín
    You should mention which compiler/library you are using because this is a library implementation detail. – Dale Wilson May 05 '14 at 14:27
  • If `fabs` is really that much faster, maybe you could benchmark `(float)fabs((double)floatval)` and see if it's better than `fabsf`. – eerorika May 05 '14 at 14:32
  • I actually tried benchmarking the fabs((double)x), but it was pretty slow really. It's really interesting :). – Vojtěch Melda Meluzín May 05 '14 at 15:47
  • Ha! Just tried again and fabs(double) is indeed 2x faster! – Vojtěch Melda Meluzín May 05 '14 at 15:54
  • 5
    "good old VS2005" - no it's not good. especially in 2014. – Abyx May 05 '14 at 16:37
  • Frankly I'm looking at the documentation for fabs in C++ and it can actually be applied to double, float etc. (method overloading). http://www.cplusplus.com/reference/cmath/fabs/. Do you have to stick to pure C or you can add some C++ into the mix? Btw "I'm afraid it may not work" makes me wonder what exactly are you trying to achieve here that you have such concerns. Providing this information may prove to be quite beneficial for you. PS: totally agree with @Abyx on the VS2005 issue. – rbaleksandar Nov 22 '14 at 13:12

3 Answers

4

You can easily test the different possibilities using the code below. It pits your bit-fiddling against a naive template abs, std::abs, and ::fabsf. Somewhat surprisingly, the naive template abs wins; I'd have expected std::abs to be equally fast. Note that -O3 actually makes things slower (at least on coliru).

Coliru's host system shows these timings:

random number generation: 4240 ms
naive template abs: 190 ms
ugly bitfiddling abs: 241 ms
std::abs: 204 ms
::fabsf: 202 ms

And these timings for a Virtualbox VM running Arch with GCC 4.9 on a Core i7:

random number generation: 1453 ms
naive template abs: 73 ms
ugly bitfiddling abs: 97 ms
std::abs: 57 ms
::fabsf: 80 ms

And these timings on MSVS2013 (Windows 7 x64):

random number generation: 671 ms
naive template abs: 59 ms
ugly bitfiddling abs: 129 ms
std::abs: 109 ms
::fabsf: 109 ms

If I haven't made some blatantly obvious mistake in this benchmark code (don't shoot me over it; I wrote it up in about two minutes), I'd say just use std::abs, or the template version if that turns out to be slightly faster for you.


The code:

#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <random>
#include <vector>

#include <math.h>

using Clock = std::chrono::high_resolution_clock;
using milliseconds = std::chrono::milliseconds;

template<typename T>
T abs_template(T t)
{
  return t>0 ? t : -t;
}

float abs_ugly(float f)
{
  // Note: this violates strict aliasing (undefined behavior); kept as-is
  // to mirror the bit trick from the question.
  (*reinterpret_cast<std::uint32_t*>(&f)) &= 0x7fffffff;
  return f;
}

int main()
{
  std::random_device rd;
  std::mt19937 mersenne(rd());
  // Note: lowest() is already negative; negating it would make the range empty.
  std::uniform_real_distribution<float> dist(std::numeric_limits<float>::lowest(), std::numeric_limits<float>::max());

  std::vector<float> v(100000000);

  Clock::time_point t0 = Clock::now();

  std::generate(std::begin(v), std::end(v), [&dist, &mersenne]() { return dist(mersenne); });

  Clock::time_point trand = Clock::now();

  volatile float temp;
  for (float f : v)
    temp = abs_template(f);

  Clock::time_point ttemplate = Clock::now();

  for (float f : v)
    temp = abs_ugly(f);

  Clock::time_point tugly = Clock::now();

  for (float f : v)
    temp = std::abs(f);

  Clock::time_point tstd = Clock::now();

  for (float f : v)
    temp = ::fabsf(f);

  Clock::time_point tfabsf = Clock::now();

  milliseconds random_time = std::chrono::duration_cast<milliseconds>(trand - t0);
  milliseconds template_time = std::chrono::duration_cast<milliseconds>(ttemplate - trand);
  milliseconds ugly_time = std::chrono::duration_cast<milliseconds>(tugly - ttemplate);
  milliseconds std_time = std::chrono::duration_cast<milliseconds>(tstd - tugly);
  milliseconds c_time = std::chrono::duration_cast<milliseconds>(tfabsf - tstd);
  std::cout << "random number generation: " << random_time.count() << " ms\n"
    << "naive template abs: " << template_time.count() << " ms\n"
    << "ugly bitfiddling abs: " << ugly_time.count() << " ms\n"
    << "std::abs: " << std_time.count() << " ms\n"
    << "::fabsf: " << c_time.count() << " ms\n";
}

Oh, and to answer your actual question: if the compiler can't generate more efficient code, I doubt there is a faster way short of micro-optimized assembly, especially for an elementary operation such as this.

rubenvb
  • I know that if you used a C style array instead of a vector VS2013 should generate vector code, so that should probably be tested as well (if only to see how the SSE version compares) – Mgetz May 05 '14 at 15:47
  • Actually I just tried the naive code and it wasn't a win really. Right now this is the slowest one, fastest is fabs with retyping to double and back. – Vojtěch Melda Meluzín May 05 '14 at 15:55
  • Actually I wanted to measure really just the single call, no vectors, I have IPP for that. – Vojtěch Melda Meluzín May 05 '14 at 15:56
  • @Mgetz that would mean VS2013 sucks at optimization. Note I can get VS to reach the naive template abs speed when I enable `/fp:fast` though I doubt that is what you'd want. – rubenvb May 06 '14 at 07:52
3

There are many things at play here. First off, the x87 co-processor is deprecated in favor of SSE/AVX, so I'm surprised to read that your compiler still uses the fabs instruction. It's quite possible that the others who posted benchmark answers on this question use a platform that supports SSE. Your results might be wildly different.

I'm not sure why your compiler emits different logic for fabs and fabsf: it's just as easy to load a float onto the x87 stack and use the fabs instruction on it. The problem with reproducing this yourself, without compiler support, is that you can't integrate the operation into the compiler's normal optimizing pipeline. If you say "load this float, use the fabs instruction, store this float back to memory", the compiler will do exactly that: it may store to memory a float that was already ready to be processed, load it back, apply fabs, store it back to memory, and load it again onto the x87 stack to resume the normal, optimizable pipeline. That would be four wasted load/store operations, because all it needed to do was fabs.

In short, you are unlikely to beat integrated compiler support for floating-point operations. If you don't have this support, inline assembler might just make things even slower than they presumably already are. The fastest thing for you to do might even be to use the fabs function instead of the fabsf function on your floats.

For reference, modern compilers on modern platforms use the SSE instructions andps (for floats) and andpd (for doubles) to AND out the sign bit, much like you're doing yourself, but without the language-semantics issues. Both are equally fast. Modern compilers can also detect patterns like x < 0 ? -x : x and produce the optimal andps/andpd instruction without needing a compiler intrinsic.

zneak
  • Thank you for exhaustive answer. Still on VC2005 here, but I'm thinking would it make sense to upgrade? I mean what are the performance improvements of the new compilers? And is it even safe to generate code with SSE when you assume some computers using your software may not even support it? – Vojtěch Melda Meluzín May 05 '14 at 15:58
  • @VojtěchMeldaMeluzín, `andps` and `andpd` are from SSE2, which was introduced on Pentium IVs, so I'd say that it's pretty safe to use now (still know a lot of people using P3s?). As for performance, compiler technology has improved a lot in the last 9 years and you're likely to find improvements in many areas beyond floating-point handling. – zneak May 05 '14 at 16:01
  • @VojtěchMeldaMeluzín [Steam has good stats on who has what](http://store.steampowered.com/hwsurvey/) SSE, SSE2, and even SSE3 are pretty common at this point 90%+ – Mgetz May 05 '14 at 16:03
  • @Mgetz Steam offers only a subset of PCs, which may or may not be the target audience. In a corporate setting, Steam stats are worthless. – rubenvb May 06 '14 at 07:58
  • @rubenvb any poll of hardware is inherently biased, however in my experience the number of corporate environments still running anything older than a P4 is close to zero. – Mgetz May 06 '14 at 11:12
  • Both true, but if SSE2 is there since P4, then I guess I can safely take the risk. – Vojtěch Melda Meluzín May 06 '14 at 16:06
2

Did you try the std::abs overload for float? That would be the canonical C++ way.

Also, as an aside, I should note that your bit-modifying version violates the strict-aliasing rules (in addition to relying on the more fundamental assumption that int and float have the same size), and as such is undefined behavior.

Mark B
  • [Documentation for `std::abs`](http://en.cppreference.com/w/cpp/numeric/math/fabs) – Mgetz May 05 '14 at 14:23
  • Well, that would presumably do what `fabsf` in C does. Which OP wants to avoid. –  May 05 '14 at 14:23
  • @delnan actually it wouldn't because C does not accept overloads based on parameters, so by default it cannot be the C function. – Mgetz May 05 '14 at 14:24
  • 3
    @Mgetz I did not say it would be the C function. I said it would *do* the same thing. That is, its *implementation* probably delegates to `fabsf` or is identical (`__builtin_something`). –  May 05 '14 at 14:27
  • 1
    @delnan possible that is an implementation detail though, it could also use a template – Mgetz May 05 '14 at 14:36