
I am searching for a very CPU-efficient way to compute floating-point modulus one (including negative values) in C. I am using it for normalized phase reduction (wrapping), i.e. 7.6 -> 0.6, 0.2 -> 0.2, -1.1 -> 0.9, and so on.

From what I understand, fmod() and also floor() are usually very inefficient. I don't need the function to be strict, i.e. to handle NaN or Inf, since I am responsible for passing valid values.

I have always been using

m = x - (float)(int)x;
m += (float)(m<0.f);
// branchless form to add one if m is negative but not zero

which from benchmarks is in general much more efficient than fmod() or than using floor() in place of the int cast, but I was wondering whether an even more efficient way exists, perhaps based on bit manipulation...
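Wrapped up as a self-contained function (the name `wrap01` is mine, just for illustration), the method above is:

```c
/* Wrap x into [0, 1) using truncation plus a branchless correction.
   Assumes x is finite and |x| fits in an int; name wrap01 is illustrative. */
static inline float wrap01(float x)
{
    float m = x - (float)(int)x;   /* fractional part; negative when x < 0 */
    m += (float)(m < 0.f);         /* add 1 only if the remainder is negative */
    return m;
}
```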

I am coding on 64-bit Intel CPUs with GCC, but for my purposes I am using 32-bit single-precision floats.

I apologize if the same has been addressed elsewhere, but from my search I could not find anything about this specific topic.

EDIT: Sorry, I realized there was a subtle error in the originally posted code, so I had to fix it: 1 must be added if the result (m) is negative, not if x was negative.

EDIT2: Actually, after benchmarking the same function using x - floor(x) instead of x - (float)(int)x on GCC 12 with all math optimizations on, I must say the former is faster, since GCC is evidently smart enough to replace the floor() inline function with very efficient code (this, at least, is true on my Intel i7). This may not be the case with every CPU and compiler, however, since in other cases both floor() and fmod() are, in my experience, very inefficient. Therefore, my quest for a bit-manipulation or comparable trick which may turn out much faster with every compiler and architecture still applies.

Peter Cordes
elena
  • Your algorithm does not give the same result as `fmod` for negative values. Is that on purpose? – Ted Lyngmo Aug 14 '22 at 11:43
  • What about this one? https://github.com/KnightOS/libc/blob/master/src/fmod.c – Robert Harvey Aug 14 '22 at 11:46
  • How big can the value become before you reduce it? Will you always know the sign? Can we take it as a prerequisite that the `float` format is IEEE-754 binary32 (the binary “single precision” format)? Do you have to exactly reduce it to the interval [0, 1) or would [0, 1] suffice? What about [0, 2) or [-1, 2] (that is, reduce the value modulo 1 so it does not grow indefinitely, but sometimes just reduce it a bunch, not completely)? What about [−½ , +½]? – Eric Postpischil Aug 14 '22 at 11:52
  • Do you mean something like this? https://godbolt.org/z/szEWrhK9f. I use doubles because they are efficient to use; as you see, the compiler uses vectorized instructions. If you have to transform a lot of values, you might also have a look at AVX2 intrinsics (they can handle multiple values in parallel) – Pepijn Kramer Aug 14 '22 at 12:02
  • @PepijnKramer: No need to switch to `double` there - the `movsd` in your code is equivalent to a `movss`. Worse, if the compiler _can_ vectorize, you get a `movpd` (packed) versus a `movss`, and the latter processes twice as many elements. – MSalters Aug 14 '22 at 12:44
  • The "phase" bit suggests that `std::remquo` might be appropriate. – MSalters Aug 14 '22 at 12:55
  • I don't understand. Your title mentions the C language. The content mentions a non-existing language C/C++ and you have tagged C++. The C and C++ languages are distinct. For example C++ allows function overloading, namespaces and inheritance, C doesn't. Please update your post to the single language you are using. – Thomas Matthews Aug 14 '22 at 18:00
  • 2
    The expression `.5f - x - 0x1.5p23f + 0x1.5p23f + x` may work for you. For `x` in a reasonable range, `.5f - x - 0x1.5p23f` has a result in an interval where the IEEE-754 binary32 format represents only integers (the scale is so big no bits in the significand can scale to values less than one), and so, because of the `.5f`, it rounds down (in magnitude) from what `-x - 0x1.5p23f` would be. Then `+ 0x1.5p23f` removes the bias, and adding `x` cancels the integer part, leaving the fraction. This will sometimes produce 1 where the ideal result might be 0, but they are equivalent modulo 1. – Eric Postpischil Aug 14 '22 at 23:36
  • To those wondering about fmod(). I think I mentioned clearly enough that I need "wrapping" for normalized phase reduction. That is essentially a sawtooth waveform from -oo to +oo, with y ranging from 0 to (excluding) one and a period of one. fmod() actually returns a waveform mirrored at the origin, and can't be used alone without some conditionals – elena Aug 15 '22 at 12:22
  • @RobertHarvey, the source you pointed to still uses an (int) cast like mine, which can be CPU-demanding depending on the architecture. That's why I was wondering if a more efficient method based on bit manipulation existed – elena Aug 15 '22 at 12:27
  • @EricPostpischil thanks for your suggestion, that is promising as well as fully relevant to my question. Sounds very interesting. I might choose this as a solution, possibly with some modification - still investigating – elena Aug 15 '22 at 12:38
  • @EricPostpischil btw the sign is always known. The range is 0 to (excluding) one (exactly as, with phase reduction, the range is usually zero to (excluding) 2π). The original value range is not usually huge; it can be from -1000 to 1000 or so, often much less, i.e. high precision at huge values is not required. The float format is that of common Intel CPUs, since my software is intended for Windows platforms (although a Mac port may be planned in the future, by a friend of mine, but in that case I can simply use conditional compilation if needed) - I am no expert of the Mac world – elena Aug 15 '22 at 12:46
  • @PepijnKramer the code you pointed to also makes use of a cast conversion like mine, which AFAIK is not always very efficient on every CPU (but still way faster than fmod(), usually) – elena Aug 15 '22 at 12:49
  • The only way you're going to know for sure is to run some performance tests. – Robert Harvey Aug 15 '22 at 13:56
  • @EricPostpischil after much testing I must say your trick is brilliant. It just needs additional code to force the result to 0, since it returns 1 on odd integers. So: x = .5f -x -0x1.5p23f +0x1.5p23f +x; return x -(float)(x==1.f); It is about 3 times faster than mine! Only big drawback: one cannot use math optimizations like -ffast-math, because it reassociates the elements, turning the expression into just x = .5f! – elena Aug 15 '22 at 22:18
  • 2
    Argh, I mixed decimal and hexadecimal; `0x1.5p23f` should be `0x1.8p23f`. `0x1.5p23f` will work fine within a large interval, but `0x1.8p23f` is the center of the binade, improving the coverage. – Eric Postpischil Aug 17 '22 at 00:38
  • @EricPostpischil thanks, I will try with 1.8. Btw, have you any idea how to smartly write out the corresponding code so that no compiler would ever optimize the expression to just .5f? One option would be using GCC-specific #pragma push/pop to disable certain optimizations locally, but that would not be portable in case my code is used on another compiler... – elena Aug 19 '22 at 09:07
  • 2
    @elena: You can avoid the optimizations in a conforming compiler by writing `t = .5f - x; t -= 0x1.8p23f; t += 0x1.8p23f; t += x;`. C requires assignments and casts to “discard” excess precision, meaning the compiler cannot combine the arithmetic from consecutive assignments (and casts) into one expression that it can treat as evaluated mathematically with extra precision. However, one reason I have not written up this method as an answer is there are cases for which it is “wrong,” depending on what is acceptable in the application it is needed for… – Eric Postpischil Aug 19 '22 at 11:12
  • 1
    … For example, with 1−2^−24, the correct result is 2^−24, but the kludge produces 1, which you may convert to 0. This is one reason I asked abut a symmetric range, using [−1/2, +1/2], instead of [0, 1]. It may be easier to get good results with a symmetric range. – Eric Postpischil Aug 19 '22 at 11:14
  • @EricPostpischil using 1.8 vs 1.5 gives apparently the same result. Also, using consecutive assignments (something I had tried already, btw) still results in the whole expression being evaluated as .5f, at least when using -ffast-math in GCC. I can avoid -ffast-math even if it turned out useful somewhere else, but there is still the risk that a port not done by me (e.g. on macOS with Xcode) could still cause the same problem. – elena Aug 19 '22 at 18:46
  • Btw, which formula would you suggest for a range [-1/2 +1/2] ? – elena Aug 19 '22 at 18:48
  • 1
    @elena: For [−½, +½], you can use `float t = x + 0x1.8p23f; t -= 0x1.8p23f; return x-t;`, and testing shows this works for x in [−2^22, +2^22]. – Eric Postpischil Aug 19 '22 at 19:04
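Pulling the comment thread together, a minimal C sketch of the magic-constant trick (the function name is mine), using the corrected `0x1.8p23f` constant and the separate assignments Eric Postpischil suggests so a conforming compiler cannot reassociate the expression away:

```c
/* Wrap x into [0, 1) via the bias trick from the comments.
   Valid roughly for |x| up to 2^22; do NOT compile with -ffast-math,
   which would reassociate the whole thing into just .5f. */
static inline float wrap01_bias(float x)
{
    float t = .5f - x;
    t -= 0x1.8p23f;  /* push into a binade where only integers are representable */
    t += 0x1.8p23f;  /* round trip leaves the rounded value, removing the bias */
    t += x;          /* cancel the integer part, leaving the fraction */
    return t - (float)(t == 1.f);  /* odd integers yield 1; map it back to 0 */
}
```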

1 Answer


A prototype in C++ (I am not up to date with C). The padding logic is still not optimized, but if you have AVX-512 on your system you could do something like this to process 8 doubles, or 16 floats, per loop iteration. I found a lot of useful information here: intrinsics cheat sheet

I used the MSVC compiler from Visual Studio 2022.

#include <type_traits>
#include <vector>
#include <immintrin.h>


void reduce_phases(std::vector<double>& inputs)
{
    // 512 bits = 64 bytes, so one AVX-512 vector holds 8 doubles
    static constexpr std::size_t vector_size = 64ul / sizeof(double);
    auto number_to_pad = (vector_size - inputs.size() % vector_size) % vector_size;
    inputs.insert(inputs.end(), number_to_pad, 0.0);

    auto data_ptr = inputs.data();
    
    for (std::size_t n{ 0ul }; n < inputs.size(); n += vector_size, data_ptr += vector_size)
    {
        // unaligned load/store: std::vector storage is not guaranteed 64-byte aligned
        auto values = _mm512_loadu_pd(data_ptr);
        auto floors = _mm512_floor_pd(values);
        auto result = _mm512_sub_pd(values, floors);
        _mm512_storeu_pd(data_ptr, result);
    }

    inputs.erase(inputs.end() - number_to_pad, inputs.end());
}

void reduce_phases(std::vector<float>& inputs)
{
    // 512 bits = 64 bytes, so one AVX-512 vector holds 16 floats
    static constexpr std::size_t vector_size = 64ul / sizeof(float);

    auto number_to_pad = (vector_size - inputs.size() % vector_size) % vector_size;
    inputs.insert(inputs.end(), number_to_pad, 0.0f);

    auto data_ptr = inputs.data();

    for (std::size_t n{ 0ul }; n < inputs.size(); n += vector_size, data_ptr += vector_size)
    {
        // unaligned load/store: std::vector storage is not guaranteed 64-byte aligned
        auto values = _mm512_loadu_ps(data_ptr);
        auto floors = _mm512_floor_ps(values);
        auto result = _mm512_sub_ps(values, floors);
        _mm512_storeu_ps(data_ptr, result);
    }

    inputs.erase(inputs.end() - number_to_pad, inputs.end());
}


int main()
{
    std::vector<double> values{ -1.1, -1.9, -1.5, -0.4, 0.0, 0.4, 1.5, 1.9, 2.1 };
    reduce_phases(values);

    std::vector<float> float_values{ -1.1f, -1.9f, -1.5f, -0.4f, 0.0f, 0.4f, 1.5f, 1.9f, 2.1f };
    reduce_phases(float_values);

    return 0;
}
Pepijn Kramer