How to optimize a simple numeric type wrapper class in C++?

Question

I am trying to implement a fixed-point class in C++, but I am running into performance problems. I have reduced the problem to a simple wrapper around the float type, and it is still slow. My question is: why is the compiler unable to optimize it fully?

The 'float' version is 50% faster than 'Float'. Why?!

(I use Visual C++ 2008 in the Release configuration, of course, and I have tested all available compiler options.)

See the code below:

#include <cstdio>
#include <cstdlib>
#include "Clock.h"      // just for measuring time

#define real Float      // Option 1
//#define real float        // Option 2

struct Float
{
private:
    float value;

public:
    Float(float value) : value(value) {}
    operator float() { return value; }

    Float& operator=(const Float& rhs)
    {
        value = rhs.value;
        return *this;
    }

    Float operator+ (const Float& rhs) const
    {
        return Float( value + rhs.value );
    }

    Float operator- (const Float& rhs) const
    {
        return Float( value - rhs.value );
    }

    Float operator* (const Float& rhs) const
    {
        return Float( value * rhs.value );
    }

    bool operator< (const Float& rhs) const
    {
        return value < rhs.value;
    }
};

struct Point
{
    Point() : x(0), y(0) {}
    Point(real x, real y) : x(x), y(y) {}

    real x;
    real y;
};

int main()
{
    // Generate data
    const int N = 30000;
    Point points[N];
    for (int i = 0; i < N; ++i)
    {
        points[i].x = (real)(640.0f * rand() / RAND_MAX);
        points[i].y = (real)(640.0f * rand() / RAND_MAX);
    }

    real limit( 20 * 20 );

    // Check how many pairs of points are closer than 20
    Clock clk;

    int count = 0;
    for (int i = 0; i < N; ++i)
    {
        for (int j = i + 1; j < N; ++j)
        {
            real dx = points[i].x - points[j].x;
            real dy = points[i].y - points[j].y;
            real d2 = dx * dx + dy * dy;
            if ( d2 < limit )
            {
                count++;
            }
        }
    }

    double time = clk.time();

    printf("%d\n", count);
    printf("TIME: %lf\n", time);

    return 0;
}
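
The listing depends on "Clock.h", which is not shown. To run the benchmark elsewhere, a stand-in along the following lines could be used. This is only a sketch: it assumes that Clock starts timing on construction and that time() returns elapsed seconds, which matches how the listing uses it but is not confirmed by the original header. It also relies on std::chrono, so it needs a C++11 compiler rather than Visual C++ 2008.

```cpp
#include <chrono>

// Hypothetical replacement for the question's "Clock.h".
// Assumption: timing starts at construction and time() reports seconds.
class Clock
{
public:
    Clock() : start_(std::chrono::steady_clock::now()) {}

    // Seconds elapsed since this object was constructed.
    double time() const
    {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        return elapsed.count();
    }

private:
    std::chrono::steady_clock::time_point start_;
};
```

steady_clock is used because it is monotonic, so a system clock adjustment during the run cannot produce a negative interval.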

Have you turned on the maximum optimization flags? I have seen magic happen when you turn them on. — iammilind, Jul 19 '11 at 09:02
Generate the assembly and check where the differences lie... — Matthieu M., Jul 19 '11 at 09:09
By the way, if your class implements fixed point arithmetic instead of floating point, could you please name it something like "Fixed" instead of "Float"? Your coworkers will thank you. — R. Martinho Fernandes, Jul 19 '11 at 09:14
I have played with many different optimization options without an effect. I have looked at the assembly - there are no 'calls', so inlining works. There are just more instructions. See: https://www.future-processing.com/~mczardybon/FloatVSfloat.png. — Michal Czardybon, Jul 19 '11 at 09:20

score 4 · Accepted Answer · answered Jul 19 '11 at 09:12

IMO, it has to do with optimization flags. I checked your program with g++ on a 64-bit Linux machine. Without any optimization, it gives the same result you reported: a difference of about 50%.

With maximum optimization turned on (i.e. -O4), both versions perform the same. Turn on optimization and check.

iammilind

I have installed GCC and in fact it works well! With GCC the time is 1.13 s, whereas with VC++ it is 1.70 s (float) or 2.58 s (Float). I also discovered that moving 'dx * dx + dy * dy' directly to the condition improves performance on VC++ by 21%! How is it possible that VC++ optimizes so poorly?! I have all possible optimization options turned on and tested many different combinations. – Michal Czardybon Jul 19 '11 at 10:51
Wow... When I switched from 'Win32' to 'x64' platform the execution time dropped down from 2.58 to 0.77 s! And it is the same for 'float' and for 'Float'. – Michal Czardybon Jul 19 '11 at 12:49
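
The comparisons discussed above were made by toggling a #define and rebuilding. A possible alternative is to instantiate the same loop for both types in one program, so that both variants are built with identical settings. The sketch below assumes a hypothetical helper, count_close_pairs, and passes the coordinates as plain float vectors; neither appears in the original code. The wrapper is a trimmed copy of the question's Float, and the distance expression is written directly in the condition, as the comment above describes.

```cpp
#include <cstddef>
#include <vector>

// Trimmed copy of the question's wrapper (conversion operator made const).
struct Float
{
    Float(float v) : value(v) {}
    operator float() const { return value; }

    Float operator+(const Float& rhs) const { return Float(value + rhs.value); }
    Float operator-(const Float& rhs) const { return Float(value - rhs.value); }
    Float operator*(const Float& rhs) const { return Float(value * rhs.value); }
    bool operator<(const Float& rhs) const { return value < rhs.value; }

private:
    float value;
};

// Hypothetical helper: counts pairs whose squared distance is below
// limit_squared, doing the arithmetic in the type Real (float or Float).
template <typename Real>
int count_close_pairs(const std::vector<float>& xs,
                      const std::vector<float>& ys,
                      float limit_squared)
{
    const Real limit(limit_squared);
    int count = 0;
    for (std::size_t i = 0; i < xs.size(); ++i)
    {
        for (std::size_t j = i + 1; j < xs.size(); ++j)
        {
            const Real dx = Real(xs[i]) - Real(xs[j]);
            const Real dy = Real(ys[i]) - Real(ys[j]);
            // Distance test placed directly in the condition.
            if (dx * dx + dy * dy < limit)
            {
                ++count;
            }
        }
    }
    return count;
}
```

Calling count_close_pairs<float>(xs, ys, 400.0f) and count_close_pairs<Float>(xs, ys, 400.0f) on the same data, each wrapped in its own timer, would then give two directly comparable measurements, and the two counts can be checked against each other.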

score 4 · Answer 2 · answered Jul 19 '11 at 09:25

Try not passing by reference. Your class is small enough that the overhead of passing it by reference (yes, there is overhead if the compiler doesn't optimize it out) might be higher than just copying the class. So this...

Float operator+ (const Float& rhs) const
{
   return Float( value + rhs.value );
}

becomes something like this...

Float operator+ (Float rhs) const
{
   rhs.value+=value;
   return rhs;
}

which avoids a temporary object and may avoid some indirection of a pointer dereference.

I tried it - it does not work. It even increases the time by a further 59%. — Michal Czardybon, Jul 19 '11 at 10:40
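
For anyone who wants to repeat that experiment, the by-value idea applied to all of the wrapper's operators might look like the sketch below. The name FloatByValue is hypothetical, chosen only to keep it apart from the original class. Whether it helps depends on the compiler, and the asker's measurement above suggests it did not on Visual C++ 2008.

```cpp
// Hypothetical by-value variant of the question's wrapper.
struct FloatByValue
{
    FloatByValue(float v) : value(v) {}
    operator float() const { return value; }

    FloatByValue operator+(FloatByValue rhs) const
    {
        rhs.value += value;
        return rhs;
    }

    FloatByValue operator-(FloatByValue rhs) const
    {
        // Subtraction is not commutative, so compute (this - rhs)
        // explicitly instead of reusing the "+=" pattern.
        rhs.value = value - rhs.value;
        return rhs;
    }

    FloatByValue operator*(FloatByValue rhs) const
    {
        rhs.value *= value;
        return rhs;
    }

    bool operator<(FloatByValue rhs) const { return value < rhs.value; }

private:
    float value;
};
```

The subtraction operator is the one place where reusing the copied argument needs care, since writing rhs.value -= value would compute the operands in the wrong order.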

Captain Obvlious · Answer 3 · 2011-07-19T11:11:32.560

After further investigation I am thoroughly convinced this is an issue with the optimization pipeline of the compiler. The code generated in this instance is significantly worse than the code generated for a non-encapsulated float. My suggestion is to report this potential issue to Microsoft and see what they have to say about it. I also suggest that you move on to implementing your planned fixed-point version of this class, as the code generated for integers appears optimal.
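
The question does not show the planned fixed-point class, so the following is only an illustrative sketch of what an integer-backed wrapper with the same operators could look like. The name Fixed, the 24.8 bit split, and the helper names are all assumptions. Eight fraction bits were picked so that the benchmark's squared distances (up to roughly 820,000) fit in the integer part; a real implementation would choose the format to suit its own range and precision needs.

```cpp
#include <cstdint>

// Hypothetical 24.8 fixed-point wrapper; format and names are assumptions.
struct Fixed
{
    static constexpr int FRACTION_BITS = 8;

    Fixed() : raw(0) {}

    // Converts by scaling and truncating toward zero.
    explicit Fixed(float v)
        : raw(static_cast<std::int32_t>(v * (1 << FRACTION_BITS))) {}

    float to_float() const
    {
        return static_cast<float>(raw) / (1 << FRACTION_BITS);
    }

    // Overflow is not checked in this sketch.
    Fixed operator+(Fixed rhs) const { return from_raw(raw + rhs.raw); }
    Fixed operator-(Fixed rhs) const { return from_raw(raw - rhs.raw); }

    Fixed operator*(Fixed rhs) const
    {
        // Widen to 64 bits so the intermediate product cannot overflow,
        // then drop the extra fraction bits. Right-shifting a negative
        // value is implementation-defined before C++20 (arithmetic on
        // common compilers).
        const std::int64_t product = static_cast<std::int64_t>(raw) * rhs.raw;
        return from_raw(static_cast<std::int32_t>(product >> FRACTION_BITS));
    }

    bool operator<(Fixed rhs) const { return raw < rhs.raw; }

private:
    static Fixed from_raw(std::int32_t r)
    {
        Fixed f;
        f.raw = r;
        return f;
    }

    std::int32_t raw;
};
```

Addition, subtraction and comparison reduce to single integer instructions here, and multiplication to one widening multiply and a shift.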
