C++ Low latency Design: Function Dispatch v/s CRTP for Factory implementation

Question

As part of a system design, we need to implement a factory pattern. In combination with the Factory pattern, we are also using CRTP, to provide a base set of functionality which can then be customized by the Derived classes.

Sample code below:

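class FactoryInterface{
    public:
     // virtual destructor so objects can be safely deleted through the interface
     virtual ~FactoryInterface() {}
     virtual void doX() = 0;
};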

//force all derived classes to implement custom_X_impl 
template< typename Derived, typename Base = FactoryInterface>
class CRTP : public Base 
{
    public:
    void doX(){
        // do common processing..... then
        static_cast<Derived*>(this)->custom_X_impl();
    }
};

class Derived: public CRTP<Derived>
{
    public:
        void custom_X_impl(){
        //do custom stuff
        }
};

Although this design is convoluted, it does provide a few benefits. All the calls after the initial virtual function call can be inlined. The call to the derived class's custom_X_impl is resolved at compile time through the static_cast, so it adds no further indirection.
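To make the call path concrete, here is a minimal sketch of how a product might be created and used through the interface. The make_product function, the call_log vector and the demo function are hypothetical names added only for illustration; raw new is used instead of std::make_unique so that it also builds as C++11.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical log, used only to make the call order observable.
std::vector<std::string> call_log;

class FactoryInterface {
public:
    virtual ~FactoryInterface() {}
    virtual void doX() = 0;
};

template <typename Derived, typename Base = FactoryInterface>
class CRTP : public Base {
public:
    // The only virtual dispatch: this overrides FactoryInterface::doX.
    void doX() override {
        call_log.push_back("common");
        // Resolved at compile time, so no second vtable lookup is needed.
        static_cast<Derived*>(this)->custom_X_impl();
    }
};

class Derived : public CRTP<Derived> {
public:
    void custom_X_impl() { call_log.push_back("custom"); }
};

// Hypothetical factory function returning the product behind the interface.
std::unique_ptr<FactoryInterface> make_product() {
    return std::unique_ptr<FactoryInterface>(new Derived());
}

// Usage: one virtual call through the interface, then a direct call.
void demo() {
    call_log.clear();
    std::unique_ptr<FactoryInterface> p = make_product();
    p->doX();
    assert(call_log.size() == 2);
}
```

After demo() runs, call_log holds "common" followed by "custom": the common processing in CRTP::doX runs first, then the derived customization.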

I wrote a comparison program to compare the behavior of a similar implementation (tight loop, repeated calls) using function pointers and virtual functions. This design came out on top with gcc 4.8 at -O2 and -O3.
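For readers who want to try something similar, a tight-loop harness of the kind described could look roughly like the sketch below. This is not the original comparison program; Op, AddOne, add_one, Timing and the time_* functions are hypothetical names. Note that in a single translation unit the optimizer may devirtualize or inline these calls, so the numbers from a toy like this should be read with caution.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Virtual-function variant.
struct Op {
    virtual ~Op() {}
    virtual std::uint64_t apply(std::uint64_t x) = 0;
};
struct AddOne : Op {
    std::uint64_t apply(std::uint64_t x) override { return x + 1; }
};

// Function-pointer variant.
std::uint64_t add_one(std::uint64_t x) { return x + 1; }
typedef std::uint64_t (*OpFn)(std::uint64_t);

struct Timing {
    std::uint64_t result;  // returned so the loop cannot be discarded as dead code
    double ns_per_call;
};

// Runs call(acc) `iterations` times, feeding each result into the next call.
template <typename Call>
Timing time_loop(Call call, std::uint64_t iterations) {
    assert(iterations > 0);
    auto start = std::chrono::steady_clock::now();
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) acc = call(acc);
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return Timing{acc, ns / iterations};
}

Timing time_virtual(Op& op, std::uint64_t n) {
    return time_loop([&op](std::uint64_t x) { return op.apply(x); }, n);
}

Timing time_pointer(OpFn fn, std::uint64_t n) {
    return time_loop([fn](std::uint64_t x) { return fn(x); }, n);
}
```

Calling time_virtual(op, n) and time_pointer(&add_one, n) with the same n gives two ns_per_call figures to compare; both result fields should equal n.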

A C++ guru, however, told me yesterday that any virtual function call in a large running program can take a variable amount of time because of cache misses, and that I could potentially achieve better performance using C-style function table look-ups and gcc hotlisting of functions. However, I still see 2x the cost in my sample program mentioned above.

My questions are as follows:

1. Is the guru's assertion true? Either way, are there any links I can refer to?
2. Is there a low-latency implementation I can refer to in which a base class invokes a custom function in a derived class using function pointers?
3. Any suggestions on improving the design?

Any other feedback is always welcome.

Christophe · Accepted Answer · 2015-03-14T18:46:36.670

Your guru refers to the hot attribute of the gcc compiler. The effect of this attribute is:

The function is optimized more aggressively and on many targets it is placed into a special subsection of the text section so all hot functions appear close together, improving locality.

So yes, in a very large code base, the hotlisted function may remain in cache, ready to be executed without delay, because it avoids cache misses.

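You can use this attribute for member functions as well:

#include <iostream>

struct X {
    // Attribute placed before the declarator: GCC rejects a trailing
    // attribute on a function definition.
    __attribute__((hot)) void test() { std::cout << "hello, world !\n"; }
};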

But...

When you use virtual functions, the compiler generally generates a vtable that is shared between all objects of the class. This table is a table of pointers to functions. And indeed, your guru is right: nothing guarantees that this table remains in cache.

But, if you manually create a "C-style" table of function pointers, the problem is EXACTLY THE SAME. While the function may remain in cache, nothing ensures that your function table remains in cache as well.
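To illustrate the point, here is a minimal sketch of such a hand-written table. The names OpKind, Handler, op_table and dispatch are hypothetical. The dispatch step does the same two things a virtual call does: it loads a pointer from a table in memory, then makes an indirect call through it.

```cpp
#include <cassert>

// Hypothetical operation identifiers, used as indices into the table.
enum OpKind { OP_DOUBLE = 0, OP_NEGATE = 1, OP_COUNT };

static int do_double(int x) { return 2 * x; }
static int do_negate(int x) { return -x; }

typedef int (*Handler)(int);

// Hand-written equivalent of a vtable: an array of function pointers.
// It lives in ordinary read-only data and can be evicted from cache
// just like a compiler-generated vtable.
static const Handler op_table[OP_COUNT] = { do_double, do_negate };

int dispatch(OpKind kind, int x) {
    assert(kind >= 0 && kind < OP_COUNT);
    // Step 1: load the pointer from the table (a possible data-cache miss).
    // Step 2: indirect call through that pointer.
    return op_table[kind](x);
}
```

For example, dispatch(OP_DOUBLE, 21) goes through op_table[0] and returns 42. Marking do_double as hot would affect where the function's code is placed, not where op_table is stored.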

The main difference between the two approaches is that:

- in the case of virtual functions, the compiler knows that the virtual function is a hot spot and could decide to keep the vtable in cache as well (I don't know whether gcc can do this or whether there are plans to do so);
- in the case of the manual function pointer table, your compiler will not easily deduce that the table belongs to a hot spot, so this attempt at manual optimization might very well backfire.

My opinion: never try to optimize by hand what a compiler can do much better.

Conclusion

Trust your benchmarks. And trust your OS: if your function or your data is frequently accessed, chances are high that a modern OS will take this into account in its virtual memory management, whatever the compiler generates.

Thanks Christophe. Your analysis ties in with what I see in the performance tests as well. Do you see any potential optimizations in the design? — Sid, Mar 14 '15 at 18:46
I think it'll be difficult to optimize further ! As doX() has to be virtual, and Derived is customer defined, I find no other alternative with less indirection. — Christophe, Mar 14 '15 at 19:26
