47

I have a few heavily optimized math functions that take 1-2 nanoseconds to complete. These functions are called hundreds of millions of times per second, so call overhead is a concern, despite the already-excellent performance.

To keep the program maintainable, the classes that provide these methods implement an IMathFunction interface, so that other objects can store a specific math function directly and use it when needed.

public interface IMathFunction
{
  double Calculate(double input);
  double Derivate(double input);
}

public class SomeObject
{
  // Note: There are cases where this is mutable
  private readonly IMathFunction mathFunction_; 

  public double SomeWork(double input, double step)
  {
    var f = mathFunction_.Calculate(input);
    var dv = mathFunction_.Derivate(input);
    return f - (dv * step);
  }
}

This interface causes an enormous overhead compared to a direct call because of how the consuming code uses it. A direct call takes 1-2ns, whereas the virtual interface call takes 8-9ns. Evidently, dispatching the call through the interface is the bottleneck in this scenario.

I would like to retain both maintainability and performance if possible. Is there a way I can resolve the virtual function to a direct call when the object is instantiated so that all subsequent calls are able to avoid the overhead? I assume this would involve creating delegates with IL, but I wouldn't know where to start with that.

Haus
  • 5
    How do you measure the nano second timing? – z3nth10n Dec 14 '18 at 19:38
  • 3
    @z3nth10n BenchmarkDotNet. It warms up and pre-JIT's everything before benchmarking. This is in release mode. Profiling with dotTrace also shows similar results. – Haus Dec 14 '18 at 19:39
  • Does `[MethodImpl(MethodImplOptions.AggressiveInlining)]` make a difference at all? –  Dec 14 '18 at 19:40
  • Thanks, I would like to use this in my projects, because I'm in a similar situation. – z3nth10n Dec 14 '18 at 19:40
  • @JeroenMostert All math functions (47 of them) provide `Calculate()` and `Derivate()`. I want objects to be able to use any one of these at any time. The classes are pre-instantiated as static fields. – Haus Dec 14 '18 at 19:41
  • 1
    Have you looked at the generated IL? I'm not an expert in this regard, but my understanding is that both virtual and non-virtual calls in C# and VB use the `callvirt` instruction (in order to fail properly when the object instance associated with the call is `null`). Your question surprised me. – Flydog57 Dec 14 '18 at 19:46
  • 1
    Is there any feasible way you can offload the core of the math functions to C or C++? That way you can have a struct with a function pointer in it, which you'd only have to fill once, and you'd never see a vtable lookup (well, in C, not sure about C++). When we're talking nanosecond-level timings, it could be worth it. – Chris Akridge Dec 14 '18 at 19:47
  • 1
    @Flydog57: You are correct that the IL generated is often callvirt when a null check is needed even if the method is not virtual. But the expense here is almost certainly the virtual indirection, which will not happen if the member is not actually virtual. – Eric Lippert Dec 14 '18 at 19:47
  • It's taking extra time because you are instantiating an object and not just calling your math function. Have you considered using a functional programming language? F#? – Glenn Ferrie Dec 14 '18 at 19:49
  • 4
    Cory Nelson's solution is good and that is what I would pursue. It might be worthwhile however to do a quick check and see what the performance of saving the function into a delegate is, and then invoking the delegate. That also has some indirection, but it might be slightly smaller than the interface indirection. – Eric Lippert Dec 14 '18 at 19:49
  • @ChrisAkridge It is a possibility. Your suggestion also gave me the idea of turning `IMathFunction` into a struct with function pointers, and then have each math function as a static value of that struct. I am going to try that first. – Haus Dec 14 '18 at 19:50
  • @Haus C#'s equivalent of function pointers would be the `Func` delegate type. I don't know if there's a way to get C-style function pointers in `unsafe` mode. I'll take a look at what Try Roslyn spits out IL-wise, as delegate invocation isn't really that cheap, either. – Chris Akridge Dec 14 '18 at 19:53
  • @Haus: So, basically, implement the vtable of the interface but without any of the overhead of the interface runtime type information. Good idea. – Eric Lippert Dec 14 '18 at 19:53
  • 1
    @ChrisAkridge: Unfortunately, there's a fair amount of overhead associated with native calls also. I wrote a prototype of a C# compiler feature whereby you could use the `calli` instruction -- one of the few IL instructions that the C# compiler never generates -- to do a higher-performance call to a native function pointer, but it was never adopted into mainstream C#. A bunch of people who used to work on research features like that are now on the C# team, so perhaps that work will be revived. – Eric Lippert Dec 14 '18 at 19:55
  • 1
    @ChrisAkridge: That said, there is nothing stopping you from using Reflection Emit to make a static method that does a `calli` to the unmanaged pointer of your choice. Or write the method in CIL and ILASM it into its own assembly. – Eric Lippert Dec 14 '18 at 19:57
  • Yeah, I just looked at SharpLab. The call to a delegate emits a `callvirt` to the Func's `Invoke` method, which adds at least one layer of indirection and will probably take quite a bit longer. With what @EricLippert said about native calls, @Haus, you might also want to offload the processing loop to C, should you choose to move the math code out. Or emit `calli`, as Eric said. – Chris Akridge Dec 14 '18 at 19:58
  • For the cases where mathFunction is mutable, how/where is it changed? – bcwhims Dec 14 '18 at 21:17
  • @bcwhims `mathFunction` is mutable when it is being used as a part of a genetic algorithm. Usually in that case a different `IMathFunction` is selected at random and applied to each offspring. – Haus Dec 14 '18 at 21:20
  • If the problem you solve with the interface is indeed just (programming/compile time) maintainability, you could consider generating (interface-free) source code as needed. The abstraction and maintainability would be in the generator (probably written in a script language or possibly in C# as well) plus configuration data. – Peter - Reinstate Monica Dec 15 '18 at 00:30
  • Another thing I'd be curious about are the reasons to do this performance critical work in C#. I'm not partial but just curious: Do you have comparisons with C or possibly Fortran? – Peter - Reinstate Monica Dec 15 '18 at 00:36
  • You can merge all the `Calculate` functions into a single unified function with an additional integer argument that specifies which of the individual `Calculate` functions to execute. `mathFunction_` would then hold the number that identifies the function to be called, and it is passed to the unified function. The unified function can use either a switch statement or a series of if/else statements (you can test both). Optimize the unified function with managed profile guided optimization (MPGO). The `Derivate` functions can be handled similarly. BTW, how many `IMathFunction` implementations are there? – Hadi Brais Dec 15 '18 at 16:29
  • how many implementations do you have for `IMathFunction`? – Ron Klein Dec 24 '18 at 05:18
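
The unified-function idea from the comment above could look roughly like the following. This is only a sketch under assumptions: the enum `MathFunctionKind`, the class `MathFunctions`, the class `SwitchingSomeObject`, and the two sample functions are hypothetical stand-ins for the 47 real ones, and whether a switch actually beats an interface call here would have to be measured.

```csharp
using System;

// Hypothetical identifiers for the math functions; only two are shown.
public enum MathFunctionKind { Square, Sine }

public static class MathFunctions
{
    // Single entry point that selects the function with a switch
    // instead of an interface call.
    public static double Calculate(MathFunctionKind kind, double input)
    {
        switch (kind)
        {
            case MathFunctionKind.Square: return input * input;
            case MathFunctionKind.Sine: return Math.Sin(input);
            default: throw new ArgumentOutOfRangeException(nameof(kind));
        }
    }

    public static double Derivate(MathFunctionKind kind, double input)
    {
        switch (kind)
        {
            case MathFunctionKind.Square: return 2 * input;
            case MathFunctionKind.Sine: return Math.Cos(input);
            default: throw new ArgumentOutOfRangeException(nameof(kind));
        }
    }
}

public sealed class SwitchingSomeObject
{
    // Mutable on purpose: changing the function is just assigning another enum value.
    public MathFunctionKind Kind { get; set; }

    public double SomeWork(double input, double step)
    {
        var f = MathFunctions.Calculate(Kind, input);
        var dv = MathFunctions.Derivate(Kind, input);
        return f - (dv * step);
    }
}
```

Because the selector is a plain value, this variant also covers the case where the function changes at run time, at the cost of one branch per call.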

2 Answers

52

You can cause the JIT to devirtualize your interface calls by using a struct with a constrained generic.

public class SomeObject<TMathFunction> where TMathFunction : struct, IMathFunction
{
  private readonly TMathFunction mathFunction_;

  public double SomeWork(double input, double step)
  {
    var f = mathFunction_.Calculate(input);
    var dv = mathFunction_.Derivate(input);
    return f - (dv * step);
  }
}

// ...

var obj = new SomeObject<CoolMathFunction>();
obj.SomeWork(x, y);

Here are the important pieces to note:

  • The implementation of the IMathFunction interface, CoolMathFunction, is known at compile-time through a generic. This limits the applicability of this optimization quite a bit.
  • A generic parameter type TMathFunction is called directly rather than the interface IMathFunction.
  • The generic is constrained to implement IMathFunction so we can call those methods.
  • The generic is constrained to a struct. The code would still compile and run without this constraint, but the optimization depends on how the JIT generates code for generics, and we only get it when the type argument is a struct.

When generics are instantiated, codegen is different depending on the generic parameter being a class or a struct. For classes, every instantiation actually shares the same code and is done through vtables. But structs are special: they get their own instantiation that devirtualizes the interface calls into calling the struct's methods directly, avoiding any vtables and enabling inlining.

This feature exists to avoid boxing value types into reference types every time you call a generic. It avoids allocations and is a key factor in List<T> etc. being an improvement over the non-generic ArrayList etc.

Some implementation:

I made a simple implementation of IMathFunction for testing:

class SomeImplementationByRef : IMathFunction
{
    public double Calculate(double input)
    {
        return input + input;
    }

    public double Derivate(double input)
    {
        return input * input;
    }
}

... as well as a struct version and an abstract version.
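
The struct version is not shown above. A minimal sketch of what it might look like, together with the constrained generic that consumes it, follows; the name `SomeImplementationByValue` is an assumption, and the arithmetic simply mirrors `SomeImplementationByRef`.

```csharp
using System;

public interface IMathFunction
{
    double Calculate(double input);
    double Derivate(double input);
}

// Hypothetical struct counterpart of SomeImplementationByRef.
public readonly struct SomeImplementationByValue : IMathFunction
{
    public double Calculate(double input) => input + input;
    public double Derivate(double input) => input * input;
}

public sealed class SomeObject<TMathFunction>
    where TMathFunction : struct, IMathFunction
{
    // default(TMathFunction) is sufficient here because the struct holds no state.
    private readonly TMathFunction mathFunction_ = default;

    public double SomeWork(double input, double step)
    {
        // Calls on a struct type parameter need no interface dispatch,
        // so the JIT is free to inline them.
        var f = mathFunction_.Calculate(input);
        var dv = mathFunction_.Derivate(input);
        return f - (dv * step);
    }
}
```

Usage is `new SomeObject<SomeImplementationByValue>().SomeWork(x, y)`; the concrete function type has to be named at the call site, which is the compile-time limitation noted in the bullet list.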

So, here's what happens with the interface version. You can see it is relatively inefficient because it performs two levels of indirection:

    return obj.SomeWork(input, step);
sub         esp,40h  
vzeroupper  
vmovaps     xmmword ptr [rsp+30h],xmm6  
vmovaps     xmmword ptr [rsp+20h],xmm7  
mov         rsi,rcx
vmovsd      qword ptr [rsp+60h],xmm2  
vmovaps     xmm6,xmm1
mov         rcx,qword ptr [rsi+8]          ; load mathFunction_ into rcx.
vmovaps     xmm1,xmm6  
mov         r11,7FFED7980020h              ; load vtable address of the IMathFunction.Calculate function.
cmp         dword ptr [rcx],ecx  
call        qword ptr [r11]                ; call IMathFunction.Calculate function which will call the actual Calculate via vtable.
vmovaps     xmm7,xmm0
mov         rcx,qword ptr [rsi+8]          ; load mathFunction_ into rcx.
vmovaps     xmm1,xmm6  
mov         r11,7FFED7980028h              ; load vtable address of the IMathFunction.Derivate function.
cmp         dword ptr [rcx],ecx  
call        qword ptr [r11]                ; call IMathFunction.Derivate function which will call the actual Derivate via vtable.
vmulsd      xmm0,xmm0,mmword ptr [rsp+60h] ; dv * step
vsubsd      xmm7,xmm7,xmm0                 ; f - (dv * step)
vmovaps     xmm0,xmm7  
vmovaps     xmm6,xmmword ptr [rsp+30h]  
vmovaps     xmm7,xmmword ptr [rsp+20h]  
add         rsp,40h  
pop         rsi  
ret  

Here's an abstract class. It's a little more efficient but only negligibly:

        return obj.SomeWork(input, step);
 sub         esp,40h  
 vzeroupper  
 vmovaps     xmmword ptr [rsp+30h],xmm6  
 vmovaps     xmmword ptr [rsp+20h],xmm7  
 mov         rsi,rcx  
 vmovsd      qword ptr [rsp+60h],xmm2  
 vmovaps     xmm6,xmm1  
 mov         rcx,qword ptr [rsi+8]           ; load mathFunction_ into rcx.
 vmovaps     xmm1,xmm6  
 mov         rax,qword ptr [rcx]             ; load object type data from mathFunction_.
 mov         rax,qword ptr [rax+40h]         ; load address of vtable into rax.
 call        qword ptr [rax+20h]             ; call Calculate via offset 0x20 of vtable.
 vmovaps     xmm7,xmm0  
 mov         rcx,qword ptr [rsi+8]           ; load mathFunction_ into rcx.
 vmovaps     xmm1,xmm6  
 mov         rax,qword ptr [rcx]             ; load object type data from mathFunction_.
 mov         rax,qword ptr [rax+40h]         ; load address of vtable into rax.
 call        qword ptr [rax+28h]             ; call Derivate via offset 0x28 of vtable.
 vmulsd      xmm0,xmm0,mmword ptr [rsp+60h]  ; dv * step
 vsubsd      xmm7,xmm7,xmm0                  ; f - (dv * step)
 vmovaps     xmm0,xmm7
 vmovaps     xmm6,xmmword ptr [rsp+30h]  
 vmovaps     xmm7,xmmword ptr [rsp+20h]  
 add         rsp,40h  
 pop         rsi  
 ret  

So both an interface and an abstract class rely heavily on branch target prediction to have acceptable performance. Even then, you can see there's quite a lot more going into it, so the best-case is still relatively slow while the worst-case is a stalled pipeline due to a mispredict.

And finally here's the generic version with a struct. You can see it's massively more efficient because everything has been fully inlined so there's no branch prediction involved. It also has the nice side effect of removing most of the stack/parameter management that was in there too, so the code becomes very compact:

    return obj.SomeWork(input, step);
push        rax  
vzeroupper  
movsx       rax,byte ptr [rcx+8]  
vmovaps     xmm0,xmm1  
vaddsd      xmm0,xmm0,xmm1  ; Calculate - got inlined
vmulsd      xmm1,xmm1,xmm1  ; Derivate - got inlined
vmulsd      xmm1,xmm1,xmm2  ; dv * step
vsubsd      xmm0,xmm0,xmm1  ; f - (dv * step)
add         rsp,8  
ret  
Cory Nelson
  • This is a very clever solution! Thanks for the post. It absolutely works for classes where `mathFunction_` is `readonly`, but I also have cases where it is a mutable field. Nevertheless, I will experiment with this. – Haus Dec 14 '18 at 19:45
  • What is that trailing underscore there for? – Robert Harvey Dec 14 '18 at 22:16
  • @RobertHarvey it's part of the field's name – Cory Nelson Dec 14 '18 at 22:20
  • Yes, I get that. What is its purpose? – Robert Harvey Dec 14 '18 at 22:22
  • 3
    @RobertHarvey i assume it is a convention denoting a private field. It is part of the question's code. – Cory Nelson Dec 14 '18 at 22:26
  • 3
    I don't think this will actually do anything to optimize it. From what I understand of CLR internals (and, admittedly it may have changed in the last 3 or so years), the CLR will only generate one version of the generic class internally. i.e. it will generate an IMathFunction call rather than a direct call. If TMathFunction is a struct, it will generate separate code for each type of TMathFunction. The only way I could think to optimize it all the way would be to use an abstract base class (because virtual calls are faster than interface calls) or require where T: struct, IMathFunction. – Colorfully Monochrome Dec 14 '18 at 23:28
  • 1
    @ColorfullyMonochrome is right: you need to use a struct to get a separate reification for each type. Adding `where TMathFunction : struct, IMathFunction` is the way to go to guarantee optimal code. We use this in some places where perf matters a lot, and we even have run-time codegen to automate this where it would otherwise go out of control ([example here](https://github.com/disruptor-net/Disruptor-net/blob/master/src/Disruptor/Internal/StructProxy.cs)). – Lucas Trzesniewski Dec 15 '18 at 18:28
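
For the mutable case raised in the first comment, one possible approach (not part of the answer, and only applicable if the work can be done in batches) is to pay a single virtual call per batch and keep the struct-constrained generic inside the loop. All names below (`BatchRunner`, `MutableSomeObject`, `SquareFunction`, `DoubleFunction`) are hypothetical.

```csharp
using System;

public interface IMathFunction
{
    double Calculate(double input);
    double Derivate(double input);
}

// Hypothetical stateless struct implementations.
public struct SquareFunction : IMathFunction
{
    public double Calculate(double input) => input * input;
    public double Derivate(double input) => 2 * input;
}

public struct DoubleFunction : IMathFunction
{
    public double Calculate(double input) => input + input;
    public double Derivate(double input) => 2;
}

// Non-generic handle, so a field or property of this type can be reassigned at run time.
public abstract class BatchRunner
{
    public abstract double SumSomeWork(double[] inputs, double step);
}

// One virtual call per batch; inside the loop the calls go to a struct type parameter.
public sealed class BatchRunner<TMathFunction> : BatchRunner
    where TMathFunction : struct, IMathFunction
{
    public override double SumSomeWork(double[] inputs, double step)
    {
        var function = default(TMathFunction);
        double sum = 0;
        foreach (var input in inputs)
        {
            var f = function.Calculate(input);
            var dv = function.Derivate(input);
            sum += f - (dv * step);
        }
        return sum;
    }
}

public sealed class MutableSomeObject
{
    // Swapping the function means assigning a different runner.
    public BatchRunner Runner { get; set; } = new BatchRunner<SquareFunction>();

    public double SomeWorkBatch(double[] inputs, double step) =>
        Runner.SumSomeWork(inputs, step);
}
```

The indirection is still there, but its cost is spread over the whole batch instead of being paid on every element.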
9

I would assign the methods to delegates. This allows you to still program against the interface, while avoiding the interface method resolution.

public class SomeObject
{
    private readonly Func<double, double> _calculate;
    private readonly Func<double, double> _derivate;

    public SomeObject(IMathFunction mathFunction)
    {
        _calculate = mathFunction.Calculate;
        _derivate = mathFunction.Derivate;
    }

    public double SomeWork(double input, double step)
    {
        var f = _calculate(input);
        var dv = _derivate(input);
        return f - (dv * step);
    }
}

In response to @CoryNelson's comment I made tests to see what the impact really is. I have sealed the function class, but this seems to make absolutely no difference since my methods are not virtual.

Test results (mean time per iteration in ns over 100 million iterations), with the time minus the empty-method time in parentheses:

Empty Work method: 1.48
Interface: 5.69 (4.21)
Delegates: 5.78 (4.30)
Sealed Class: 2.10 (0.62)
Class: 2.12 (0.64)

The delegate version takes about the same time as the interface version (the exact times vary from one test run to the next), whereas working against the class directly is about 6.8 times faster (4.21 / 0.62, comparing the times with the empty-method time subtracted). This means that my suggestion to work with delegates was not helpful!

What surprised me is that the interface version was not slower than this; I had expected a much longer execution time for it. Since this kind of test does not represent the exact context of the OP's code, its validity is limited.

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class TimingInterfaceVsDelegateCalls
{
    const int N = 100_000_000;
    const double msToNs = 1e6 / N;

    static SquareFunctionSealed _mathFunctionClassSealed;
    static SquareFunction _mathFunctionClass;
    static IMathFunction _mathFunctionInterface;
    static Func<double, double> _calculate;
    static Func<double, double> _derivate;

    static TimingInterfaceVsDelegateCalls()
    {
        _mathFunctionClass = new SquareFunction();
        _mathFunctionClassSealed = new SquareFunctionSealed();
        _mathFunctionInterface = _mathFunctionClassSealed;
        _calculate = _mathFunctionInterface.Calculate;
        _derivate = _mathFunctionInterface.Derivate;
    }

    interface IMathFunction
    {
        double Calculate(double input);
        double Derivate(double input);
    }

    sealed class SquareFunctionSealed : IMathFunction
    {
        public double Calculate(double input)
        {
            return input * input;
        }

        public double Derivate(double input)
        {
            return 2 * input;
        }
    }

    class SquareFunction : IMathFunction
    {
        public double Calculate(double input)
        {
            return input * input;
        }

        public double Derivate(double input)
        {
            return 2 * input;
        }
    }

    public static void Test()
    {
        var stopWatch = new Stopwatch();

        stopWatch.Start();
        for (int i = 0; i < N; i++) {
            double result = SomeWorkEmpty(i);
        }
        stopWatch.Stop();
        double emptyTime = stopWatch.ElapsedMilliseconds * msToNs;
        Console.WriteLine($"Empty Work method: {emptyTime:n2}");

        stopWatch.Restart();
        for (int i = 0; i < N; i++) {
            double result = SomeWorkInterface(i);
        }
        stopWatch.Stop();
        PrintResult("Interface", stopWatch.ElapsedMilliseconds, emptyTime);

        stopWatch.Restart();
        for (int i = 0; i < N; i++) {
            double result = SomeWorkDelegate(i);
        }
        stopWatch.Stop();
        PrintResult("Delegates", stopWatch.ElapsedMilliseconds, emptyTime);

        stopWatch.Restart();
        for (int i = 0; i < N; i++) {
            double result = SomeWorkClassSealed(i);
        }
        stopWatch.Stop();
        PrintResult("Sealed Class", stopWatch.ElapsedMilliseconds, emptyTime);

        stopWatch.Restart();
        for (int i = 0; i < N; i++) {
            double result = SomeWorkClass(i);
        }
        stopWatch.Stop();
        PrintResult("Class", stopWatch.ElapsedMilliseconds, emptyTime);
    }

    private static void PrintResult(string text, long elapsed, double emptyTime)
    {
        Console.WriteLine($"{text}: {elapsed * msToNs:n2} ({elapsed * msToNs - emptyTime:n2})");
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static double SomeWorkEmpty(int i)
    {
        return 0.0;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static double SomeWorkInterface(int i)
    {
        double f = _mathFunctionInterface.Calculate(i);
        double dv = _mathFunctionInterface.Derivate(i);
        return f - (dv * 12.34534);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static double SomeWorkDelegate(int i)
    {
        double f = _calculate(i);
        double dv = _derivate(i);
        return f - (dv * 12.34534);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static double SomeWorkClassSealed(int i)
    {
        double f = _mathFunctionClassSealed.Calculate(i);
        double dv = _mathFunctionClassSealed.Derivate(i);
        return f - (dv * 12.34534);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static double SomeWorkClass(int i)
    {
        double f = _mathFunctionClass.Calculate(i);
        double dv = _mathFunctionClass.Derivate(i);
        return f - (dv * 12.34534);
    }
}

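The purpose of [MethodImpl(MethodImplOptions.NoInlining)] is to keep the SomeWork... methods from being inlined into the test loops; if they were inlined, the compiler could calculate the addresses of the called methods once before the loop instead of on every iteration.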
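
The timing loops above could be factored into a small helper along these lines. This is a sketch, not the code that produced the numbers above: `NanoTimer` and `MeasureNsPerCall` are hypothetical names, the warm-up count is an arbitrary assumption, and the delegate invocation adds its own cost to every measurement, so only differences between variants (for example against an empty baseline) are meaningful.

```csharp
using System;
using System.Diagnostics;

public static class NanoTimer
{
    // Returns the mean time per call of `work` in nanoseconds.
    // `checksum` is the sum of all results from the timed loop; returning it
    // keeps the calls from being optimized away as dead code.
    public static double MeasureNsPerCall(Func<int, double> work, int iterations, out double checksum)
    {
        // Warm-up so that JIT compilation is less likely to be part of the timed loop.
        double warmup = 0;
        for (int i = 0; i < 10_000; i++)
        {
            warmup += work(i);
        }

        double sink = 0;
        long start = Stopwatch.GetTimestamp();
        for (int i = 0; i < iterations; i++)
        {
            sink += work(i);
        }
        long end = Stopwatch.GetTimestamp();

        checksum = sink;
        // Stopwatch.Frequency is ticks per second; convert ticks to ns per call.
        return (end - start) * 1e9 / Stopwatch.Frequency / iterations;
    }
}
```

Using the raw tick count avoids the rounding to whole milliseconds that ElapsedMilliseconds applies.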

Olivier Jacot-Descombes