
I'm using a DAG (directed acyclic graph) to represent and evaluate expressions; each node represents an operation (+, -, /, *, accumulate, etc.) and the entire expression is evaluated by sequentially evaluating each node in topologically sorted order. Each node inherits from a base class RefNode and implements a virtual function, evaluate, according to the operator it represents. The Node class is templated on a functor that represents the operator. The node evaluation order is maintained in a vector<RefNode*>, with ->evaluate() calls made on each element.

Some quick profiling shows that a virtual evaluate slows down an addition node by a factor of 2x [1], either from the call overhead or from thrashing the branch predictor.

As a first step, I encoded the type information as an integer and used static_cast accordingly. This did help, but it's clunky and I'd rather not jump around in the hot portion of my code.
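For context, a minimal sketch of what that tag-based experiment might have looked like; the enum, the kind field and the AddNode name are hypothetical, not the actual code:

// Hypothetical sketch: dispatch on an integer tag instead of a vtable.
enum NodeKind { KIND_ADD, KIND_MUL /*, ... */ };

struct TaggedNode {
    NodeKind kind;   // type tag, written once at construction
    double   output;
};

struct AddNode : TaggedNode {
    double* inputs[2];
};

inline void evaluate(TaggedNode* n) {
    switch (n->kind) {                 // branch on the tag instead of a virtual call
    case KIND_ADD: {
        AddNode* a = static_cast<AddNode*>(n);
        a->output = *a->inputs[0] + *a->inputs[1];
        break;
    }
    default:
        break;                         // other operations elided
    }
}

The current, virtual-dispatch version of the code follows.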

struct RefNode {
    double output;
    // Virtual hook overridden by every concrete node type.
    inline virtual void evaluate() {}
};

// NODE_INPUT_BUFFER_LENGTH is assumed to be defined elsewhere.
template<class T>
struct Node : RefNode {
    double* inputs[NODE_INPUT_BUFFER_LENGTH];
    T evaluator;
    inline void evaluate() override { evaluator(inputs, output); }
};

struct Add {
    inline void operator()(double** inputs, double& output)
    {
        output = *inputs[0] + *inputs[1];
    }
};

An evaluation may look like:

Node<Add>* node_1 = ...
Node<Add>* node_2 = ...
std::vector<RefNode*> eval_vector;

eval_vector.push_back(node_1);
eval_vector.push_back(node_2);

for (auto&& n : eval_vector) {
    n->evaluate();
}

I have the following questions, bearing in mind performance is critical:

  1. How can I avoid virtual functions in this situation?
  2. If I can't, how can I change the way I represent an expression graph so that it supports multiple operations, some of which must hold state, while avoiding virtual function calls?
  3. How do other frameworks such as Tensorflow/Theano represent computational graphs?

[1] A single addition operation on my system takes ~2.3 ns with virtual functions and ~1.1 ns without. While this is small, the computational graph consists mostly of addition nodes, so there is a good portion of time to be saved.

  • You will have to use templates. However, this requires the graph to be known at compile time. If the calculation graph is constructed at runtime (say, from user input) then I don't see a way to avoid virtual calls. – MadScientist Feb 27 '17 at 14:09
  • To avoid virtual functions you may use the curiously recurring template pattern (CRTP). – Andrew Kashpur Feb 27 '17 at 14:20
  • Thanks Andrew. I did have a look at CRTP but I struggled to see how I could use it in this case. Would you mind expanding a little? – user1893603 Feb 27 '17 at 14:26
  • How big is your graph? Is it known at compile time? Is the graph constant in consecutive calls? Can you simplify it automatically? Can you generate ASM code dynamically? What about other representations, e.g. a stack? There are so many questions and solutions. – knivil Feb 27 '17 at 14:36
  • Never mind CRTP, you can do this without inheritance: you'd have a class with the fields input, output, and evaluator. In the ctor you assign the proper function pointer to the evaluator field (you pass node_type as a parameter to the ctor); a sketch of this appears after these comments. – Andrew Kashpur Feb 27 '17 at 15:00
  • The largest graph is probably in the 1000s of nodes; it's not known at compile time, but once built it is static. The order of calls will remain the same, however a node will only be evaluated if its dependants change in value. I'm certainly open to other representations if you have any suggestions; performance is the main requirement. I can generate ASM on the fly but I'm afraid I'm not sure where to start. – user1893603 Feb 27 '17 at 15:02
  • @Andrew Kashpur Ah, that makes sense. Currently the evaluator is a functor so that it can maintain information like parameters and state as member variables. The overloaded call operator then has access to these for calculation. In the case of function pointers one could pass these in as arguments, but I'm afraid the signature of the function would differ across nodes. Do you have any thoughts on how to overcome this? – user1893603 Feb 27 '17 at 15:11
  • Trading a virtual function call for a function pointer or a `std::function` is unlikely to matter much, performance-wise. Ultimately, all are indirect calls, which stall the instruction prefetcher. This is correct: the whole point of the computational graph is to have a data-driven next instruction! Only code generation is fast, precisely because that turns data into code and does allow the CPU to prefetch. – MSalters Feb 27 '17 at 15:30
  • This is quite interesting. I would say that the fastest thing you can do here is perform some in-place code generation (e.g. generate and compile a C/C++ function for your graph). I know that Python can do that sometimes, but perhaps this is a bit too meta. – Ap31 Mar 01 '17 at 06:56
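
A minimal sketch of the non-inheritance, function-pointer approach Andrew Kashpur describes above, under the assumption that every operation can share one signature by carrying optional per-node state through a pointer (FlatNode, EvalFn, add_op and accumulate_op are illustrative names, not from the thread):

#include <vector>

// Hypothetical sketch: one function pointer per node instead of a vtable.
struct FlatNode;
using EvalFn = void (*)(FlatNode&);   // one fixed signature for every operation

struct FlatNode {
    double* inputs[2];   // pointers to upstream outputs
    double  output;
    void*   state;       // optional per-operation state (parameters, accumulators, ...)
    EvalFn  evaluate;    // assigned once, when the node is constructed
};

// Stateless operation: ignores the state pointer.
void add_op(FlatNode& n) {
    n.output = *n.inputs[0] + *n.inputs[1];
}

// Stateful operation: keeps a running sum in the per-node state object.
void accumulate_op(FlatNode& n) {
    double& sum = *static_cast<double*>(n.state);
    sum += *n.inputs[0];
    n.output = sum;
}

void run(std::vector<FlatNode>& order) {
    for (FlatNode& n : order)
        n.evaluate(n);   // still an indirect call, but no vtable lookup
}

As MSalters points out above, this only swaps vtable dispatch for a plain indirect call, so it does not by itself remove the prefetch/branch-prediction cost.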

1 Answer


As mentioned in the comments, you will need to know the graph at compile time to remove the virtual dispatch. To do that, you only need to use a std::tuple:

auto eval_vector = std::make_tuple(
    Node<Add>{ ... },
    Node<Add>{ ... },
    ...
);

Then, you only need to remove the virtual and override keywords, and remove the empty function from the base class.
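
For illustration, a sketch of the node types once the virtual machinery is stripped out, under my reading of that suggestion (RefNode keeps only the output member; NODE_INPUT_BUFFER_LENGTH is assumed to be defined as in the question):

// Sketch only: no virtual functions, so every Node<T>::evaluate() call
// made through the tuple is resolved statically.
struct RefNode {
    double output;
};

template<class T>
struct Node : RefNode {
    double* inputs[NODE_INPUT_BUFFER_LENGTH];
    T evaluator;
    void evaluate() { evaluator(inputs, output); }
};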

You will find that range-based for loops don't support tuples yet. To iterate over one, you will need a function like this:

#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Calls `function` on every element of the tuple, in order.
// The braced-init-list guarantees left-to-right evaluation; the trailing 0
// keeps the array non-empty when the tuple is empty.
template<typename T, typename F, std::size_t... S>
void for_tuple(std::index_sequence<S...>, T&& tuple, F&& function) {
    int unpack[] = {(static_cast<void>(
        function(std::get<S>(std::forward<T>(tuple)))
    ), 0)..., 0};
    static_cast<void>(unpack);
}

template<typename T, typename F>
void for_tuple(T&& tuple, F&& function) {
    constexpr std::size_t N = std::tuple_size<std::remove_reference_t<T>>::value;
    for_tuple(std::make_index_sequence<N>{}, std::forward<T>(tuple), std::forward<F>(function));
}

You can then iterate over your tuple like this:

for_tuple(eval_vector, [](auto&& node){
    node.evaluate();
});
  • Thanks Guillaume! Unfortunately the graph is not known at compile time. How is locality? Is iterating through a tuple going to thrash the cache? What is the overhead like when iterating through a tuple vs a contiguous block of pointers? – user1893603 Feb 27 '17 at 15:23
  • @user1893603 Ah, I see, then a function pointer might be your solution. As for iterating over tuples, it's really fast. The loop is actually unrolled at compile time (see `unpack` in `for_tuple`) and all values live contiguously on the stack. – Guillaume Racicot Feb 27 '17 at 15:28
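
As a final aside (my addition, not part of the thread): if a C++17 compiler is available, which the original answer does not assume, std::apply with a fold expression performs the same compile-time-unrolled tuple iteration without a hand-written for_tuple:

#include <tuple>

// C++17 alternative: expand the tuple's elements directly into a fold expression.
std::apply([](auto&&... nodes) {
    (nodes.evaluate(), ...);   // evaluates each node, left to right
}, eval_vector);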