
I'm using a DAG (directed acyclic graph) to represent and evaluate expressions; each node represents an operation (+, -, /, *, accumulate, etc.) and the entire expression is evaluated by sequentially evaluating each node in topologically sorted order. Each node inherits from a base class RefNode and implements a virtual function, evaluate, according to the operator it represents. The Node class is templated on a functor that represents the operator. The node evaluation order is maintained in a vector<RefNode*>, with ->evaluate() calls made on each element.

Some quick profiling shows that a virtual evaluate slows down an addition node by a factor of 2x [1], either from the call overhead or from thrashing the branch predictor.

As a first step, I encoded the type information as an integer and used static_cast accordingly. This did help, but it's clunky and I'd rather not jump around in the hot portion of my code.
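For context, a minimal sketch of what that tag-based experiment might have looked like; the enum, the kind field and the AddNode name are hypothetical, not the actual code:

// Hypothetical sketch: dispatch on an integer tag instead of a vtable.
enum NodeKind { KIND_ADD, KIND_MUL /*, ... */ };

struct TaggedNode {
    NodeKind kind;   // type tag, written once at construction
    double   output;
};

struct AddNode : TaggedNode {
    double* inputs[2];
};

inline void evaluate(TaggedNode* n) {
    switch (n->kind) {                 // branch on the tag instead of a virtual call
    case KIND_ADD: {
        AddNode* a = static_cast<AddNode*>(n);
        a->output = *a->inputs[0] + *a->inputs[1];
        break;
    }
    default:
        break;                         // other operations elided
    }
}

The current, virtual-dispatch version of the code follows.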

struct RefNode {
    double output;
    // Virtual hook overridden by every concrete node type.
    inline virtual void evaluate() {}
};

// NODE_INPUT_BUFFER_LENGTH is assumed to be defined elsewhere.
template<class T>
struct Node : RefNode {
    double* inputs[NODE_INPUT_BUFFER_LENGTH];
    T evaluator;
    inline void evaluate() override { evaluator(inputs, output); }
};

struct Add {
    inline void operator()(double** inputs, double& output)
    {
        output = *inputs[0] + *inputs[1];
    }
};

An evaluation may look like:

Node<Add>* node_1 = ...
Node<Add>* node_2 = ...
std::vector<RefNode*> eval_vector;

eval_vector.push_back(node_1);
eval_vector.push_back(node_2);

for (auto&& n : eval_vector) {
    n->evaluate();
}

I have the following questions, bearing in mind performance is critical:

  1. How can I avoid virtual functions in this situation?
  2. If I can't, how can I change the way I represent an expression graph so that it supports multiple operations, some of which must hold state, while avoiding virtual function calls?
  3. How do other frameworks such as Tensorflow/Theano represent computational graphs?

[1] A single addition operation on my system takes ~2.3 ns with virtual functions and ~1.1 ns without. While this is small, the computational graph consists mostly of addition nodes, so there is a good portion of time to be saved.

  • You will have to use templates. However, this requires the graph to be known at compile time. If the calculation graph is constructed at runtime (say, from user input) then I don't see a way to avoid virtual calls. – MadScientist Feb 27 '17 at 14:09
  • To avoid virtual functions you may use the curiously recurring template pattern (CRTP). – Andrew Kashpur Feb 27 '17 at 14:20
  • Thanks Andrew. I did have a look at CRTP but I struggled to see how I could use it in this case. Would you mind expanding a little? – user1893603 Feb 27 '17 at 14:26
  • How big is your graph? Is it known at compile time? Is the graph constant in consecutive calls? Can you simplify it automatically? Can you generate ASM code dynamically? What about other representations, e.g. a stack? There are so many questions and solutions. – knivil Feb 27 '17 at 14:36
  • Never mind CRTP, you can do this without inheritance: you'd have a class with the fields input, output, and evaluator. In the ctor you assign the proper function pointer to the evaluator field (you pass node_type as a parameter to the ctor); a sketch of this appears after these comments. – Andrew Kashpur Feb 27 '17 at 15:00
  • The largest graph is probably in the 1000s of nodes; it's not known at compile time, but once built it is static. The order of calls will remain the same, however a node will only be evaluated if its dependants change in value. I'm certainly open to other representations if you have any suggestions; performance is the main requirement. I can generate ASM on the fly but I'm afraid I'm not sure where to start. – user1893603 Feb 27 '17 at 15:02
  • @Andrew Kashpur Ah, that makes sense. Currently the evaluator is a functor so that it can maintain information like parameters and state as member variables. The overloaded call operator then has access to these for calculation. In the case of function pointers one could pass these in as arguments, but I'm afraid the signature of the function would differ across nodes. Do you have any thoughts on how to overcome this? – user1893603 Feb 27 '17 at 15:11
  • Trading a virtual function call for a function pointer or a `std::function` is unlikely to matter much, performance-wise. Ultimately, all are indirect calls, which stall the instruction prefetcher. This is correct: the whole point of the computational graph is to have a data-driven next instruction! Only code generation is fast, precisely because that turns data into code and does allow the CPU to prefetch. – MSalters Feb 27 '17 at 15:30
  • This is quite interesting. I would say that the fastest thing you can do here is perform some in-place code generation (e.g. generate and compile a C/C++ function for your graph). I know that Python can do that sometimes, but perhaps this is a bit too meta. – Ap31 Mar 01 '17 at 06:56
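
A minimal sketch of the non-inheritance, function-pointer approach Andrew Kashpur describes above, under the assumption that every operation can share one signature by carrying optional per-node state through a pointer (FlatNode, EvalFn, add_op and accumulate_op are illustrative names, not from the thread):

#include <vector>

// Hypothetical sketch: one function pointer per node instead of a vtable.
struct FlatNode;
using EvalFn = void (*)(FlatNode&);   // one fixed signature for every operation

struct FlatNode {
    double* inputs[2];   // pointers to upstream outputs
    double  output;
    void*   state;       // optional per-operation state (parameters, accumulators, ...)
    EvalFn  evaluate;    // assigned once, when the node is constructed
};

// Stateless operation: ignores the state pointer.
void add_op(FlatNode& n) {
    n.output = *n.inputs[0] + *n.inputs[1];
}

// Stateful operation: keeps a running sum in the per-node state object.
void accumulate_op(FlatNode& n) {
    double& sum = *static_cast<double*>(n.state);
    sum += *n.inputs[0];
    n.output = sum;
}

void run(std::vector<FlatNode>& order) {
    for (FlatNode& n : order)
        n.evaluate(n);   // still an indirect call, but no vtable lookup
}

As MSalters points out above, this only swaps vtable dispatch for a plain indirect call, so it does not by itself remove the prefetch/branch-prediction cost.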

1 Answer


As mentioned in the comments, you will need to know the graph at compile time to remove the virtual dispatch. To do that, you only need to use a std::tuple:

auto eval_vector = std::make_tuple(
    Node<Add>{ ... },
    Node<Add>{ ... },
    ...
);

Then, you only need to remove the virtual and override keywords, and remove the empty function from the base class.
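
For illustration, a sketch of the node types once the virtual machinery is stripped out, under my reading of that suggestion (RefNode keeps only the output member; NODE_INPUT_BUFFER_LENGTH is assumed to be defined as in the question):

// Sketch only: no virtual functions, so every Node<T>::evaluate() call
// made through the tuple is resolved statically.
struct RefNode {
    double output;
};

template<class T>
struct Node : RefNode {
    double* inputs[NODE_INPUT_BUFFER_LENGTH];
    T evaluator;
    void evaluate() { evaluator(inputs, output); }
};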

You will find that range-based for loops don't support tuples yet. To iterate over one, you will need a function like this:

#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Calls `function` on every element of the tuple, in order.
// The braced-init-list guarantees left-to-right evaluation; the trailing 0
// keeps the array non-empty when the tuple is empty.
template<typename T, typename F, std::size_t... S>
void for_tuple(std::index_sequence<S...>, T&& tuple, F&& function) {
    int unpack[] = {(static_cast<void>(
        function(std::get<S>(std::forward<T>(tuple)))
    ), 0)..., 0};
    static_cast<void>(unpack);
}

template<typename T, typename F>
void for_tuple(T&& tuple, F&& function) {
    constexpr std::size_t N = std::tuple_size<std::remove_reference_t<T>>::value;
    for_tuple(std::make_index_sequence<N>{}, std::forward<T>(tuple), std::forward<F>(function));
}

You can then iterate over your tuple like this:

for_tuple(eval_vector, [](auto&& node){
    node.evaluate();
});
  • Thanks Guillaume! Unfortunately the graph is not known at compile time. How is locality? Is iterating through a tuple going to thrash the cache? What is the overhead like when iterating through a tuple vs a contiguous block of pointers? – user1893603 Feb 27 '17 at 15:23
  • @user1893603 Ah, I see, then a function pointer might be your solution. As for iterating over tuples, it's really fast. The loop is actually unrolled at compile time (see `unpack` in `for_tuple`) and all values live contiguously on the stack. – Guillaume Racicot Feb 27 '17 at 15:28
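
As a final aside (my addition, not part of the thread): if a C++17 compiler is available, which the original answer does not assume, std::apply with a fold expression performs the same compile-time-unrolled tuple iteration without a hand-written for_tuple:

#include <tuple>

// C++17 alternative: expand the tuple's elements directly into a fold expression.
std::apply([](auto&&... nodes) {
    (nodes.evaluate(), ...);   // evaluates each node, left to right
}, eval_vector);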