Profiling Expression Template

Question

I'm trying to profile an expression template similar to the one in the book "C++ Templates" by David Vandevoorde. Below is my theoretical analysis, which is probably wrong because the test shows unexpected results. Suppose the test is:

R = A + B + C;

where A, B, C, R are arrays allocated on the heap. Each array has 2 elements, so the following will be executed:

R[0] = A[0] + B[0] + C[0]; // 3 loads + 2 additions + 1 store
R[1] = A[1] + B[1] + C[1];

with approximately 12 instructions (6 for each).

Now, if the expression template (shown at the very bottom) is enabled, then after type deduction is done at compile time, the following is processed at run time before the same evaluation as above is performed:

A + B --> expression 1 // copy references to A & B 
expression 1 + C --> expression 2 // copy the copies of references to A & B
                                  // + copy reference to C

Therefore, there are 2+3=5 instructions in total before the evaluation, which is about 5/(5+12)≈30% of the total instructions. So I should be able to see this overhead, especially when the vector size is small.
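
The two construction steps can be sketched as follows. This is an illustration under assumptions, not the code that was profiled: `Array`, `Stored`, `Add` and `assign_sum` are hypothetical names, and `Stored` stands in for the `traits` helper shown at the bottom.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the heap arrays A, B, C, R.
struct Array {
    double* data;
    std::size_t n;
    double operator[](std::size_t i) const { return data[i]; }
};

// Hypothetical storage rule: arrays by reference, expression nodes by value.
template <typename T> struct Stored { typedef T type; };
template <> struct Stored<Array> { typedef const Array& type; };

// Node representing "l + r"; it owns no element storage.
template <typename L, typename R>
struct Add {
    typename Stored<L>::type l;
    typename Stored<R>::type r;
    Add(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

inline Add<Array, Array> operator+(const Array& a, const Array& b) {
    return Add<Array, Array>(a, b);
}

template <typename L, typename R>
Add<Add<L, R>, Array> operator+(const Add<L, R>& e, const Array& c) {
    return Add<Add<L, R>, Array>(e, c);
}

// R = A + B + C, spelled out step by step.
inline void assign_sum(Array& R, const Array& A, const Array& B, const Array& C) {
    assert(A.n == R.n && B.n == R.n && C.n == R.n);
    Add<Array, Array> e1 = A + B;               // expression 1: references to A and B
    Add<Add<Array, Array>, Array> e2 = e1 + C;  // expression 2: copy of e1 + reference to C
    for (std::size_t i = 0; i < R.n; ++i)
        R.data[i] = e2[i];                      // expands to A[i] + B[i] + C[i]
}
```

Every function here is small and fully visible to the compiler, so an optimizer is free to inline the constructors and keep the references in registers. Whether it actually does so depends on the compiler and the flags used.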

But the results show that the cost of the two is nearly the same. I ran the test for 1E+09 iterations. The assembly code for the two is the same, of course, but I couldn't find any part of it that corresponds to this "construction" step and costs any time or instructions.

movsdq  (%r9,%rax,8), %xmm0
addsdq  (%r8,%rax,8), %xmm0
addsdq  (%rdi,%rax,8), %xmm0
movsdq  %xmm0, (%rcx,%rax,8)

I don't have a strong CS background, so this question may be a stupid one, but I've been scratching my head over it for days. Any help is appreciated!

--- My expression template ---

template< typename Left, typename Right >
class V_p_W // stands for V+W
{
public:
   typedef typename array_type::value_type             value_type;
   typedef double                                      S_type;
   typedef typename traits< Left >::type               V_type;
   typedef typename traits< Right >::type              W_type;

   V_p_W ( const Left& _v, const Right& _w ) : V(_v), W(_w)
   {}

   inline value_type operator [] ( std::size_t i )        { return V[i] + W[i]; }
   inline value_type operator [] ( std::size_t i ) const  { return V[i] + W[i]; }
   inline std::size_t size () const                       { return V.size();  }

private:
   V_type V;
   W_type W;
};

where traits does nothing but decide whether the object should be stored by value or by reference. For example, an integer is copied, but an array is held by reference.
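
A minimal sketch of what such a `traits` helper might look like, assuming C++11 and assuming that "integer" generalizes to any arithmetic type. The real helper is not shown in the question, so this is only one plausible reconstruction:

```cpp
#include <type_traits>
#include <vector>

// Hypothetical reconstruction: arithmetic operands are stored by value,
// everything else is stored as a const reference.
template <typename T>
struct traits {
    typedef typename std::conditional<std::is_arithmetic<T>::value,
                                      T, const T&>::type type;
};

static_assert(std::is_same<traits<int>::type, int>::value,
              "an integer operand is copied");
static_assert(std::is_same<traits<std::vector<double> >::type,
                           const std::vector<double>&>::value,
              "an array operand is held by reference");
```

Under this particular rule a nested expression node would also be held by reference, which is only safe while the temporary it refers to is alive, that is, until the end of the full expression.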

I may be missing something, but what is "the" expression template you are referring to? It seems like you are assuming some library, but unless we know *which* one, any speculation is completely arbitrary. Nevertheless, if you assign the expression template to an actual value (`R` is supposed to be an array on the heap), it must of course be resolved... — danielschemmel, Jan 29 '15 at 18:01
@gha.st I mean the classic expression template, similar to the one in the book "C++ Templates". I'll update my question. Thanks! BTW, what do you mean by "resolved"? — user3156285, Jan 29 '15 at 18:07
The operations need to be actually performed. To get `R[0]` right, you *need* to perform `A[0] + B[0] + C[0]` and if you only have 2-operand addition, then two adds are the only reasonable way to do it! And it quite frankly does not matter the least *how* those additions come to take place. — danielschemmel, Jan 29 '15 at 18:10
Nevertheless, if you want any sensible answer, you need to give a complete example, that other people can actually compile and look at themselves. — danielschemmel, Jan 29 '15 at 18:12
Showing the assembly code in both situations may also help in understanding what is going on. Your presumptions about what expression templates are doing may not quite be right. — Suedocode, Jan 29 '15 at 18:21
@Aggieboy The assembly code is identical for the two. Basically "mov -> add -> add -> mov". I couldn't see this "construction" process in the assembly, as if it didn't happen at all. But it should happen at runtime, right? That's why I'm confused. — user3156285, Jan 29 '15 at 18:32
OMG you mean THE classic expression template? Never heard of it. — Captain Obvlious, Jan 29 '15 at 18:33
What compiler optimizations were used? Do you see a difference with `-O0`? I think the expectation here is allocation of temporary vectors that result from `operator+`, but you are claiming that it's not being seen in the non-expression template version. I suspect the problem may be related to [return value optimization](http://en.wikipedia.org/wiki/Return_value_optimization), but other than that this example may be too simple to utilize the compile-time benefits. — Suedocode, Jan 29 '15 at 18:45

Suedocode · Answer 1 · 2015-02-04T15:43:52.467

My home-brewed test. Ideally, expression templates save the extra allocations required by temporary vectors in the naive case.

expr.cpp:

#include <vector>
#include <stdlib.h>
#include <iostream>
#include <ctime>

using namespace std;

typedef vector<int> Valarray;

template<typename L, typename R>
struct BinOpPlus {
  const L& left;
  const R& right;

  BinOpPlus(const L& l, const R& r)
    : left(l), right(r)
  {}

  int operator[](int i) const { return left[i] + right[i]; }
};

template<typename L, typename R>
BinOpPlus<L, R> operator+(const L& left, const R& right){
  return BinOpPlus<L, R>(left, right);
}

int main() {
  int size = 10000000;
  Valarray v[3];
  for(int n=0; n<3; ++n){
    for(int i=0; i<size; ++i){
      int val = rand() % 100;
      v[n].push_back(val);
    }
  }

  std::clock_t start = std::clock();
  // tmp must be a named object: out stores a reference to it.
  auto tmp = v[0] + v[1];
  auto out = tmp + v[2];

  int sum = 0;
  for(int i=0; i<size; ++i)
    sum += out[i];

  std::clock_t stop = std::clock();
  cout << "Sum: " << sum << endl;
  cout << "Time: " << (stop-start) << endl;
  return 0;
}

vala.cpp:

#include <vector>
#include <stdlib.h>
#include <iostream>
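#include <ctime>
#include <utility>  // std::move

using namespace std;

class Valarray : public vector<int> {
  public:
    // Element-wise sum into a freshly allocated vector (one allocation per +).
    Valarray operator+(const Valarray& r) const {
      Valarray out;
      out.reserve(r.size());
      for(size_t i=0; i<r.size(); ++i)
        out.push_back((*this)[i] + r[i]);
      return out;
    }

    // Reuses the storage of a temporary right-hand operand. Note that in
    // v[0] + v[1] + v[2] the temporary is the left operand, so this
    // overload is not selected there.
    Valarray operator+(Valarray&& r) const {
      for(size_t i=0; i<r.size(); ++i)
        r[i] = (*this)[i] + r[i];
      return std::move(r);  // r is a named lvalue here; a plain "return r" copies it in C++11
    }
};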

int main() {
  int size = 10000000;
  Valarray v[3];
  for(int n=0; n<3; ++n){
    for(int i=0; i<size; ++i){
      int val = rand() % 100;
      v[n].push_back(val);
    }
  }

  std::clock_t start = std::clock();
  Valarray out = v[0] + v[1] + v[2];

  int sum = 0;
  for(int i=0; i<size; ++i)
    sum += out[i];

  std::clock_t stop = std::clock();
  cout << "Sum: " << sum << endl;
  cout << "Time: " << (stop-start) << endl;
  return 0;
}

Command line:

g++ -Wfatal-errors -std=c++11 -Wall -Werror vala.cpp -o vala
g++ -Wfatal-errors -std=c++11 -Wall -Werror expr.cpp -o expr
~/example$ ./vala
Sum: 1485274472
Time: 680000
~/example$ ./expr
Sum: 1485274472
Time: 130000

With optimizations:

g++ -Wfatal-errors -std=c++11 -Wall -Werror vala.cpp -o vala -O3
g++ -Wfatal-errors -std=c++11 -Wall -Werror expr.cpp -o expr -O3
na:~/example$ ./vala
Sum: 1485274472
Time: 290000
na:~/example$ ./expr
Sum: 1485274472
Time: 10000

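A massive improvement with expression templates, because they avoid the extra vector allocations: `expr` never materializes `v[0] + v[1]` or the final result as a vector, and instead computes each element on the fly inside the summing loop.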
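
To bring the comparison closer to `R = A + B + C`, the expression can also be evaluated into a preallocated result vector in a single pass. The following is a rough sketch, not part of the measured code: `Vec`, `Sum`, `evaluate` and `add3` are hypothetical names, and the `operator+` overloads are deliberately restricted to these types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<int> Vec;

// Hypothetical node for "left + right"; holds references, owns no storage.
template <typename L, typename R>
struct Sum {
    const L& left;
    const R& right;
    Sum(const L& l, const R& r) : left(l), right(r) {}
    int operator[](std::size_t i) const { return left[i] + right[i]; }
};

// Restricted overloads so that operator+ does not match unrelated types.
inline Sum<Vec, Vec> operator+(const Vec& l, const Vec& r) {
    return Sum<Vec, Vec>(l, r);
}

template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& l, const Vec& r) {
    return Sum<Sum<L, R>, Vec>(l, r);
}

// Evaluates any indexable expression into out in one loop.
template <typename Expr>
void evaluate(Vec& out, const Expr& e) {
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = e[i];
}

// out = a + b + c with no temporary vectors.
inline void add3(Vec& out, const Vec& a, const Vec& b, const Vec& c) {
    assert(a.size() == out.size() && b.size() == out.size() && c.size() == out.size());
    // The temporary node for a + b lives until the end of this statement,
    // so the reference held by the outer node stays valid during evaluate().
    evaluate(out, a + b + c);
}
```

Storing `a + b + c` in an `auto` variable and evaluating it in a later statement would leave the outer node with a dangling reference to the inner one, which is why `expr.cpp` above keeps `tmp` as a named object.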
