I used std::chrono::high_resolution_clock to measure std::lower_bound execution time. Here is my test code:

#include <iostream>
#include <algorithm>
#include <vector>
#include <chrono>
#include <random>

const long SIZE = 1000000;

using namespace std::chrono;
using namespace std;

int sum_stl(const std::vector<double>& array, const std::vector<double>& vals) 
{
  long temp;
  auto t0 = high_resolution_clock::now();  
  for(const auto& val : vals) {
    temp += lower_bound(array.begin(),array.end(),val) - array.begin();
  }
  auto t1 = high_resolution_clock::now();
  cout << duration_cast<duration<double>>(t1-t0).count()/vals.size() 
       << endl;
  return temp;
}

int main() {
  const int N = 1000;
  vector<double> array(N);

  auto&& seed = high_resolution_clock::now().time_since_epoch().count();
  mt19937 rng(move(seed));
  uniform_real_distribution<float> r_dist(0.f,1.f);

  generate(array.begin(),array.end(),[&](){return r_dist(rng);});
  sort(array.begin(), array.end());

  vector<double> vals;
  for(int i = 0; i < SIZE; ++i) {
    vals.push_back(r_dist(rng));
  }

  int index = sum_stl(array, vals);

  return 0;
}

array is a sorted vector of 1000 uniform random numbers, and vals has a size of 1 million. At first I put the timer inside the loop to measure every single std::lower_bound call; the result was around 1.4e-7 seconds. I then tested other operations such as +, -, sqrt, and exp, but they all gave the same timing as std::lower_bound.

In an earlier question, "resolution of std::chrono::high_resolution_clock doesn't correspond to measurements", it was said that the chrono resolution might not be enough to represent a duration shorter than 100 nanoseconds. So I put the timer around the whole loop and computed an average by dividing by the number of iterations. Here is the output:

1.343e-14

Something must be wrong, since this gives a duration even shorter than a CPU cycle, but I just can't figure out what.

To make the question more general: how can I accurately measure the execution time of a short function?
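A common pattern for measuring something this short is to time the whole loop with a steady clock, keep a result the compiler cannot discard, and only then divide by the repetition count. Here is a minimal sketch of that idea (the helper name time_per_call, the volatile sink, and the work callable are my own, not part of the code above):

#include <chrono>
#include <iostream>

volatile long sink;  // writing the accumulated result here is an observable side effect

template <class F>
double time_per_call(F work, long reps)
{
  using clock = std::chrono::steady_clock;
  long acc = 0;                          // initialized accumulator, no undefined behaviour
  auto t0 = clock::now();
  for (long i = 0; i < reps; ++i)
    acc += work(i);                      // work(i) should return something cheap to add
  auto t1 = clock::now();
  sink = acc;                            // keeps the loop from being optimized away
  std::chrono::duration<double> total = t1 - t0;
  std::cout << "total: " << total.count() << " s" << std::endl;
  return total.count() / reps;           // average seconds per call
}

A lambda that calls std::lower_bound on the test data can be passed as work; checking that the reported total grows roughly linearly with reps is a quick sanity test that the measurement is real.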

saclayer
  • Well, one possibility would be that you're calling `sum_stl` in a way that lets the compiler see the result is unused, and optimise away the whole loop. In order to either rule out problems such as that one, or make them apparent in the question, please provide a minimal complete program, not a standalone function. – Oct 06 '15 at 08:08
  • I just updated the full code. By the way, I used the `-O2` compile option; would that lead to the optimization you mentioned? – saclayer Oct 06 '15 at 09:05
  • If that had been the problem, then printing `sum_stl`'s result would make a big difference. Modifying your code to print `sum_stl`'s result, I don't see a big difference, so the problem must be somewhere else. Your question as it now stands looks like a good one; your results are reproducible. I don't immediately see any obvious problems in your test. Even if there is a major issue in your test, that only means your question would be very easy for someone else to answer. :) – Oct 06 '15 at 13:18
  • Step 1: print out total time, not average time, and then ensure that your total time is proportional to `vals.size()`. If the size goes up by 10x, so should the total time. If it doesn't, you've got a problem to solve. Otherwise, you're getting a real measurement. If you bump total time up enough, you'll eliminate questions about clock precision. – Howard Hinnant Oct 06 '15 at 15:27
  • The program exhibits undefined behavior as it reads the value of an uninitialized variable. Here: `temp += lower_bound(...)` – Igor Tandetnik Oct 06 '15 at 19:16
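
Putting the comments above together, a sketch of a revised sum_stl (not the code originally posted) would initialize the accumulator, keep the timer around the whole loop, and report the total time alongside the average, while main() prints the returned value so the work cannot be optimized away:

long sum_stl(const std::vector<double>& array, const std::vector<double>& vals)
{
  long temp = 0;                                    // initialized: removes the undefined behaviour
  auto t0 = std::chrono::high_resolution_clock::now();
  for (const auto& val : vals)
    temp += std::lower_bound(array.begin(), array.end(), val) - array.begin();
  auto t1 = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> total = t1 - t0;
  std::cout << "total: " << total.count() << " s, per call: "
            << total.count() / vals.size() << " s" << std::endl;
  return temp;                                      // main() should print or otherwise use this
}

With that in place, increasing vals.size() by a factor of 10 should increase the total time by roughly the same factor, as suggested above; if it does not, the loop is probably still being discarded by the optimizer.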

0 Answers