
I am trying to measure the execution time of a dot product, but I get different results depending on the variable used to store the final result: when I accumulate into an `int` variable the measured time is 0 ms, but when I accumulate into an array element the time is much higher.

Could this be related to the compiler being able to vectorize the loop when an `int` variable is used?

Here is my code:

```cpp
#include <cstdio>
#include <ctime>
#include <iostream>

using namespace std;

int main(int argc, char* argv[])
{
    // 2,000,000,000 ints: about 8 GB with a 4-byte int
    int* a = new int[2000000000];
    for (unsigned long long i = 0; i < 2000000000; i++)
        a[i] = 1;

    // Loop 1: accumulate into a local int
    clock_t t1 = clock();
    int nResult = 0;
    for (unsigned long long i = 0; i < 2000000000; i++)
        nResult += a[i] * a[i];
    clock_t t2 = clock();
    cout << "Execution time = " << (int)(1000 * ((t2 - t1) / (double)CLOCKS_PER_SEC)) << " ms" << endl;

    // Loop 2: accumulate into an array element
    t1 = clock();
    int b[1] = {0};
    for (unsigned long long i = 0; i < 2000000000; i++)
        b[0] += a[i] * a[i];
    t2 = clock();
    cout << "Execution time = " << (int)(1000 * ((t2 - t1) / (double)CLOCKS_PER_SEC)) << " ms" << endl;

    delete[] a;

    getchar();  // keep the console window open

    return 0;
}
```

And here is the output:

```
Execution time = 0 ms
Execution time = 702 ms
```

Thanks in advance for your help

jvknc
    The compiler will almost certainly optimize the second for loop straight down to `nResult = 2000000000` if you let it. – Zinki Jan 12 '18 at 09:06
  • Do you compile with optimization enabled? Probably the entire first loop is optimized to a no-op since you don't use the result. Making it `volatile` might change that. *"vectorization?"* Check the assembly and see? – HolyBlackCat Jan 12 '18 at 09:09
  • As you said, the first loop is optimized to a no-op since nResult is not used afterwards. When I use it, the times of the two loops are similar. Many thanks – jvknc Jan 12 '18 at 09:49
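Following up on the `volatile` suggestion in the comments, here is a minimal sketch that keeps the loop alive without making every iteration touch a `volatile` object. The names `g_sink` and `sum_of_squares_into_sink` are hypothetical. The accumulator stays an ordinary local, so the optimizer can still unroll or vectorize the loop. Only the final value is written to a `volatile` variable. Declaring `nResult` itself `volatile` would also keep the loop, but it would force a memory read and write on every iteration, which measures something different from the original code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical global sink: a store to a volatile object is an observable
// side effect, so the compiler cannot drop the computation that produces it.
volatile long long g_sink = 0;

// Hypothetical helper. The accumulator is an ordinary local variable, so the
// compiler stays free to unroll or vectorize the loop; only the final value
// is written to the volatile sink, once.
void sum_of_squares_into_sink(const std::vector<int>& a) {
    long long result = 0;  // 64-bit to avoid overflow on large inputs
    for (std::size_t i = 0; i < a.size(); ++i)
        result += static_cast<long long>(a[i]) * a[i];
    g_sink = result;  // single volatile store after the loop
}
```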

1 Answer


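I don't think this has anything to do with vectorization. Even with SIMD instructions, you wouldn't get down to a flat 0 ms, because the loop still has to read 2,000,000,000 `int`s (about 8 GB with a 4-byte `int`) from memory.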
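A rough lower bound makes this concrete. Assume a 4-byte `int` and, purely for illustration, a sustained memory bandwidth of about 20 GB/s (the real figure depends on the machine). Under those assumptions, just streaming the array through the CPU once takes:

```latex
t_{\min} \approx \frac{N \cdot \mathrm{sizeof(int)}}{B}
         = \frac{2 \times 10^{9} \cdot 4\,\mathrm{B}}{20 \times 10^{9}\,\mathrm{B/s}}
         = 0.4\,\mathrm{s} = 400\,\mathrm{ms}
```

That is the same order of magnitude as the 702 ms measured for the second loop, and nowhere near 0 ms.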

What seems to be happening instead is that the loop was removed completely. You never use the value of nResult. Even if you did, the optimizer could work out the value (every element is 1, so the sum is 2000000000) and simply store it in the variable at compile time.
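Here is a minimal sketch that addresses both points. The function name `timed_dot` is hypothetical. It assumes the fill value comes from something the compiler cannot see at build time, for example `argc` passed in from `main`. Because the result is printed, the loop is not dead code. Because the input is not a compile-time constant, precomputing the sum becomes much harder for the optimizer.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical helper: runs one timed dot product and prints both the result
// and the time. `fill` is meant to come from run-time data (e.g. argc), so the
// optimizer cannot simply fold the whole loop into a constant.
long long timed_dot(std::size_t n, int fill) {
    std::vector<int> a(n, fill);

    const auto t1 = std::chrono::steady_clock::now();
    long long result = 0;  // 64-bit accumulator avoids int overflow
    for (std::size_t i = 0; i < n; ++i)
        result += static_cast<long long>(a[i]) * a[i];
    const auto t2 = std::chrono::steady_clock::now();

    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    // Using `result` here is what keeps the loop from being removed.
    std::cout << "result = " << result << ", execution time = " << ms << " ms\n";
    return result;
}
```

From `main`, a call such as `timed_dot(100000000, argc)` would do. With no command-line arguments `argc` is usually 1, which matches the all-ones array in the question.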

Benchmarking is counter-intuitive. You need to stop the compiler from optimizing away the work you want to measure, while still benchmarking the optimized code that would run in a normal program.
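One common way to strike that balance is to run the optimized code several times, keep the fastest time, and hand the computed value back to the caller so it stays observable. The sketch below assumes C++11 `<chrono>`, and `best_of` is a hypothetical name, not part of any library. It does not stop the compiler from moving work across the clock calls. Tools such as Google Benchmark's `benchmark::DoNotOptimize` and `benchmark::ClobberMemory` exist to handle that part.

```cpp
#include <algorithm>
#include <chrono>
#include <utility>

// Hypothetical helper: calls `f` `reps` times (assumes reps >= 1 and that `f`
// returns by value) and returns the fastest run in milliseconds together with
// the value produced by the last run. Returning the value lets the caller
// print or check it, so the measured work is not dead code.
template <class F>
auto best_of(F f, int reps) -> std::pair<double, decltype(f())> {
    double best_ms = 0.0;
    decltype(f()) value{};
    for (int r = 0; r < reps; ++r) {
        const auto t1 = std::chrono::steady_clock::now();
        value = f();
        const auto t2 = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t2 - t1).count();
        best_ms = (r == 0) ? ms : std::min(best_ms, ms);
    }
    return {best_ms, value};
}
```

For example, wrap the dot product in a lambda that returns the sum. Pass it to `best_of` together with a small repetition count such as 5. Then print both `.first` (the time) and `.second` (the sum).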

You might want to watch this talk; it's really good at explaining how to benchmark code properly: https://youtu.be/nXaxk27zwlk

AliaumeM
  • Also, note that Clang removes **both** loops even at the first optimization level: https://godbolt.org/g/5yf2Uc – AliaumeM Jan 12 '18 at 09:29
  • You are right: in the case of the integer variable, the loop is removed by the optimizer since nResult is not used. Just using nResult in the code makes the times similar in both cases. Many thanks for your help, mystery solved :) – jvknc Jan 12 '18 at 09:45