
I am studying memory in C++, but there is one thing I am unsure about. I am trying two different ways of adding three arrays element by element. In the first, I access one index at a time and increment `i` by 1. In the second, I access ten indices at a time and increment `i` by 10. With 40 million elements, I expected the manually unrolled version to reduce the execution time, but both results are the same. I would like to know why.

#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

using namespace std;
using namespace chrono;

void printVector(vector<int>& vect);

int main(int argc, char const *argv[])
{
   int n = 40000000;
   vector<int> a(n);
   vector<int> b(n);
   vector<int> c(n);

   srand((unsigned) time(0)); 

   for (int i = 0; i < n; ++i)
   {
      a[i] = ((rand() % 100) + 1);
      b[i] = ((rand() % 100) + 1);
      c[i] = ((rand() % 100) + 1);
   }

   // printVector(a);
   // printVector(b);
   // printVector(c);

   // Version 1: one element per iteration
   auto start = steady_clock::now();
   vector<int> vect1(n);
   for (int i = 0; i < n; i++) {
      vect1[i] = a[i] + b[i] + c[i];
   }
   // printVector(vect1);
   auto end = steady_clock::now();
   cout << duration_cast<milliseconds>(end - start).count() << " milliseconds" << endl;

   // Version 2: manually unrolled, ten elements per iteration
   start = steady_clock::now();
   vector<int> vect2(n);
   for (int i = 0; i < n; i += 10) {
      vect2[i] = a[i] + b[i] + c[i];
      vect2[i+1] = a[i+1] + b[i+1] + c[i+1];
      vect2[i+2] = a[i+2] + b[i+2] + c[i+2];
      vect2[i+3] = a[i+3] + b[i+3] + c[i+3];
      vect2[i+4] = a[i+4] + b[i+4] + c[i+4];
      vect2[i+5] = a[i+5] + b[i+5] + c[i+5];
      vect2[i+6] = a[i+6] + b[i+6] + c[i+6];
      vect2[i+7] = a[i+7] + b[i+7] + c[i+7];
      vect2[i+8] = a[i+8] + b[i+8] + c[i+8];
      vect2[i+9] = a[i+9] + b[i+9] + c[i+9];
   }
   // printVector(vect2);
   end = steady_clock::now();
   cout << duration_cast<milliseconds>(end - start).count() << " milliseconds" << endl;
   return 0;
}

void printVector(vector<int>& vect) {
    cout << "Vector elements: " << endl;
    for (size_t i = 0; i < vect.size(); i++) {
       cout << vect[i] << " "; 
    }
    cout << endl;
}
Liu Bei
  • How do you compile your code? What level of optimizations? – Yksisarvinen Aug 02 '22 at 19:42
  • You talk about C/C++. Since when are namespaces in the C part of C/C++? FYI, there is no C/C++ language. – Thomas Matthews Aug 02 '22 at 19:42
  • Swap the order of the loops so that the cache effect is reversed. – lorro Aug 02 '22 at 19:43
  • If you want performance, use an array of `struct` instead of parallel arrays. With parallel arrays, the processor may need to reload the cache because `b[0]` may follow `a[n]`, whereas with the struct, `b[0]` will follow `a[0]` (a layout sketch follows after these comments). – Thomas Matthews Aug 02 '22 at 19:44
  • You are running through N elements in both codes, so both should have about the same performance. Even though the unrolled loop does only `N/10` end-condition checks, the branch predictor basically makes that meaningless. If you have compiler optimizations turned on, the compiler may even optimize both loops to the same code. – NathanOliver Aug 02 '22 at 19:44
  • Hello, I have edited my question. The first loop runs O(n) and the second one runs O(n/10), based on my understanding. But when I measure the time in milliseconds, I find that the performance of the two loops is the same or very similar with three 40-million-element vectors. – Liu Bei Aug 02 '22 at 19:46
  • The processor may have enough room to cache the instructions of your loop. Also the branch prediction may be simple. Also, if your iteration count is small, the difference between unrolled and regular loops may be insignificant (or difficult to measure). – Thomas Matthews Aug 02 '22 at 19:46
  • What *measurable* performance problem are you having? – tadman Aug 02 '22 at 19:47
  • O(n) and O(n/10) are the same. You are misusing (or misunderstanding) the big-O notation. – john Aug 02 '22 at 19:48
  • Both your loops are actually O(n), since `n` operations must be performed. `n` does not refer to the number of loop iterations, it's the number of operations. Otherwise, you would get rid of loops all the time, and always have O(1). – ChrisMM Aug 02 '22 at 19:49
  • So unrolling the loop by 10 does not help the performance... Even unrolling by 1000 or 10000 at a time doesn't help either, right? – Liu Bei Aug 02 '22 at 19:54
  • The truth is in the assembly language. Print the assembly language for both loops and compare them. Check if the compiler is emitting any special processor instructions (such as SIMD) or emitting them for parallel execution. – Thomas Matthews Aug 02 '22 at 19:54
  • @LiuBei If you turn on any compiler optimizations, the compiler will be smarter than you and me. These kinds of exercises are moot once the code is compiled with optimizations. – Captain Giraffe Aug 02 '22 at 19:58
  • Ohh.. one thing I forgot to tell you all. I use only `g++ -g test.cpp` and `./a.out` in the terminal and my OS is Linux. I am not sure if there is any compiler optimization... – Liu Bei Aug 02 '22 at 20:01
  • There are no optimizations there. Use `-O3`. – ChrisMM Aug 02 '22 at 20:03
  • Performance testing is not simple. A test of N=1 is not representative. You should learn to use a benchmark framework like [Google Benchmark](https://github.com/google/benchmark) (a sketch follows after these comments). – JHBonarius Aug 02 '22 at 20:06
  • The point of this is just I would like to know why loop unrolling does not help the performance... I do not want to optimize anything... I will try running Assembly language as Thomas suggested. I may see the answer to my "why". – Liu Bei Aug 02 '22 at 20:08
  • @LiuBei Thomas was not suggesting that you write this in assembly (that would likely perform even worse), but rather that you check out the compiled code on https://godbolt.org/z/nd5Mjza5M. – Captain Giraffe Aug 02 '22 at 20:48
  • Maybe the reason both ways take the same time is that both cases are equivalent to doing nothing. The vectors aren't used after the loop and the compiler can eliminate the whole loop and the vector. So you are likely to measure just how fast you can read the clock. – Goswin von Brederlow Aug 03 '22 at 17:50
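
As a concrete illustration of Thomas Matthews' array-of-structs suggestion above, here is a minimal sketch. The struct name `Triple` and the function `sum_rows` are invented for the example; the point is only that the three values added in one iteration sit next to each other in memory, so a single cache line covers all of them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical array-of-structs layout: one record per index instead of
// three parallel vectors, so a[i], b[i] and c[i] become fields of one element.
struct Triple {
    int a, b, c;
};

void sum_rows(const std::vector<Triple>& rows, std::vector<int>& out) {
    for (std::size_t i = 0; i < rows.size(); ++i)
        out[i] = rows[i].a + rows[i].b + rows[i].c;  // one contiguous load per element
}
```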
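
And here is a rough sketch of what measuring the simple loop with Google Benchmark could look like, as JHBonarius suggests. The benchmark name and element count are placeholders, and `benchmark::DoNotOptimize` keeps the compiler from discarding the result (the dead-code issue Goswin von Brederlow raises above).

```cpp
#include <benchmark/benchmark.h>
#include <vector>

// Hypothetical benchmark for the simple (non-unrolled) loop from the question.
static void BM_SimpleSum(benchmark::State& state) {
    const std::size_t n = static_cast<std::size_t>(state.range(0));
    std::vector<int> a(n, 1), b(n, 2), c(n, 3), out(n);
    for (auto _ : state) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a[i] + b[i] + c[i];
        benchmark::DoNotOptimize(out.data());  // keep the result observable
        benchmark::ClobberMemory();            // force the stores to be materialised
    }
}
BENCHMARK(BM_SimpleSum)->Arg(40000000);

BENCHMARK_MAIN();
```

Built with something like `g++ -O3 bench.cpp -lbenchmark -lpthread`, an unrolled variant registered the same way would typically report a very similar time.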

1 Answer


Loop unrolling is a common compiler optimisation and may have been done for you.

Then, because you've commented out the print statements, `vect1` and `vect2` are never read, so the compiler could have optimised away the loops that write them entirely.

It could even have optimised away the loop that fills `a`, `b`, and `c`.
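
For instance, here is a small, self-contained variation of the first timed loop that keeps the optimiser from deleting it, simply by giving the result an observable use after timing. The constant inputs and the checksum are additions for illustration; they are not part of the question's program.

```cpp
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int n = 40000000;
    std::vector<int> a(n, 1), b(n, 2), c(n, 3), vect1(n);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        vect1[i] = a[i] + b[i] + c[i];
    auto end = std::chrono::steady_clock::now();

    // Reading the result after timing gives the loop an observable effect,
    // so it cannot be removed as dead code even at -O3.
    long long checksum = 0;
    for (int i = 0; i < n; ++i) checksum += vect1[i];

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms, checksum = " << checksum << '\n';
}
```

Compiled with `g++ -O3`, the loop now has to run, so the reported time reflects real work.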

As was said in the comments, understanding performance is hard for various reasons, including CPU aspects (the cache implementation) and compiler aspects (optimisations).

meaning-matters
  • The compiler might even re-roll the loop. The code that adds ten elements per iteration might get transformed into doing only four per iteration. Although that is less likely. – Goswin von Brederlow Aug 03 '22 at 17:46
  • Note: gcc has a missed-optimization problem in the above code: it won't eliminate `a`, `b`, `c`. clang, on the other hand, will eliminate those provided `n` is small enough that it will eliminate the `vect1` and `vect2` loops (clang is too stupid to understand that they have no side effect unless it unrolls them completely internally, which is size limited). With small `n`, clang will only call `rand()` (it can't optimize that function call away) and never use the result, and `a`, `b` and `c` will be optimized away. – Goswin von Brederlow Aug 03 '22 at 17:55