C++: Loop Optimization and Loop Unwinding (To loop or not to loop)

Question

Update:

This discussion went further than I expected, so I'm updating this with the code I was actually working on when the question popped into my head. It was a decision between 8 and 16 lines of code to determine the winner of a tic-tac-toe game for my intro to C++ course.

Note: this is designed to be on-level with the course.

Note 2: a token is a char, either x or o or ' '.

This is a question of optimization. If this is a repeat I apologize but I couldn't locate an answer elsewhere.

Basically, it came down to whether or not the following code would be better looped or not:

    char CheckForWinner() {
        // Returns the token of the player that satisfies one of the winning requirements
        if (Square[0][0] == Square[0][1] && Square[0][0] == Square[0][2]) { // If all three tokens in the first row are the same
            return Square[0][0]; // Return the token
        } else if (Square[1][0] == Square[1][1] && Square[1][0] == Square[1][2]) { // Check the next row
            return Square[1][0]; // Return the token
        } else if (Square[2][0] == Square[2][1] && Square[2][0] == Square[2][2]) {
            return Square[2][0];
        } else if (Square[0][0] == Square[1][0] && Square[0][0] == Square[2][0]) { // If no rows satisfy conditions, check columns
            return Square[0][0]; // Return the token
        } else if (Square[0][1] == Square[1][1] && Square[0][1] == Square[2][1]) {
            return Square[0][1];
        } else if (Square[0][2] == Square[1][2] && Square[0][2] == Square[2][2]) {
            return Square[0][2];
        } else if (Square[0][0] == Square[1][1] && Square[0][0] == Square[2][2]) { // Finally, check diagonals
            return Square[0][0];
        } else if (Square[0][2] == Square[1][1] && Square[0][2] == Square[2][0]) {
            return Square[0][2];
        }

        return ' ';
    }

Is this more or less taxing on a system than simply typing 100 cout lines?

I'm curious because it seems like not only do we perform 100 cout lines but we also allocate a new variable to memory, and force the computer to process 100 math equations as well as output the data.

I can understand that a compiler may provide some level of optimization, but I'd be interested to know on a more general level. Primarily, I compile using Visual Studio 2012 or MinGW (g++).

Time to read about [loop unwinding](http://en.wikipedia.org/wiki/Loop_unwinding). — Ted Hopp, Dec 02 '13 at 04:50
I wish people like you would post answers Ted, That article answered my question perfectly (even though it is Wikipedia 8-P) — gNerb, Dec 02 '13 at 04:53
I'm glad that the link answers your question, but my comment doesn't really qualify as an answer. From [meta](http://meta.stackexchange.com/a/95691/156967): _"A link-only answer is not a desirable answer.... If the link looks like it may be useful, convert it to a comment."_ — Ted Hopp, Dec 02 '13 at 04:57
In this particular example, I think you'd be hard-pressed to see a performance difference between a fully unrolled loop and the original. The fully unrolled loop is likely to be a shade slower, actually, because it will likely be much _larger_. Unrolling usually brings benefits, though, especially if the unrolled loop exposes other optimization opportunities such as vectorization. — Joe Z, Dec 02 '13 at 05:00
I disagree with the vast majority of that opinion. Your link was very helpful. The opinion itself even suggests that your comment does, in fact, satisfy the requirements to be considered an answer as it was not "link only" (roughly quoting "could not be bothered to enter even a single sentence"). It was a simple question that demanded a simple answer and your comment was just that. I'm at least glad you are getting some reputation from those up votes on your comment. — gNerb, Dec 02 '13 at 05:04
Heh. Comment up-votes don't add reputation. They're just a way for the community to indicate a useful comment. But per request, I've converted the comment to an answer. — Ted Hopp, Dec 02 '13 at 05:08
At the level of assembly language the loop management requires only one instruction per loop. This is likely to run faster than unrolling the loop because it keeps the code very small so it can execute out of on-chip cache. — ScottMcP-MVP, Dec 02 '13 at 05:16
@ScottMcP-MVP - That analysis completely ignores the potentially severe impact that the loop test has on instruction pipelining. A very simple loop like that can potentially execute many times faster when unrolled (depending on the computer architecture) because the loop form has a conditional jump that can mess up the look-ahead instruction cache at each loop iteration. Also, it's unlikely that a call to the I/O routines for `cout` will all fit in the on-chip cache. — Ted Hopp, Dec 02 '13 at 05:27

Jerry Coffin · Answer 1 · 2013-12-02T18:11:10.520

There is no single answer about whether unrolling all 100 iterations of the loop would be effective.

For a "smaller" system with no code cache, chances are pretty good that unrolling all 100 iterations would be optimal, at least in terms of execution speed. On the other hand, a system small enough that its CPU doesn't have a cache will typically be constrained enough in other resources that doing so would be highly inadvisable.

If the system does have a cache, chances are quite good that unrolling all 100 iterations of the loop would tend to cause slower execution. The overhead of the loop itself almost certainly takes less time than re-fetching essentially identical code 100 times over.

In a typical case, loop unrolling is most effective when a few iterations of the loop are unrolled (but typically fewer than 100 iterations). In a typical case you'd see a broad plateau around 4 to 16 iterations being unrolled.
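
As a rough sketch of what unrolling "a few iterations" looks like in source form, the hypothetical `sum_unrolled4` below handles four elements per pass and finishes any leftover elements in a cleanup loop. The function names and the array-summing task are assumptions chosen for illustration, not something from the question, and an optimizing compiler may well produce similar machine code from the plain loop on its own.

```cpp
#include <cstddef>
#include <vector>

// Plain loop: one addition and one loop test per element.
long sum_simple(const std::vector<int>& v) {
    long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}

// Unrolled by 4: one loop test per four elements.
long sum_unrolled4(const std::vector<int>& v) {
    long total = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; ++i)  // cleanup for the last 0 to 3 elements
        total += v[i];
    return total;
}
```

Both functions return the same sum; whether the unrolled one is actually faster has to be measured on the target compiler and CPU.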

As is typical of many taking a first stab at optimization, however, I'd guess you're really looking in entirely the wrong direction. If you want to optimize that loop, chances are that (by far) the biggest gain will come from making a slight change to what you do in the loop. I'd be willing to bet that any improvement you get from unrolling the loop will be too small to measure dependably, not to mention actually notice (even if you increase the number of iterations from 100 to, say, a few million).

On the other hand, if you rewrite the loop to eliminate the unnecessary buffer flush every iteration:

    for ( int i = 1; i <= 100; i++ )
        cout << i << "\n";

[In case you didn't realize it: std::endl inserts a new-line into a stream and flushes the stream. In most cases (probably including this one) the buffer flush is unnecessary and probably inadvisable. Removing it can improve speed a lot; improvement by a factor of 8:1 or 10:1 is fairly common.]

Chances are that it won't take much to measure the difference in speed at all. There's a pretty fair chance you'll be able to measure it at 100 iterations, and if you try more iterations, the difference is likely to become almost painfully obvious.
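
One way to set up that comparison is sketched below. The two printer functions differ only in `std::endl` versus `"\n"`; the names `print_with_endl`, `print_with_newline`, and `render` are hypothetical helpers, not anything from the question. The `render` helper only confirms that both variants emit the same text. It says nothing about speed, since flushing a string stream costs next to nothing.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Writes 1..n, flushing the stream after every line (std::endl = '\n' + flush).
void print_with_endl(std::ostream& out, int n) {
    for (int i = 1; i <= n; i++)
        out << i << std::endl;
}

// Writes 1..n with a plain newline and no explicit flush.
void print_with_newline(std::ostream& out, int n) {
    for (int i = 1; i <= n; i++)
        out << i << "\n";
}

// Captures a printer's output so the two variants can be compared.
std::string render(void (*printer)(std::ostream&, int), int n) {
    std::ostringstream buffer;
    printer(buffer, n);
    return buffer.str();
}
```

To see the timing difference, pass each function a `std::ofstream` (or `std::cout` redirected to a file) and time the calls with `std::chrono::steady_clock`; the ratio you get will depend on the platform and the output device.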

When you're dealing with a loop that's not I/O bound, and not open to obvious, massive improvement like this one, loop unrolling is likely to become a more attractive option. In this case, you first need to be aware that most compilers can do loop unrolling automatically, so trying to do it in the source code is unlikely to help a lot unless that opens up opportunities for other optimizations. For example, if you have a loop that really does one thing on even iterations and another on odd iterations, unrolling those two iterations can eliminate the condition and the jumps completely. Doing that by hand may provide a meaningful improvement, since the compiler may not "notice" the odd/even pattern and eliminate the conditions, jumps, etc. on its own.
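
A minimal sketch of that odd/even case, using a made-up task (add the elements at even indexes, subtract the ones at odd indexes). Both function names are hypothetical. Unrolling by two removes the parity test from the loop body entirely.

```cpp
#include <cstddef>
#include <vector>

// Original shape: the body branches on the parity of i in every iteration.
long alternating_sum_branchy(const std::vector<int>& v) {
    long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i % 2 == 0)
            total += v[i];  // even index: add
        else
            total -= v[i];  // odd index: subtract
    }
    return total;
}

// Unrolled by two: each pass handles one even and one odd index, so no parity test.
long alternating_sum_unrolled(const std::vector<int>& v) {
    long total = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        total += v[i];
        total -= v[i + 1];
    }
    if (i < n)  // odd-length input: one even-index element is left over
        total += v[i];
    return total;
}
```

Whether this beats what the compiler generates for the branchy version is something to check by measuring or by reading the generated assembly, not to assume.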

Also note that a modern CPU can (and typically will) execute code in parallel, and execute code speculatively, which can eliminate most of the overhead of a loop. Since the branch of the loop will nearly always be taken (i.e., in all but the last iteration) the CPU's branch predictor will predict it as taken, so the CPU may have several iterations worth of instructions "in flight" simultaneously, even when you don't unroll the loop. Most of the code for the loop itself (e.g., incrementing i) can be executed in parallel with at least some other code in the loop, so the overhead of the loop is likely to be quite minimal anyway.

Edit 2: Looking at the specific question at hand, I think I'd do the job rather differently. Instead of storing the TTT board as a 2D array, I'd store it as a pair of bitmaps, one for X's and the other for O's. This lets you test for an entire winning combination in a single action instead of three separate comparisons. Since each row is 3 bits, it's probably easiest to use octal for the constants:

    #include <array>

    static const std::array<short, 8> winners = {
        /* rows */      0007, 0070, 0700,
        /* columns */   0111, 0222, 0444,
        /* diagonals */ 0124, 0421
    };

In this case, I almost certainly would use loops:

    char CheckForWinner(short X, short O) {
        // `winners` definition from above goes here.

        for (std::size_t i = 0; i < winners.size(); i++) {
            // Parentheses are required: == binds more tightly than &.
            if ((X & winners[i]) == winners[i])
                return 'X';
            if ((O & winners[i]) == winners[i])
                return 'O';
        }
        return ' ';
    }

The big question here would be whether you really want to pass the X and O boards separately, or whether it makes more sense to pass an array of two shorts. The obvious advantage of using an array would be easier access to the opposite board. For example, to test whether a move is allowed in one board, you'd check whether that bit is set in the other board. With the boards stored in an array, you can be passed an n indicating the board where you want to make a move, and use 1-n to get the other board, where you'll check if that bit is already set.
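
A sketch of the array-of-two-boards idea, under a few assumptions that are not spelled out above: bit `row * 3 + col` marks a square, `boards[0]` belongs to X and `boards[1]` to O, and the names `Boards`, `try_move`, and `has_won` are hypothetical. The check against the player's own board is an extra safeguard beyond the "other board" test described above.

```cpp
#include <array>

// boards[0] holds X's squares, boards[1] holds O's squares (assumed layout).
using Boards = std::array<short, 2>;

// Tries to place a piece for player n (0 or 1) at (row, col).
// Returns false and leaves the boards unchanged if the square is already taken.
bool try_move(Boards& boards, int n, int row, int col) {
    const short bit = static_cast<short>(1 << (row * 3 + col));
    if ((boards[1 - n] & bit) || (boards[n] & bit))  // 1 - n selects the other board
        return false;
    boards[n] = static_cast<short>(boards[n] | bit);
    return true;
}

// True if player n occupies all three squares of some winning line.
bool has_won(const Boards& boards, int n) {
    static const std::array<short, 8> winners = {
        0007, 0070, 0700,  // rows
        0111, 0222, 0444,  // columns
        0124, 0421         // diagonals
    };
    for (short w : winners)
        if ((boards[n] & w) == w)
            return true;
    return false;
}
```

With this layout, a caller alternates `n` between 0 and 1 and checks `has_won(boards, n)` after each successful `try_move`.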

Good stuff. The code was a random "here's a for-loop" example. The code I was actually writing at the time I thought of this was 11 lines that would have taken either creatively nested for loops or 3 different sets of for-loops (basically, write 9 more lines of code or use for loops) and didn't have anything to do with cout. But this information is very useful indeed! – gNerb, Dec 02 '13 at 05:19
The board is actually a struct that has a 2D array for the grid and several methods within it to manipulate and evaluate the board, and the player tokens and whatnot are in a struct for the player. The players can pick any single character as their token. I did not think of using two separate "boards." A char array makes sense and is easy to "see", and this code is designed to be "on level" with the course. Maybe when I get back to my other computer I'll set up something to show the entire program, but it was never my intention to discuss absolute code optimization. Seems a big topic for Q&A form. – gNerb, Dec 02 '13 at 20:48

Ted Hopp · Accepted Answer · score 4 · answered Dec 02 '13 at 05:11

What you're talking about is called loop unwinding. The performance trade-offs are complex and depend on many aspects of both the compiler and the execution environment. See the [Wikipedia article on loop unwinding](http://en.wikipedia.org/wiki/Loop_unwinding) for a discussion of the issues.

fredoverflow · Answer 3 · 2013-12-02T19:12:55.393

By encoding which positions are part of which lines, you can perform the win check very efficiently:

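    #include <cstdio>   // printf
    #include <cstdlib>  // exit

    char square[3][3] = {' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '};
    char player = 'x';
    unsigned progress[2] = {0, 0};  // per-player line counters, one hex digit per line

    // lines[y][x]: one hex digit per line that the square (y, x) belongs to
    const unsigned lines[3][3] = {
        0x10010010,
        0x10001000,
        0x10000101,

        0x01010000,
        0x01001011,
        0x01000100,

        0x00110001,
        0x00101000,
        0x00100110
    };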

The encoding is "top row, middle row, bottom row, left column, middle column, right column, downward diagonal, upward diagonal".

For example, the top-left position is part of the top row, the left column and the downward diagonal.

As soon as you have 3 pieces in the same line, the line is full and you win, so just keep adding the lines until you hit 3. You can recognize a 3 by two consecutive 1 bits, so p & (p >> 1) will be non-zero:

    void make_move(int y, int x)
    {
        square[y][x] = player;
        // 'x' & 1 == 0 and 'o' & 1 == 1, so each player gets their own counter
        unsigned p = (progress[player & 1] += lines[y][x]);
        if (p & (p >> 1))
        {
            printf("player %c has won!\n", player);
            exit(0);
        }
        else
        {
            player = 'x' + 'o' - player;  // switch to the other player
        }
    }
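
To make the counting trick concrete, here is a small sketch that reuses the same table but returns values instead of printing and exiting. The names `kLines`, `add_square`, and `has_full_line` are hypothetical, introduced only for this illustration.

```cpp
#include <cstdint>

// Same table as above: one hex digit per line, in the order top row, middle row,
// bottom row, left column, middle column, right column, downward diagonal, upward diagonal.
const std::uint32_t kLines[3][3] = {
    {0x10010010, 0x10001000, 0x10000101},
    {0x01010000, 0x01001011, 0x01000100},
    {0x00110001, 0x00101000, 0x00100110}
};

// Adds the square at (y, x) to a player's running counters and returns the new value.
std::uint32_t add_square(std::uint32_t progress, int y, int x) {
    return progress + kLines[y][x];
}

// A hex digit never exceeds 3 here, so two adjacent set bits can only mean a digit equal to 3.
bool has_full_line(std::uint32_t progress) {
    return (progress & (progress >> 1)) != 0;
}
```

Taking the three top-row squares in order gives 0x10010010, then 0x20011010, then 0x30011111. Only the last value has a digit equal to 3 (the leading one, for the top row), so `has_full_line` becomes true exactly on the third move.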

it's very interesting to see how far you can optimize code as simple as this. I'd like to mention that the method above sits within a struct called Grid and the Player is handled by a struct called Player (I did not use classes because that is the next course at my uni). The code above won't transpose perfectly unfortunately, but it's definitely something I have not thought of. Do you have any links to resources regarding encoding like this? — gNerb, Dec 02 '13 at 20:37

Kirill Kobelev · Answer 4 · score 2 · 2013-12-02T05:30:59.327

When thinking about loop unwinding, you need to estimate the ratio between the cost of the loop body and the overhead of managing the loop.

It is true that even the simplest for loop adds several instructions of overhead. But in your case the cost of the I/O call will outweigh those instructions by a factor of 10 to 100.

Unwinding makes sense when the body of the loop does some manipulation in memory that requires several, maybe a dozen, asm instructions. For example:

    // Process digits starting from the last one.
    wchar_t carry_bit = 0;
    while (curr_digit_offs >= 0)
    {
        wchar_t ch = fpb[curr_digit_offs];
        fpb[curr_digit_offs--] = g_RawScan_MultiplyBy2[ch & 15] + carry_bit;
        carry_bit = (ch >= L'5') ? TRUE : FALSE;
    }

In the example above the body of the loop is not calling any external function. It only works with data structures in memory. This means that its complexity can be estimated.

Each particular case needs its own estimate.

edited Dec 02 '13 at 05:30

answered Dec 02 '13 at 05:18


Can you provide a simple example for your statement about manipulating memory? – gNerb Dec 02 '13 at 05:24
I made a modification. Manipulating memory means that the loop is not reading or writing a database, sending something over the network or to a graphics co-processor, etc. – Kirill Kobelev Dec 02 '13 at 05:32
If you'd be so kind as to refer to the code I have updated on my post, I have updated it to reflect the code I was actually working on. I believe that I have made the right call in unwinding it, have I not? – gNerb Dec 02 '13 at 05:40
You should notice a benefit, because you use constant indexes that will be calculated during compilation, versus computing the offset inside the body of the loop. – Kirill Kobelev Dec 02 '13 at 05:45
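
For comparison with the unwound version in the question, a looped variant might look like the sketch below. It assumes `Square` is the 3x3 char grid from the question with ' ' marking an empty cell; the name `CheckForWinnerLooped` is hypothetical. It also adds an empty-cell test that the original does not have: as posted, an all-empty first row compares equal to itself, so the function would return ' ' before looking at the other lines.

```cpp
// Assumed board from the question: ' ' marks an empty cell.
char Square[3][3] = {
    {' ', ' ', ' '},
    {' ', ' ', ' '},
    {' ', ' ', ' '}
};

// Returns the winning token, or ' ' if no line is complete.
char CheckForWinnerLooped() {
    for (int i = 0; i < 3; i++) {
        // Row i: indexes are computed from i at run time instead of being constants.
        if (Square[i][0] != ' ' &&
            Square[i][0] == Square[i][1] && Square[i][0] == Square[i][2])
            return Square[i][0];
        // Column i
        if (Square[0][i] != ' ' &&
            Square[0][i] == Square[1][i] && Square[0][i] == Square[2][i])
            return Square[0][i];
    }
    // Both diagonals pass through the centre square.
    if (Square[1][1] != ' ' &&
        ((Square[0][0] == Square[1][1] && Square[1][1] == Square[2][2]) ||
         (Square[0][2] == Square[1][1] && Square[1][1] == Square[2][0])))
        return Square[1][1];
    return ' ';
}
```

For a 3x3 board the difference in speed between the two forms is unlikely to be noticeable, so readability is probably the better basis for choosing.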
