Performance of pIter != cont.end() in for loop

Question

I was getting through "Exceptional C++" by Herb Sutter lately, and I have serious doubts about a particular recommendation he gives in Item 6 - Temporary Objects.

He offers to find unnecessary temporary objects in the following code:

string FindAddr(list<Employee> emps, string name) 
{
  for (list<Employee>::iterator i = emps.begin(); i != emps.end(); i++)
  {
    if( *i == name )
    {
      return i->addr;
    }
  }
  return "";
}

As one of the example, he recommends to precompute the value of emps.end() before the loop, since there is a temporary object created on every iteration:

For most containers (including list), calling end() returns a temporary object that must be constructed and destroyed. Because the value will not change, recomputing (and reconstructing and redestroying) it on every loop iteration is both needlessly inefficient and unaesthetic. The value should be computed only once, stored in a local object, and reused.

And he suggests replacing by the following:

list<Employee>::const_iterator end(emps.end());
for (list<Employee>::const_iterator i = emps.begin(); i != end; ++i)

For me, this is unnecessary complication. Even if one replaces ugly type declarations with compact auto, he still gets two lines of code instead of one. Even more, he has this end variable in the outer scope.

I was sure modern compilers will optimize this piece of code anyway, because I'm actually using const_iterator here and it is easy to check whether the loop content is accessing the container somehow. Compilers got smarter within the last 13 years, right?

Anyway, I will prefer the first version with i != emps.end() in most cases, where I'm not so much worried about performance. But I want to know for sure, whether this is a kind of construction I could rely on a compiler to optimize?

Update

Thanks for your suggestions on how to make this useless code better. Please note, my question is about compiler, not programming techniques. The only relevant answers for now are from NPE and Ellioh.

Yes, this is true, because this is a find-temporaries example. I'm asking about particular part of this example. — Mikhail, Mar 15 '13 at 13:14
Instead of trying to write code that is as fast as possible, write code that is easy to read and easy to understand. Do not waste time optimizing your code unless it REALLY needs to be optimized. — LihO, Mar 15 '13 at 13:22
@LihO Yes, I totally agree with you. See my last paragraph. My question is not about that. — Mikhail, Mar 15 '13 at 13:26

Ellioh · Accepted Answer · 2013-03-15T17:02:59.200

UPD: The book you are speaking about has been published in 1999, unless I'm mistaking. That's 14 years ago, and in modern programming 14 years is a lot of time. Many recommendations that were good and reliable in 1999, may be completely obsolete by now. Though my answer is about a single compiler and a single platform, there is also a more general idea.

Caring about extra variables, reusing a return value of trivial methods and similar tricks of old C++ is a step back towards the C++ of 1990s. Trivial methods like end() should be inlined quite well, and the result of inlining should be optimized as a part of the code it is called from. 99% situations do not require manual actions such as creating an end variable at all. Such things should be done only if:

You KNOW that on some of the compilers/platforms you should run on the code is not optimized well.
It has become a bottleneck in your program ("avoid premature optimization").

I've looked at what is generated by 64-bit g++:

gcc version 4.6.3 20120918 (prerelease) (Ubuntu/Linaro 4.6.3-10ubuntu1)

Initially I thought that with optimizations on it should be ok and there should be no difference between two versions. But looks like things are strange: the version you considered non-optimal is actually better. I think, the moral is: there is no reason to try being smarter than a compiler. Let's see both versions.

#include <list>

using namespace std;

int main() {
  list<char> l;
  l.push_back('a');

  for(list<char>::iterator i=l.begin(); i != l.end(); i++)
      ;

  return 0;
}

int main1() {
  list<char> l;
  l.push_back('a');
  list<char>::iterator e=l.end();
  for(list<char>::iterator i=l.begin(); i != e; i++)
      ;

  return 0;
}

Then we should compile this with optimizations on (I use 64-bit g++, you may try your compiler) and disassemble main and main1:

For main:

(gdb) disas main
Dump of assembler code for function main():
   0x0000000000400650 <+0>: push   %rbx
   0x0000000000400651 <+1>: mov    $0x18,%edi
   0x0000000000400656 <+6>: sub    $0x20,%rsp
   0x000000000040065a <+10>:    lea    0x10(%rsp),%rbx
   0x000000000040065f <+15>:    mov    %rbx,0x10(%rsp)
   0x0000000000400664 <+20>:    mov    %rbx,0x18(%rsp)
   0x0000000000400669 <+25>:    callq  0x400630 <_Znwm@plt>
   0x000000000040066e <+30>:    cmp    $0xfffffffffffffff0,%rax
   0x0000000000400672 <+34>:    je     0x400678 <main()+40>
   0x0000000000400674 <+36>:    movb   $0x61,0x10(%rax)
   0x0000000000400678 <+40>:    mov    %rax,%rdi
   0x000000000040067b <+43>:    mov    %rbx,%rsi
   0x000000000040067e <+46>:    callq  0x400610 <_ZNSt8__detail15_List_node_base7_M_hookEPS0_@plt>
   0x0000000000400683 <+51>:    mov    0x10(%rsp),%rax
   0x0000000000400688 <+56>:    cmp    %rbx,%rax
   0x000000000040068b <+59>:    je     0x400698 <main()+72>
   0x000000000040068d <+61>:    nopl   (%rax)
   0x0000000000400690 <+64>:    mov    (%rax),%rax
   0x0000000000400693 <+67>:    cmp    %rbx,%rax
   0x0000000000400696 <+70>:    jne    0x400690 <main()+64>
   0x0000000000400698 <+72>:    mov    %rbx,%rdi
   0x000000000040069b <+75>:    callq  0x400840 <std::list<char, std::allocator<char> >::~list()>
   0x00000000004006a0 <+80>:    add    $0x20,%rsp
   0x00000000004006a4 <+84>:    xor    %eax,%eax
   0x00000000004006a6 <+86>:    pop    %rbx
   0x00000000004006a7 <+87>:    retq

Look at the commands located at 0x0000000000400683-0x000000000040068b. That's the loop body and it seems to be perfectly optimized:

   0x0000000000400690 <+64>:    mov    (%rax),%rax
   0x0000000000400693 <+67>:    cmp    %rbx,%rax
   0x0000000000400696 <+70>:    jne    0x400690 <main()+64>

For main1:

(gdb) disas main1
Dump of assembler code for function main1():
   0x00000000004007b0 <+0>: push   %rbp
   0x00000000004007b1 <+1>: mov    $0x18,%edi
   0x00000000004007b6 <+6>: push   %rbx
   0x00000000004007b7 <+7>: sub    $0x18,%rsp
   0x00000000004007bb <+11>:    mov    %rsp,%rbx
   0x00000000004007be <+14>:    mov    %rsp,(%rsp)
   0x00000000004007c2 <+18>:    mov    %rsp,0x8(%rsp)
   0x00000000004007c7 <+23>:    callq  0x400630 <_Znwm@plt>
   0x00000000004007cc <+28>:    cmp    $0xfffffffffffffff0,%rax
   0x00000000004007d0 <+32>:    je     0x4007d6 <main1()+38>
   0x00000000004007d2 <+34>:    movb   $0x61,0x10(%rax)
   0x00000000004007d6 <+38>:    mov    %rax,%rdi
   0x00000000004007d9 <+41>:    mov    %rsp,%rsi
   0x00000000004007dc <+44>:    callq  0x400610 <_ZNSt8__detail15_List_node_base7_M_hookEPS0_@plt>
   0x00000000004007e1 <+49>:    mov    (%rsp),%rdi
   0x00000000004007e5 <+53>:    cmp    %rbx,%rdi
   0x00000000004007e8 <+56>:    je     0x400818 <main1()+104>
   0x00000000004007ea <+58>:    mov    %rdi,%rax
   0x00000000004007ed <+61>:    nopl   (%rax)
   0x00000000004007f0 <+64>:    mov    (%rax),%rax
   0x00000000004007f3 <+67>:    cmp    %rbx,%rax
   0x00000000004007f6 <+70>:    jne    0x4007f0 <main1()+64>
   0x00000000004007f8 <+72>:    mov    (%rdi),%rbp
   0x00000000004007fb <+75>:    callq  0x4005f0 <_ZdlPv@plt>
   0x0000000000400800 <+80>:    cmp    %rbx,%rbp
   0x0000000000400803 <+83>:    je     0x400818 <main1()+104>
   0x0000000000400805 <+85>:    nopl   (%rax)
   0x0000000000400808 <+88>:    mov    %rbp,%rdi
   0x000000000040080b <+91>:    mov    (%rdi),%rbp
   0x000000000040080e <+94>:    callq  0x4005f0 <_ZdlPv@plt>
   0x0000000000400813 <+99>:    cmp    %rbx,%rbp
   0x0000000000400816 <+102>:   jne    0x400808 <main1()+88>
   0x0000000000400818 <+104>:   add    $0x18,%rsp
   0x000000000040081c <+108>:   xor    %eax,%eax
   0x000000000040081e <+110>:   pop    %rbx
   0x000000000040081f <+111>:   pop    %rbp
   0x0000000000400820 <+112>:   retq

The code for the loop is similar, it is:

   0x00000000004007f0 <+64>:    mov    (%rax),%rax
   0x00000000004007f3 <+67>:    cmp    %rbx,%rax
   0x00000000004007f6 <+70>:    jne    0x4007f0 <main1()+64>

But there is alot of extra stuff around the loop. Apparently, extra code has made the things WORSE.

Wow, that's great! Could you do the same check for other containers, at least for `std::vector` please? :) — Mikhail, Mar 15 '13 at 13:52
For vector I see no difference between main() and main1(), and the empty loop is totally eliminated by the optimizer in both cases. — Ellioh, Mar 15 '13 at 14:02
Yeah, that's why I think the question is a bit tricky and a little vague. — Mikhail, Mar 15 '13 at 14:07
Same as with NPE answer, this is focusing on an implementation of the standard library where the iterator is a thin wrapper around a pointer. That is not the case with other library implementations, and there will be differences there. — David Rodríguez - dribeas, Mar 15 '13 at 14:27
@David Rodríguez - dribeas, that's true. But what I wanted to show is that in real cases the result of creating an extra local variable may not be worth several extra bytes of code it requires. The concrete example is just a single case. But the general idea is that there is no need to try doing compiler's job unless you already know that it is done poorly. — Ellioh, Mar 15 '13 at 14:35
... yet this answer does not answer the question of whether you should care or not, only that in this particular instance you don't. — David Rodríguez - dribeas, Mar 15 '13 at 14:55
This is the situation where I'd like to accept more than one answer. However, yours was the first and the rationale in update section seems clear enough. Accepted! — Mikhail, Mar 15 '13 at 20:59

score 8 · Answer 2 · answered Mar 15 '13 at 13:22

I've compiled the following slightly hacky code using g++ 4.7.2 with -O3 -std=c++11, and got identical assembly for both functions:

#include <list>
#include <string>

using namespace std;

struct Employee: public string { string addr; };

string FindAddr1(list<Employee> emps, string name)
{
  for (list<Employee>::const_iterator i = emps.begin(); i != emps.end(); i++)
  {
    if( *i == name )
    {
      return i->addr;
    }
  }
  return "";
}

string FindAddr2(list<Employee> emps, string name)
{
  list<Employee>::const_iterator end(emps.end());
  for (list<Employee>::const_iterator i = emps.begin(); i != end; i++)
  {
    if( *i == name )
    {
      return i->addr;
    }
  }
  return "";
}

In any event, I think the choice between the two versions should be primarily based on grounds of readability. Without profiling data, micro-optimizations like this to me look premature.

That's what I was asking for! But I'd prefer some comments for, say, compiler developers. Because this is a model and simple example. Who knows what compiler will decide in more complicated cases? — Mikhail, Mar 15 '13 at 13:24
This answer is only dealing with the specific implementation of a standard library with no safe-iterators. Try this with the Dinkumware standard library that ships with VS and the result of the test will differ (the iterator is not a thin wrapper around a pointer, but does more logic, stores enough information to track whether the iterator becomes invalidated at a later time). — David Rodríguez - dribeas, Mar 15 '13 at 14:26

score 4 · Answer 3 · answered Mar 15 '13 at 14:59

Contrary to popular belief, I don't see any difference between VC++ and gcc in this respect. I did a quick check with both g++ 4.7.2 and MS C++ 17 (aka VC++ 2012).

In both cases I compared the code generated with the code as in the question (with headers and such added to let it compile), to the following code:

string FindAddr(list<Employee> emps, string name) 
{
    auto end = emps.end();
    for (list<Employee>::iterator i = emps.begin(); i != end; i++)
    {
        if( *i == name )
        {
            return i->addr;
        }
    }
    return "";
}

In both cases the result was essentially identical for the two pieces of code. VC++ includes line-number comments in the code, which changed because of the extra line, but that was the only difference. With g++ the output files were identical.

Doing the same with std::vector instead of std::list, gave pretty much the same result -- no significant difference. For some reason, g++ did switch the order of operands for one instruction, from cmp esi, DWORD PTR [eax+4] to cmp DWORD PTR [eax+4], esi, but (again) this is utterly irrelevant.

Bottom line: no, you're not likely to gain anything from manually hoisting the code out of the loop with a modern compiler (at least with optimization enabled -- I was using /O2b2 with VC++ and /O3 with g++; comparing optimization with optimization turned off seems pretty pointless to me).

Is that with checked iterators or with unsafe iterators? Additionally, what happens if the list is passed by reference? — David Rodríguez - dribeas, Mar 15 '13 at 15:31
@DavidRodríguez-dribeas: checked (I believe -- I didn't do anything to disable them anyway). — Jerry Coffin, Mar 15 '13 at 15:32
@DavidRodríguez-dribeas: Passing by reference doesn't seem to make any difference either. — Jerry Coffin, Mar 15 '13 at 15:53
why didn't anyone checked variant with second in-loop variable? `for (list::iterator i = emps.begin(), end = emps.end(); i != end; i++)`? Also this should be tested in run-time, same assembler code may take different time to run if just something had changed. Including the size of data. Actually, longer assembler code might be much faster one (some extended operations are way slower or have costs which appear only when certain conditions met). — Swift - Friday Pie, Apr 15 '19 at 08:46

score 3 · Answer 4 · answered Mar 15 '13 at 13:16

3

A couple of things... the first is that in general the cost of building an iterator (in Release mode, unchecked allocators) is minimal. They are usually wrappers around a pointer. With checked allocators (default in VS) you might have some cost, but if you really need the performance, after testing rebuild with unchecked allocators.

The code need not be as ugly as what you posted:

for (list<Employee>::const_iterator it=emps.begin(), end=emps.end(); 
                                    it != end; ++it )

The main decision on whether you want to use one or the other approaches should be in terms of what operations are being applied to the container. If the container might be changing it's size then you might want to recompute the end iterator in each iteration. If not, you can just precompute once and reuse as in the code above.

answered Mar 15 '13 at 13:16

David Rodríguez - dribeas

204,818
23
294
489

Nice thought about hiding this `end` inside `for`-declaration, thanks! Unfortunately, this isn't the answer to my question. – Mikhail Mar 15 '13 at 13:37
@Mikhail: Then I have missed what the question is... the first paragraph in the answer tackles the possible cost of copying, the second the issue with *ugliness* in the code. The third provides a rationale for when the cost is different... – David Rodríguez - dribeas Mar 15 '13 at 14:23
Then again, some tests by [Jerry Coffin here](http://stackoverflow.com/a/15435579/36565) seem to indicate that the compiler optimized and called the `end()` member function only once. – David Rodríguez - dribeas Mar 15 '13 at 19:54

Michael Kristofik · Answer 5 · 2013-03-15T13:33:03.553

3

If you really need the performance, you let your shiny new C++11 compiler write it for you:

for (const auto &i : emps) {
    /* ... */
}

Yes, this is tongue-in-cheek (sort of). Herb's example here is now out of date. But since your compiler doesn't support it yet, let's get to the real question:

Is this a kind of construction I could rely on a compiler to optimize?

My rule of thumb is that the compiler writers are way smarter than I am. I can't rely on a compiler to optimize any one piece of code, because it might choose to optimize something else that has a bigger impact. The only way to know for sure is to try out both approaches on your compiler on your system and see what happens. Check your profiler results. If the call to .end() sticks out, save it in a separate variable. Otherwise, don't worry about it.

edited Mar 15 '13 at 13:33

answered Mar 15 '13 at 13:20

Michael Kristofik

34,290
15
75
125

I do like range-fors, but Visual Studio does not yet :( – Mikhail Mar 15 '13 at 13:22
@Mikhail: Range-based for loops are one of the many new and enhanced features in [Visual Studio 2012](http://msdn.microsoft.com/en-us/library/vstudio/hh409293.aspx). – Blastfurnace Mar 15 '13 at 17:47
@Blastfurnace Oh, I did see this page, but since I'm currently using VS2010 I just forgot it. Thank you for the link! – Mikhail Mar 15 '13 at 20:52
+1 from 2020, it appears that I no longer have to code ugly for loop with iterators declaration in my VS2017. – manuell Jan 15 '20 at 18:06

score 2 · Answer 6 · answered Mar 15 '13 at 13:15

Containers like vector returns variable, which stores pointer to the end, on end() call, that optimized. If you've written container which does some lookups, etc on end() call consider writing

for (list<Employee>::const_iterator i = emps.begin(), end = emps.end(); i != end; ++i)
{
...
}

for speed

score 0 · Answer 7 · answered Mar 15 '13 at 13:25

0

Use `std` algorithms

He's right of course; calling end can instantiate and destroy a temporary object, which is generally bad.

Of course, the compiler can optimise this away in a lot of cases.

There is a better and more robust solution: encapsulate your loops.

The example you gave is in fact std::find, give or take the return value. Many other loops also have std algorithms, or at least something similar enough that you can adapt - my utility library has a transform_if implementation, for example.

So, hide loops in a function and take a const& to end. Same fix as your example, but much much cleaner.

answered Mar 15 '13 at 13:25

Alex Chamberlain

4,147
2
22
49

I understand this. Of course, I wouldn't have written such a piece of code for myself, I would have used `std::find`, as you said. – Mikhail Mar 15 '13 at 13:31
@Mikhail I think this is the more general solution; to take a `const&` to the offending temporary. – Alex Chamberlain Mar 15 '13 at 13:33

Performance of pIter != cont.end() in for loop

7 Answers7

Use `std` algorithms

Linked

Related

Performance of pIter != cont.end() in for loop

7 Answers7

Use std algorithms

Linked

Related

Use `std` algorithms