
There are two well-known ways to set an integer register to zero on x86.

Either

mov reg, 0

or

xor reg, reg

There's an opinion that the second variant is better, since the value 0 is not encoded in the instruction, which saves several bytes of machine code. This is definitely good: less instruction cache is used, which can sometimes allow faster code execution. Many compilers produce such code.

However, there is formally an inter-instruction dependency between the xor instruction and whatever earlier instruction changed the same register. Since there's a dependency, the xor would have to wait until the earlier instruction completes, and this could reduce the utilization of the processor's execution units and hurt performance.

add reg, 17
;do something else with reg here
xor reg, reg

It's obvious that the result of xor will be exactly the same regardless of the initial register value. But is the processor able to recognize this?

I tried the following test in VC++7:

const int Count = 10 * 1000 * 1000 * 1000;
int _tmain(int argc, _TCHAR* argv[])
{
    int i;
    DWORD start = GetTickCount();
    for( i = 0; i < Count ; i++ ) {
        __asm {
            mov eax, 10
            xor eax, eax
        };
    }
    DWORD diff = GetTickCount() - start;
    start = GetTickCount();
    for( i = 0; i < Count ; i++ ) {
        __asm {
            mov eax, 10
            mov eax, 0
        };
    }
    diff = GetTickCount() - start;
    return 0;
}

With optimizations off, both loops take exactly the same time. Does this reasonably prove that the processor recognizes that the xor reg, reg instruction has no dependency on the earlier mov eax, 10 instruction? What would be a better test to check this?

starblue
sharptooth
  • I think this is why we use high-level languages. If you really want to know, just change the codegen stage to do one or the other. Benchmark. Pick the best. – jrockway Jul 16 '09 at 06:13
  • ah, the old `xor reg, reg` trick - good old times :) – Nick Dandoulakis Jul 16 '09 at 06:23
  • I think the x86 architecture explicitly defines XOR reg,reg as breaking the dependency on reg. See the Intel architecture manual. I'd expect MOV reg,... to do the same thing simply because it is a MOV. So your real choice is, which one takes less space (I'd guess execution time is the same), if you don't care about status bits (XOR damages them all). – Ira Baxter Jul 20 '09 at 20:00
  • your `Count` variable overflows, so the loops will run for far fewer iterations than you expected – phuclv Dec 06 '13 at 15:04
  • On more recent micro-architectures, `xor reg,reg` doesn't require an execution unit (handled in decode?). It breaks dependencies on `reg`, and partial flags update stalls. And it has a smaller encoding. There's no good reason for the `mov` approach on recent x86-64, unless you have to preserve the [e]flags. – Brett Hale Feb 09 '14 at 23:17
  • There are several subtle advantages beyond code-size to using a recognized zeroing idiom like `xor`, compared to `mov`. I wrote an answer on a more recent question before I saw this one: http://stackoverflow.com/questions/33666617/which-is-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and. I think it's a better and more complete answer than any of these. Ideally they should be marked as duplicates of each other. – Peter Cordes Jan 19 '16 at 05:29

6 Answers

34

An actual answer for you:

Intel 64 and IA-32 Architectures Optimization Reference Manual

Section 3.5.1.7 is where you want to look.

In short there are situations where an xor or a mov may be preferred. The issues center around dependency chains and preservation of condition codes.

In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero.

In contexts where the condition codes must be preserved, move 0 into the register instead.

Violet Giraffe
Mark
  • It doesn't sound like the quoted text recommends using a MOV in any situation. – mwfearnley May 07 '16 at 12:44
  • @mwfearnley Unfortunately Addison decided to edit my answer and cherry pick a subset of the content, it's unclear why this was done. You should read the full docs which cover situations where mov is preferred. – Mark May 09 '16 at 13:30
  • Thanks for clarifying. I guess it was an attempt to avoid the problem with the document moving/changing, but unfortunately the quote didn't contain all the points it needed.. I can see now from that section, it says to use MOV when you want to avoid setting the condition codes. – mwfearnley May 09 '16 at 13:58
  • @mwfearnley: It's rare that you can't just xor-zero ahead of setting flags. See [my answer on the more recent `xor` question](http://stackoverflow.com/questions/33666617/which-is-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and) for some suggestions on ways to avoid `mov reg, 0` in preparation for `setcc`. (And for more details on all the advantages of xor-zeroing). `mov reg,0` / `setcc` is terrible on older Intel CPUs, where reading the full reg causes a partial-register stall that `xor` would avoid. – Peter Cordes May 09 '16 at 18:58
18

On modern CPUs the XOR pattern is preferred. It is smaller, and faster.

Smaller actually does matter because on many real workloads one of the main factors limiting performance is i-cache misses. This wouldn't be captured in a micro-benchmark comparing the two options, but in the real world it will make code run slightly faster.

And, ignoring the reduced i-cache misses, XOR on any CPU in the last many years is the same speed or faster than MOV. What could be faster than executing a MOV instruction? Not executing any instruction at all! On recent Intel processors the dispatch/rename logic recognizes the XOR pattern, 'realizes' that the result will be zero, and just points the register at a physical zero-register. It then throws away the instruction because there is no need to execute it.

The net result is that the XOR pattern uses zero execution resources and can, on recent Intel CPUs, 'execute' four instructions per cycle. MOV tops out at three instructions per cycle.

For details see this blog post that I wrote:

https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/

Most programmers shouldn't be worrying about this, but compiler writers do have to worry, and it's good to understand the code that is being generated, and it's just frickin' cool!

Bruce Dawson
  • Great writeup! I wonder if the same pattern exists on Thumb. – Asti Jan 16 '21 at 11:49
  • It is quite likely that the same optimization exists on Thumb. The optimization is applicable to any out-of-order processor and should save power and sometimes improve performance. But, I don't know. – Bruce Dawson Jan 17 '21 at 20:23
13

x86 has variable-length instructions. MOV EAX, 0 takes three more bytes of code space than XOR EAX, EAX (5 bytes versus 2).

ajs410
  • `mov eax, 0` is 5 bytes: one for the `mov eax, imm32` opcode, and 4 for the 4B of immediate data. `xor eax, eax` is 2 bytes: one `xor r32, r/m32` opcode, one for operands. – Peter Cordes Dec 12 '15 at 00:15
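For reference, the two encodings look like this when disassembled (an illustrative listing, not runnable code; byte values as a typical assembler emits them):

```
b8 00 00 00 00    mov eax, 0      ; opcode B8+rd followed by a 4-byte immediate
31 c0             xor eax, eax    ; opcode 31 /r plus one ModRM byte
```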
12

I stopped being able to fix my own cars after I sold my 1966 HR station wagon. I'm in a similar fix with modern CPUs :-)

It really will depend on the underlying microcode or circuitry. It's quite possible that the CPU could recognise "XOR Rn,Rn" and simply zero all bits without worrying about the contents. But of course, it may do the same thing with a "MOV Rn, 0". A good compiler will choose the best variant for the target platform anyway so this is usually only an issue if you're coding in assembler.

If the CPU is smart enough, your XOR dependency disappears since it knows the value is irrelevant and will set it to zero anyway (again this depends on the actual CPU being used).

However, I'm long past caring about a few bytes or a few clock cycles in my code - this seems like micro-optimisation gone mad.

paxdiablo
  • Regardless of whether it is excessive optimization for practical use, there may be value to understanding that not all similar instructions are created equal. ;) – jerryjvl Jul 16 '09 at 06:18
  • @jerryjvl - It's also useful to realize that modern desktop x86 CPUs don't run x86 machine code directly - they decode the x86 into RISC-like internal instructions to execute. As such, they can recognize common code sequences (like xor eax, eax) and translate them into simpler internal operations, like maybe some "clear reg" instruction instead. An actual xor is probably not done in this case. – Michael Jul 16 '09 at 06:35
  • micro-optimization may need to go mad when you're writing an MBR =). – brianmearns Mar 25 '13 at 01:04
  • @sh1ftst0rm : only unclever people do such things these days. – Daniel Kamil Kozar May 06 '14 at 16:03
2

I think on earlier architectures the mov eax, 0 instruction used to take a little longer than xor eax, eax as well... I cannot recall exactly why. Unless you have many more such movs, however, I would imagine you're unlikely to cause cache misses due to that one literal stored in the code.

Also note that the flags are not affected identically by these two methods: xor updates all the arithmetic flags (it clears CF and OF and sets ZF), whereas mov leaves the flags untouched.

jerryjvl
-9

Are you writing a compiler?

And on a second note, your benchmarking probably won't work, since you have a branch in there that probably takes all the time anyway (unless your compiler unrolls the loop for you).

Another reason that you can't benchmark a single instruction in a loop is that all your code will be cached (unlike real code). So you have taken much of the size difference between mov eax, 0 and xor eax, eax out of the picture by keeping it all in the L1 cache the whole time.

My guess is that any measurable performance difference in the real world would be due to the size difference eating up the cache, and not due to execution time of the two options.

Thomas
  • This entire website has a "who cares" quality to the rest of the world. I don't think that would be a good answer. – Roman Starkov Jan 21 '11 at 10:47
  • Seems you and others are focusing on what I guess you perceive to be offensive. I have removed that part since I think you and others never read past that and just downvoted. – Thomas Jul 09 '19 at 17:05
  • For Sandybridge / Ivybridge, you can pretty easily construct a loop that runs at 1 iteration per clock with `nop` or `xor same,same`, but bottlenecks on ALU execution unit throughput with `mov reg,0`. Later Intel CPUs have 4 ALU execution units, so a concrete example of xor-zeroing elimination making a measurable difference other than code-size is a lot less easy to construct. (`xorps` zeroing of xmm/ymm regs is still easy, because there are fewer vector ALU ports than the front-end width). And AMD CPUs don't eliminate the back-end uop, so the advantage is really just code-size. – Peter Cordes Jul 10 '19 at 01:24
  • Most code does get L1i cache hits most of the time. L1i cache misses do happen, but *most* of the instructions executed over the course of a program do come from L1i cache, or even the smaller/faster uop cache. Most programs spend a lot of their time in small to medium sized loops. Caches work. – Peter Cordes Jul 10 '19 at 01:26
  • You're right that the OP's attempt to benchmark is unlikely to work, though. But it might on Sandybridge, if the loop overhead is 2 extra ALU uops making 4 total front-end uops. If one of them is an xor-zeroing that can be eliminated, the backend can handle it. – Peter Cordes Jul 10 '19 at 01:29
  • I agree with everything you said. I'm not up to speed on whether AGI stalls or an equivalent exist in modern Intel CPUs in addition to ALU bottlenecks, but the point remains the same: you can't benchmark an instruction in the way the OP does. It depends on all of the code around it, and the branch is only part of that. My point about the methodology and the "why?" remains the same, and is only reinforced by what you added.