First of all, I'd like to say that I'm new to ASM and if this is a stupid question please excuse it.

I read about partial register stalls in Agner Fog's microarchitecture manual (this seems a little advanced, but I was curious why 32-bit instructions in 64-bit mode zero the top half of the register). Example 6.13 gives a way to avoid a register stall. I am still a bit confused about it: why was an OR operation not used instead of the MOV, like this:

xor eax, eax
mov al, byte [mem8]
; or  al, byte [mem8] ; why not this?

I think the effect is the same. Do they both take the same number of cycles? Is one more efficient than the other? Is there something "under the hood" that would make me prefer one over the other?

Alex Gh
  • Yes. You could do this. But why? – zx485 Jul 16 '20 at 19:13
  • Look up the codes online to be sure, but I believe they are both the same speed. If you could load the whole register instead of just 8-bits, you would be able to remove the xor, but that is also probably pipelined away pretty easily. – Michael Dorgan Jul 16 '20 at 19:19
  • In this particular case, when the destination register is `al`, the `mov` instruction is one byte shorter, see https://godbolt.org/z/v114od – Nate Eldredge Jul 16 '20 at 19:42
  • Usually it's best not to have instructions depend on a register's previous contents if you can avoid it, as it restricts the processor's ability to execute out of order, so in general you would tend to prefer the `mov`. I don't know if it matters in this specific case, though. – Nate Eldredge Jul 16 '20 at 19:45
  • @zx485 In fact, some RISC architectures implement register-to-register moves as or instructions with a zero immediate. – fuz Jul 16 '20 at 19:45
  • At any rate, I can't think of any mechanism by which the `or` could be *faster*, and if they are the same speed you would prefer the `mov` anyway for readability. – Nate Eldredge Jul 16 '20 at 19:55
  • Do note that the effects are not exactly equivalent; the `or` will set flags and the `mov` will not. That probably doesn't matter in most applications, though. – Nate Eldredge Jul 16 '20 at 19:56

1 Answer

Partial register access in 64-bit mode

In 64-bit mode, the following rules apply when writing to a register narrower than 64 bits:

  • If a 32-bit register is written, the upper 32 bits of the associated 64-bit register are cleared.
  • If a 16- or 8-bit register is written, the upper 48 or 56 bits of the associated 64-bit register remain unchanged.

If only an 8-bit register is written, the old value of the associated 64-bit register must first be obtained, the 8-bit sub-register merged into it, and the combined value saved. The write therefore depends on the register's previous contents.
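The write rules above can be modelled in a few lines of Python. This is only a sketch of the architectural result, not of how any CPU implements it. The helper names `write32`, `write16` and `write8` are made up for illustration, and `write8` models a low-byte register such as al, not ah.

```python
MASK64 = (1 << 64) - 1  # a 64-bit register holds values 0 .. 2**64 - 1


def write32(reg: int, value: int) -> int:
    """Model a 32-bit write (e.g. to eax): the old contents are discarded
    and the upper 32 bits of the 64-bit register become zero."""
    return value & 0xFFFFFFFF


def write16(reg: int, value: int) -> int:
    """Model a 16-bit write (e.g. to ax): the upper 48 bits are kept."""
    return (reg & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write8(reg: int, value: int) -> int:
    """Model a low-byte write (e.g. to al): the upper 56 bits are kept."""
    return (reg & ~0xFF & MASK64) | (value & 0xFF)


rax = 0x1122334455667788
print(hex(write32(rax, 0xAABBCCDD)))  # 0xaabbccdd
print(hex(write16(rax, 0xAABB)))      # 0x112233445566aabb
print(hex(write8(rax, 0xAA)))         # 0x11223344556677aa
```

Note that `write32` never reads `reg`, while `write16` and `write8` do. That read of the old value is the merge described above.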

Example 6.13 from Agner Fog's microarchitecture manual is not related to this; it is only an alternative to movzx, because that instruction is slow on older Pentium processors.

mov or or?

The two lines

31 C0                   xor eax, eax
8A 05 ## ## ## ##       mov al, byte [mem8]

(machine-code bytes on the left) are probably faster than if you replaced the second line with

0A 05 ## ## ## ##       or  al, byte [mem8]

since the or has a dependency on the previous line: only once xor eax, eax has produced its result can the new value of eax be passed on to or. In addition, just as in the variant with mov, there may be a slowdown because only a partial register is accessed. Instead, I would suggest replacing these two lines with

0F B6 05 ## ## ## ##    movzx eax, byte [mem8]

This is one byte shorter than the previous approach (7 bytes instead of 2 + 6 = 8), and it is a single instruction that writes a full 32-bit register. As Agner Fog says:

The easiest way to avoid partial register stalls is to always use full registers and use MOVZX or MOVSX when reading from smaller memory operands.
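As a quick sanity check that the three variants leave the same value in the register, here is a small Python model. It is a sketch that tracks only the final register value, not flags or timing, and the function names are hypothetical.

```python
def xor_then_mov(rax: int, mem8: int) -> int:
    rax = 0                                # xor eax, eax: clears all 64 bits
    return (rax & ~0xFF) | (mem8 & 0xFF)   # mov al, [mem8]: replace low byte


def xor_then_or(rax: int, mem8: int) -> int:
    rax = 0                                # xor eax, eax
    al = (rax & 0xFF) | (mem8 & 0xFF)      # or al, [mem8]: reads old al first
    return (rax & ~0xFF) | al


def movzx_load(rax: int, mem8: int) -> int:
    return mem8 & 0xFF                     # movzx eax, byte [mem8]


old_rax = 0xDEADBEEFCAFEF00D               # arbitrary stale register contents
same = all(
    xor_then_mov(old_rax, b) == xor_then_or(old_rax, b) == movzx_load(old_rax, b)
    for b in range(256)
)
print(same)  # True
```

The register ends up identical in all three cases, so the choice comes down to code size, dependencies and flags. `or` writes the flags according to its result, while `mov` and `movzx` leave them untouched.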

fcdt
  • That's not a race condition, it's a simple data dependency. x86's execution model is serial, i.e. the machine gives the illusion of every instruction fully finishing before the next one starts. Thus no race. Note that on Sandybridge-family (at least Haswell and later), xor-zeroing has zero latency from when it's issued/renamed so that doesn't matter, and that `mov al, [mem]` is [also a merge so also has a dependency on the full register](//stackoverflow.com/q/45660139/) on that arch family. (It's also a merge on AMD; only old P6-family renames AL separately). Working on an answer.... – Peter Cordes Jul 16 '20 at 20:21