2

Suppose I have the following instruction - MOV X5, XZR
What part of the processor hardware would this MOV pseudo instruction use? What I mean is - does the MOV instruction require the use of the ALU or the Memory? It would obviously require accessing the register.

I am curious because I am going through the textbook "Computer Organization and Design", in which the authors discuss 2-issue processors. The requirement for 2 instructions to be in the same packet is that if one instruction is a memory instruction, then the other must be an ALU/logic instruction or a branch. The instruction I mentioned above is followed by a branch instruction, and I am not sure whether the 2 instructions can be in the same packet.

If you could share some information about how this pseudo instruction is actually implemented that would be very helpful as well. Thanks for any help.

Peter Cordes
rgbk21
  • 2
    I'm voting to close this question as off-topic because it is not a computer programming question. It's a CPU design question. Knowing which part of the hardware handles the MOV pseudo-instruction has no direct impact on how you write a program for the CPU. (It may have secondary impact if you're doing microbenchmarking, but that doesn't seem to be the question here.) – Raymond Chen Dec 03 '19 at 21:50
  • Ok, then what if I was interested in micro-benchmarking? Can I rephrase the question in that way? Would that meet the criteria for the community? – rgbk21 Dec 03 '19 at 21:57
  • For micro-benchmarking, show the two fragments of code you want to compare, measure the behavior, and present the results. Then ask for an explanation for the observed behavior. – Raymond Chen Dec 03 '19 at 22:01
  • 2
    It depends greatly on the implementation. It's unlikely that any ARMv8 CPU uses the same design as your text book does. In particular the Cortex-A53, the only ARMv8 dual-issue design I can find, doesn't seem to have any restrictions on pairing: "With A7 slot-0 was full-featured while slot-1 could only issue branch and integer data; now for A53, slot-1 can also issue load-stores and FP/NEON operations, bringing it up to parity with slot-0." https://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review/3 – Ross Ridge Dec 03 '19 at 22:23
  • From your question, it sounds like a branch can pair with anything. But anyway `mov reg,reg` or `mov reg, #imm` is normally considered ALU; if it needs an execution unit at all, it would be the ALU. Also, are you sure it's a pseudo-instruction? I thought AArch64 still has real hardware `mov`, not like MIPS where it would be something like `ori $dst, $zero, 0` – Peter Cordes Dec 04 '19 at 00:13
  • 1
    @PeterCordes A64 encodes register to register MOV as an ORR instruction similar to MIPS. http://infocenter.arm.com/help/topic/com.arm.doc.dui0802b/MOV_ORR_log_shift.html It probably doesn't actually execute as an ORR instruction, at least not on the out-of-order implementations. – Ross Ridge Dec 04 '19 at 01:58

2 Answers

4

XZR is an alias for a register that always returns 0 and can't be changed to anything but 0. It's new in AArch64, but other RISCs like MIPS have always had a zero register. (32-bit ARM / Thumb ARMv8 mode is a different architecture that some AArch64 CPUs can also execute.)

Registers don't exist in memory and don't involve memory unless an instruction is moving data from memory to a register or vice versa.

This instruction is basically setting register X5 to zero by copying one register to another.

ARM was part of the whole "RISC" paradigm, with some practical efficiency compromises. AArch64 makes it even more RISCy, removing some ARM things that complicate modern superscalar pipelines, as well as widening registers to 64-bit. Some design principles of that RISC paradigm are:

  • A large number of registers are provided. AArch64 has 31 general-purpose integer registers (X0-X30) plus the zero register, up from 15 in ARM (not including the program counter). (That was still large compared to x86's 8 back in the day.)
  • There are instructions to load and store data to/from registers (hence why RISC is also called "load-store architecture")
  • Other instructions such as ADD, SUB, etc. work on registers exclusively - there are limited register-with-memory operations. So things like "Add what's at memory location 1000 to register X" are not used - you have to "Load X2 with what's at memory location 1000" then "X = X + X2". (add reg, mem or even add mem,reg are classic CISC features that RISCs avoid.)
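As a rough illustration of the load/store split, here is a minimal Python sketch of a toy machine. The `ToyMachine` class and its method names are made up for this example and don't model any real ISA or simulator. The point is only that ALU-style operations touch registers, so adding a value from memory takes an explicit load first.

```python
class ToyMachine:
    """Hypothetical load-store machine: only ldr/str_ touch memory."""

    def __init__(self):
        self.regs = {f"x{i}": 0 for i in range(31)}  # x0..x30
        self.mem = {}                                # address -> value

    def ldr(self, dst, addr):
        # Load: memory -> register
        self.regs[dst] = self.mem.get(addr, 0)

    def str_(self, src, addr):
        # Store: register -> memory
        self.mem[addr] = self.regs[src]

    def add(self, dst, a, b):
        # ALU op: register operands only, no memory access
        self.regs[dst] = (self.regs[a] + self.regs[b]) & (2**64 - 1)


m = ToyMachine()
m.mem[1000] = 7
m.regs["x1"] = 5
m.ldr("x2", 1000)        # "Load X2 with what's at memory location 1000"
m.add("x1", "x1", "x2")  # "X = X + X2"
print(m.regs["x1"])
```

There is deliberately no `add` variant that takes an address; that is the CISC-style `add reg, mem` form the list above says RISCs avoid.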

So given that legacy you'd probably put this instruction in the "ALU" category, since it doesn't talk to memory at all and it only operates on integer registers (not FP/vector). As far as the rest of the pipeline is concerned, it only reads and writes integer register values; it doesn't access memory and doesn't branch.

But consider what an ALU actually does in a CPU: it takes inputs, performs an operation, then delivers the result to an output. In a RISC design the inputs always come from registers or from an immediate in the instruction itself.

With MOV, there is no real operation; the input is simply delivered to the output. It could bypass the ALU, or for simplicity of data paths it could still go through the ALU with control signals that make it do something like OR with 0, so the value comes out unchanged.
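To make the "OR with 0" idea concrete, here is a small Python sketch. It assumes the A64 alias noted in the comments (register-to-register `MOV Xd, Xm` is encoded as `ORR Xd, XZR, Xm`) and the 64-bit ORR (shifted register) field layout with no shift applied. The function names and the toy `execute_orr` step are invented for illustration and say nothing about how a real core executes it.

```python
MASK64 = (1 << 64) - 1
XZR = 31  # register number 31 reads as zero for this instruction form


def encode_orr_shifted_reg(rd, rn, rm):
    """Encode 64-bit ORR Xd, Xn, Xm (shifted register, LSL #0)."""
    return 0xAA000000 | (rm << 16) | (rn << 5) | rd


def encode_mov_reg(rd, rm):
    # MOV Xd, Xm is an alias of ORR Xd, XZR, Xm
    return encode_orr_shifted_reg(rd, XZR, rm)


def read_reg(regs, n):
    # The zero register always reads as 0; regs holds x0..x30
    return 0 if n == XZR else regs[n]


def execute_orr(regs, word):
    """Toy execute step: pull out the register fields and OR the sources."""
    rd = word & 0x1F
    rn = (word >> 5) & 0x1F
    rm = (word >> 16) & 0x1F
    result = (read_reg(regs, rn) | read_reg(regs, rm)) & MASK64
    if rd != XZR:  # writes to the zero register are discarded
        regs[rd] = result


regs = [0] * 31
regs[5] = 0xDEADBEEF            # some stale value in X5
word = encode_mov_reg(5, XZR)   # MOV X5, XZR
execute_orr(regs, word)
print(hex(word), regs[5])
```

Because one OR input is always the zero register, the other source passes through unchanged, which is why a plain ALU can handle the instruction without any dedicated "move" path.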

As you can see the real world is not as neat as your textbook. I don't know how the pipeline in any given ARM CPU actually works.

Peter Cordes
LawrenceC
  • 1
    XZR is new with AArch64. 32-bit ARM doesn't have a zero register. 32-bit ARM only has 16 architectural registers and one of those is the program counter (not usable for anything else). It's a load-store machine but is significantly less RISCy than most; e.g. it has microcoded push/pop instructions that push an arbitrary set of registers (encoded as a bitfield). For ARM, practical considerations won over RISC purity in a few cases. – Peter Cordes Dec 04 '19 at 00:42
  • 1
    I added a bunch of AArch64 vs. ARM stuff to make your answer not wrong, but unfortunately that kind of bloated things. Maybe remove some of the history since AArch64 was only [announced in 2011](https://en.wikipedia.org/wiki/ARM_architecture#AArch64), its more-RISCy design being well adapted to execution by modern superscalar pipelines including in-order and out-of-order (removing predication except for a couple instructions, and removing load/store-multiple push/pop). And aimed at modern transistor budgets. (e.g. AdvSIMD with 32x 128-bit vector registers is baseline, like SSE2 for x86-64.) – Peter Cordes Dec 04 '19 at 01:14
  • 1
    Anyway, you might want to trim your answer down after my edit to not distract from the details about the key point: Not that it actually does anything in the ALU, but rather that it definitely doesn't need any other special handling; not a branch, not memory, not FP/SIMD. I don't know if any AArch64 CPUs can do "mov elimination" and handle `mov reg,reg` in the register rename stage with zero latency (the way modern x86 can); unlikely because it's a 3-operand ISA so unlike x86 it doesn't need a lot of `mov tmp, reg1` / `and tmp, reg2` to non-destructively compute `reg1 & reg2`. – Peter Cordes Dec 04 '19 at 01:18
  • Cheers, you're welcome. Nice answer BTW; I hadn't thought of a useful way to answer the question. Explaining load/store machine and register fundamentals might well be what the OP was missing. – Peter Cordes Dec 04 '19 at 01:24
  • 1
    Ross Ridge commented under the question that AArch64 MOV is actually encoded as an `ORR` with the zero register, just like MIPS does. http://infocenter.arm.com/help/topic/com.arm.doc.dui0802b/MOV_ORR_log_shift.html. But high-performance implementations may special-case that if they want. – Peter Cordes Dec 04 '19 at 03:25
  • @PeterCordes If possible - could you please link the source where you found the information that MOV might not always be implemented as ORR? Thank you. – rgbk21 Dec 04 '19 at 14:59
  • 1
    @rgbk21: It's always *encoded* as ORR in machine code; I'm just saying that implementations *might* recognize that special case of `ORR reg, xzr, reg` when decoding and do something different, either to save power or possibly even to run it with zero latency like x86 `mov` elimination. I don't know if any real implementations *do* treat it specially; they don't really need to and the benefit would be small. – Peter Cordes Dec 04 '19 at 15:12
3

The question really is not about any particular ISA, even though the example uses AArch64 instruction mnemonics. It is about CPU micro-architecture, in particular a 2-way superscalar, in-order micro-architecture. Whether 2 instructions can be scheduled concurrently depends on the particular micro-architecture, so you'll get a different answer depending on which design you look at. Building a CPU involves many trade-offs to achieve a desired power, performance, and area target, which is why the answers differ.

Since you are reading "Computer Organization and Design", which is an entry-level CPU micro-architecture textbook, let's simplify the micro-architecture to something idealized instead of concerning yourself with an industry design, which at this point will likely only confuse you more. Assume your micro-architecture has 2 identical 3-stage pipes that can handle all operations in a single cycle with no bypass network. Your pipeline now looks like:

| Fetch0 | -> | Decode0 | -> | Execute+Writeback |
| Fetch1 | -> | Decode1 | -> | Execute+Writeback |

In this simplified case, the answer is that during decode your two decoders must do register dependency analysis on both instructions. If the mov produces a register that the branch consumes, they cannot execute together and you have to delay the branch until the mov has executed; otherwise they can flow down the pipeline together.
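A minimal Python sketch of that decode-time check for the idealized two-pipe model follows. The `Instr` record and `can_dual_issue` are hypothetical names, and the write-after-write test is an extra assumption added for safety rather than something stated above. Reads of XZR are treated as creating no dependency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction: just its register reads/writes."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def can_dual_issue(first, second):
    """True if `second` may issue in the same cycle as the older `first`."""
    raw = first.writes & second.reads    # second needs first's result
    waw = first.writes & second.writes   # assumption: also block same-dest writes
    return not raw and not waw


mov = Instr("MOV X5, XZR", writes=frozenset({"X5"}))
cbz_x5 = Instr("CBZ X5, label", reads=frozenset({"X5"}))
cbz_x6 = Instr("CBZ X6, label", reads=frozenset({"X6"}))

print(can_dual_issue(mov, cbz_x5))  # branch consumes X5, which the mov produces
print(can_dual_issue(mov, cbz_x6))  # no shared registers
```

So under this model, whether the MOV can pair with the following branch comes down to whether the branch tests X5 or some other register.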

Of course this decision of what can be paired or not gets more complicated in a real design with asymmetric execution resources, more pipeline stages, multi-cycle instructions, by-pass networks, de-coupled fetch/execute, and speculative execution to name a few micro-architecture tricks of the trade.

If you are interested in finding out whether a commercial design can pair two particular types of instructions, you can look at the design's software optimization guide, if one is available, to understand what resources each instruction uses. For example, Arm publishes a Cortex-A55 Software Optimization Guide.