
Specifically, is:

mov %eax, %ds

Slower than

mov %eax, %ebx

Or are they the same speed? I've searched online, but have been unable to find a definitive answer.

I'm not sure if this is a silly question, but I think it's conceivable that modifying a segment register could make the processor do extra work.

N.B. I'm concerned with old x86 Linux CPUs, not modern x86_64 CPUs, where segmentation works differently.

Others
  • Yes, it's slower. Also, you can't load arbitrary values into segment registers in protected mode (in addition to them being 16 bits in size). The instruction set manual at least has hints that this indeed makes the CPU do a lot of work, possibly including memory accesses: _"moving a segment selector into a segment register automatically causes the segment descriptor information associated with that segment selector to be loaded into the hidden (shadow) part of the segment register. [...] The segment descriptor data is obtained from the GDT or LDT entry for the specified segment selector."_ – Jester Jul 03 '18 at 23:02
  • Refer to [Agner's tables](http://www.agner.org/optimize/instruction_tables.pdf) for timings. Generally speaking, a move to a segment register is about 10–20 times slower than a move between general purpose registers. – fuz Jul 03 '18 at 23:12
  • @fuz I looked but I couldn't find it in my copy at least. Ah, it's not listed for all processors apparently. – Jester Jul 03 '18 at 23:16
  • @Jester It's `mov r,sr` resp. `mov m,sr`, `mov sr,r`, and `mov sr,m`. Seems to be there for most architectures. – fuz Jul 03 '18 at 23:18
  • Not all, looks like only for the old ones. For example, only AMD K7-10 have it, any other AMD doesn't. Or I am blind :) – Jester Jul 03 '18 at 23:23
  • There are x86 CPUs but not *x86 linux cpus*. Even in x86_64 CPUs, segmentation works the same in 32-bit and 16-bit mode – phuclv Jul 04 '18 at 03:29
  • I still remember Linus commented on `set_fs()` function when I was reading the early `Linux-0.11` source code: Since it is expensive to load a segment register, we try to avoid calling set_fs() unless we absolutely have to. – Li-Guangda Nov 20 '22 at 10:56

3 Answers


mov %eax, %ebx between general-purpose registers is one of the most common instructions. Modern hardware supports it extremely efficiently, often with special cases that don't apply to any other instruction. On older hardware, it's always been one of the cheapest instructions.

On Ivybridge and later, it doesn't even need an execution unit and has zero latency; it's handled in the register-rename stage (see Can x86's MOV really be "free"? Why can't I reproduce this at all?). Even on earlier CPUs, it's 1 uop for any ALU port (so typically 3 or 4 per clock throughput).

On AMD Piledriver / Steamroller, mov r32,r32 and r64,r64 can run on AGU ports as well as ALU ports, giving it 4 per clock throughput vs. 2 per clock for add, or for mov on 8 or 16-bit registers (which have to merge into the destination).


mov to a segment reg is a fairly rare instruction in typical 32- and 64-bit code. It is part of what kernels do for every system call (and probably interrupts), though, so making it efficient speeds up the fast path for system-call- and I/O-intensive workloads. So even though it appears in only a few places, it can run a fair amount. But it's still of minor importance compared to mov r,r!

mov to a segment reg is slow: it triggers a load from the GDT or LDT to update the descriptor cache, so it's microcoded.

This is the case even in x86-64 long mode; the segment base/limit fields in the GDT entry are ignored, but it still has to update the descriptor cache with other fields from the segment descriptor, including the DPL (descriptor privilege level) which does apply to data segments.
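As a conceptual aid, the descriptor-cache update described above (and in the SDM quote in the comments) can be sketched in C. This is purely illustrative — the struct layout, the `gdt`/`ldt` arrays, and the simplified data-segment privilege check are my own stand-ins for CPU-internal state, not any real interface, and real hardware performs more checks than this:

```c
/* Conceptual sketch ONLY: roughly what the microcode for `mov %eax, %ds`
 * must do in protected mode. All names here are hypothetical stand-ins. */
#include <stdbool.h>
#include <stdint.h>

struct descriptor {
    uint32_t base, limit;
    uint8_t  dpl, type;
    bool     present;
};

struct seg_reg {
    uint16_t selector;
    struct descriptor cache;   /* the hidden (shadow) part */
};

static struct descriptor gdt[8192], ldt[8192];

/* Returns false where real hardware would raise #GP or #NP. */
bool load_seg(struct seg_reg *sr, uint16_t sel, unsigned cpl)
{
    unsigned index   = sel >> 3;
    bool     use_ldt = sel & 4;    /* TI bit: GDT vs LDT */
    unsigned rpl     = sel & 3;

    /* The descriptor-table read: the extra memory access that a
     * plain mov r,r never has to do. */
    struct descriptor d = use_ldt ? ldt[index] : gdt[index];

    if (!d.present)
        return false;              /* #NP */
    if (d.dpl < cpl || d.dpl < rpl)
        return false;              /* #GP (data-segment check, simplified) */

    sr->selector = sel;
    sr->cache    = d;              /* update the hidden descriptor cache */
    return true;
}
```

Even this stripped-down version makes the point: a segment-register write involves a table lookup plus validity and privilege checks, where a register-register mov involves none of that.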


Agner Fog's instruction tables list uop counts and throughput for mov sr, r (Intel syntax: mov to segment reg) for Nehalem and earlier CPUs. He stopped testing seg regs for later CPUs because it's obscure and not used by compilers (or by humans optimizing by hand), but the counts for SnB-family are probably somewhat similar. (InstLatx64 doesn't test seg regs either, e.g. not in this Sandybridge instruction-timing test.)

MOV sr,r on Nehalem (presumably tested in protected mode or long mode):

  • 6 fused-domain uops for the front end
  • 3 uops for ALU ports (p015)
  • 3 uops for the load port (p2)
  • throughput: 1 per 13 cycles (for repeating this instruction thousands of times in a giant loop). IDK if the CPU renames segment regs. If not, it might stall later loads (or all later instructions?) until the descriptor caches were updated and the mov to sr instruction retires. i.e. I'm not sure how much impact this would have on out-of-order execution of surrounding code.

Other CPUs are similar:

  • PPro/PII/PIII (original P6): 8 uops for p0, no throughput listed. 5 cycle latency. (Remember this uarch was designed before its 1995 release, when 16-bit code was still common. This is why P6-family does partial-register renaming for integer registers, with AL and AH separate from AX.)

  • Pentium 4: 4 uops + 4 microcode, 14c throughput.

    Latency = 12c in 16-bit real or vm86 mode, 24c in 32-bit protected mode. (12c is what he lists in the main table, so presumably his latency numbers for other CPUs are real-mode latencies, too, where writing a segment reg just sets base = sreg<<4.)

    Reading a segment reg is slow on P4, unlike other CPUs: 4 uops + 4 microcode, 6c throughput

  • P4 Prescott: 1 uop + 8 microcode. 27c throughput. Reading a segment reg = 8c throughput.

  • Pentium M: 8 uops for p0, same as PIII.

  • Conroe/Merom and Wolfdale/Penryn (first and second-gen Core2): 8 fused-domain uops, 4 ALU (p015), 4 load/AGU (p2). One per 16 cycles throughput, the slowest of any CPU where Agner tested it.

  • Skylake (my testing reloading them with the value I read outside the loop): in a loop with just dec/jnz: 10 fused-domain uops (front-end), 6 unfused-domain (execution units). one per 18c throughput.

    In a loop writing to 4 different seg regs (ds/es/fs/gs) all with the same selector: four mov per 25c throughput, 6 fused/unfused domain uops. (Perhaps some are getting cancelled?)

    In a loop writing to ds 4 times: one iter per 72c (one mov ds,eax per 18c). Same uop count: ~6 fused and unfused per mov.

    This seems to indicate that Skylake does not rename segment regs: a write to one has to finish before the next write can start.

  • K7/K8/K10: 6 "ops", 8c throughput.

  • Atom: 7 uops, 21c throughput

  • VIA Nano 2000/3000: unlisted uops, 20 cycles throughput and latency. Nano 3000 has 0.5 cycle throughput for reading a seg reg (mov r, sr). No latency listed, which is weird. Maybe he's measuring seg-write latency in terms of when you can use it for a load, like mov eax, [ebx] / mov ds, eax in a loop?

Weird Al was right, It's All About the Pentiums

In-order Pentium (P5 / PMMX) had cheaper mov-to-sr: Agner lists it as taking ">= 2 cycles", and non-pairable. (P5 was in-order 2-wide superscalar with some pairing rules on which instructions could execute together). That seems cheap for protected mode, so maybe the 2 is in real mode and protected mode is the greater-than? We know from his P4 table notes that he did test stuff in 16-bit mode back then.


Agner Fog's microarch guide says that Core2 / Nehalem can rename segment registers (Section 8.7 Register renaming):

All integer, floating point, MMX, XMM, flags and segment registers can be renamed. The floating point control word can also be renamed.

(Pentium M could not rename the FP control word, so changing the rounding mode blocks OoO exec of FP instructions. e.g. all earlier FP instructions have to finish before it can modify the control word, and later ones can't start until after. I guess segment regs would be the same but for load and store uops.)

He says that Sandybridge can "probably" rename segment regs, and Haswell/Broadwell/Skylake can "perhaps" rename them. My quick testing on SKL shows that writing the same segment reg repeatedly is slower than writing different segment regs, which indicates that they're not fully renamed. It seems like an obvious thing to drop support for, because they're very rarely modified in normal 32 / 64-bit code.

And each seg reg is usually only modified once at a time, so multiple dep chains in flight for the same segment register are not very useful. (i.e. you won't see WAW hazards for segment regs in Linux, and WAR is barely relevant because the kernel won't use user-space's DS for any memory references in a kernel entry point. I think interrupts are serializing, but entering the kernel via syscall could maybe still have a user-space load or store in flight but not executed yet.)

In chapter 2, which explains out-of-order exec in general (all CPUs except P1 / PMMX), section 2.2 on register renaming says that "possibly segment registers can be renamed", but IDK if he means that some CPUs do and some don't, or if he's not sure about some old CPUs. He doesn't mention seg-reg renaming in the PII/PIII or Pentium-M sections, so I can't tell you about the old 32-bit-only CPUs you're apparently asking about. (And he doesn't have a microarch guide section for AMD before K8.)

You could benchmark it yourself if you're curious, with performance counters. (See Are loads and stores the only instructions that gets reordered? for an example of how to test for blocking out-of-order execution, and Can x86's MOV really be "free"? Why can't I reproduce this at all? for basics on using perf on Linux to do microbenchmarks on tiny loops.)
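If you just want a rough number without setting up perf counters, here's a minimal rdtsc-based sketch — my own code, assuming an x86-64 Linux user-space process and GCC/Clang inline asm. Reloading DS with the selector it already holds is harmless in user space, and rdtsc counts reference cycles, so treat the results as ballpark figures, not exact core-clock timings:

```c
/* Rough throughput comparison: back-to-back `mov %eax, %ds` vs. plain
 * register-register mov. Assumes x86-64 Linux + GCC/Clang inline asm. */
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Reference cycles per `mov sel, %ds`, reloading the current selector. */
double cycles_per_mov_to_ds(uint64_t iters)
{
    uint16_t sel;
    __asm__ volatile ("mov %%ds, %0" : "=r"(sel));  /* read DS once */
    uint64_t t0 = rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile ("mov %0, %%ds" : : "r"(sel));
    return (double)(rdtsc() - t0) / iters;
}

/* Reference cycles per plain mov r32,r32, for comparison. */
double cycles_per_mov_reg_reg(uint64_t iters)
{
    uint32_t v = 1;
    uint64_t t0 = rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile ("mov %0, %%ebx" : : "r"(v) : "ebx");
    return (double)(rdtsc() - t0) / iters;
}
```

If seg-reg writes really are microcoded as described above, the first number should come out an order of magnitude larger than the second; perf would additionally give you the uop counts and port breakdown.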


Reading a segment reg

mov from a segment reg is relatively cheap: it only modifies a GP register, and CPUs are good at writes to GP registers, with register-renaming etc. Agner Fog found it was a single uop on Nehalem. Fun fact, on Core2 / Nehalem it runs on the load port, so I guess that's where segment regs are stored on that microarchitecture.

(Except on P4: apparently reading seg regs was expensive there.)

A quick test on my Skylake (in long mode) shows that mov eax, fs (or cs or ds or whatever) is 2 uops, one of which only runs on port 1, and the other can run on any of p0156. (i.e. it runs on ALU ports). It has a throughput of 1 per clock, bottlenecked on port 1.

Not tested: interleaving with memory instructions, including cache-miss loads, to see if multiple dep chains could be in flight. So I really only tested throughput. If there's a throughput bottleneck other than the WAW hazard itself, that doesn't rule out tracking segment registers along with loads/stores. But that seems unlikely to be worth it for modern code: segment regs typically only change right before or after a privilege-level change that drains the out-of-order back end anyway, not mixed in with various loads/stores. Except maybe changing FS or GS base on context switches.


You normally only mess with FS or GS for thread-local storage, and you don't do it with mov to FS: you make a system call so the OS uses an MSR or wrfsbase to modify the segment base in the cached segment descriptor. (Or, if the OS allows it and the CPU supports it, you can use wrfsbase directly in user-space.)
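For example (my own sketch, x86-64 Linux specific): the FS base glibc uses for TLS is read and written through the arch_prctl(2) syscall rather than a mov to %fs. Reading it back is safe to try:

```c
/* x86-64 Linux only: query the FS base the kernel/glibc set up for TLS.
 * ARCH_GET_FS normally comes from <asm/prctl.h>; defined here as a
 * fallback in case that header isn't available. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_GET_FS
#define ARCH_GET_FS 0x1003
#endif

uint64_t get_fs_base(void)
{
    uint64_t base = 0;
    syscall(SYS_arch_prctl, ARCH_GET_FS, &base);
    return base;   /* glibc points FS at the thread's TLS block */
}
```

On any glibc-based x86-64 process this returns a nonzero pointer to the TLS block, even though the FS selector itself is typically 0 — the base lives in the hidden part of the register (via the FS_BASE MSR), not in the selector.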


N.B. I'm concerned with old x86 Linux CPUs, not modern x86_64 CPUs, where segmentation works differently.

You said "Linux", so I assume you mean protected mode, not real mode (where segmentation works completely differently). Probably mov sr, r decodes differently in real mode, but I don't have a test setup where I can profile with performance counters for real or VM86 mode running natively.

FS and GS in long mode work basically the same as in protected mode; it's the other seg regs that are "neutered" in long mode. I think Agner Fog's Core2 / Nehalem numbers are probably similar to what you'd see on a PIII in protected mode, since they're part of the same microarchitecture family. I don't think we have a useful number for P5 Pentium segment register writes in protected mode.

(Sandybridge was the first of a new family derived from P6-family with significant internal changes, and some ideas from P4 implemented a different (better) way, e.g. SnB's decoded-uop cache is not a trace cache. But more importantly, SnB uses a physical register file instead of keeping values right in the ROB, so its register renaming machinery is different.)

Peter Cordes
  • re "mov to a segment reg is a fairly rare instruction": Yes it maybe rare in application code. But it would be interesting to know how often modern OSs read from and write to segment registers. It may not be that rare and probably depends on the dynamic behavior of the system. – Hadi Brais Jul 04 '18 at 13:42
  • @HadiBrais: That's what the rest of that paragraph says :P The first version of my answer just said it was rare and unimportant, but then I remembered that kernel use it in the entry / exit paths. And BTW, just updated with testing on Skylake. I got curious. Looks like SKL does *not* rename seg regs, because writing DS repeatedly is slower than writing DS/ES/FS/GS :) – Peter Cordes Jul 04 '18 at 13:48
  • Thanks for putting all of that info in one place and for the tests. – Hadi Brais Jul 04 '18 at 13:51
  • Your answer inspired me to update my answer to a [related question](https://stackoverflow.com/questions/49811461/why-segmentation-cannot-be-completely-disable). – Hadi Brais Jul 04 '18 at 13:53
  • Amazing answer. I appreciate the link to Fogs tables, they’re a great resource! I’ve accepted your answer—I’m blown away by its completeness! – Others Jul 04 '18 at 17:18
  • @Others: thanks. Understanding out-of-order execution (see Agner's microarch guide and optimization guide) is essential reading for *understanding* the tables, and what to do with uop counts. e.g. "throughput" numbers for a specific instruction are usually only relevant if it's worse than the front-end or execution-port pressure from its uops, because real code has a mix of instructions for different ports. – Peter Cordes Jul 04 '18 at 18:57

To add to what Peter said, a move between registers is just a case of changing the RAT pointer of the destination architectural register to point at the source's physical register when using the PRF scheme of Sandy Bridge onwards, so there is no execution unit involved.

A move to a segment register is about 8 uops from the microsequencer. It also has a reciprocal throughput of 14 cycles on Nehalem, which implies a pipeline flush occurs and that it probably runs as a microcode assist. The microcode routine contains a memory load of the descriptor, with a dedicated descriptor register as the destination in the RS (Reservation Station).

Moving to a segment register could be handled by a rename mechanism. The segment register could be renamed along with the descriptor and then a load from a logical address results in the descriptor being copied in the reservation station as a source as well as the offset register and is handled by an execution port with an AGU. This would potentially be wasteful in that the RS would have to have a descriptor field for every entry, where the DS segment would be read and copied into the RS identically for every entry. There are Intel patents that discuss this. There are suggestions that the RS can also have a separate entry for a segment register source or destination as well as a descriptor source or destination.

Alternatively, a move to a segment register can simply flush and serialise the pipeline, ensuring that all memory operations in the out-of-order core use the correct segment descriptor. This must happen for a change of the CS segment in a far call, because the decode stage depends on fields of the descriptor for memory and operand sizes. For a mov, the AGU could read directly from the segment descriptor based on the segment override in the opcode field rather than having to read a renamed descriptor from the RS. A far jump may actually be handled in line by the MSROM rather than at retirement, because predictions are not made for far jumps and they always mispredict as not-taken; this has the effect that the decoder sees the updated CS, as the CS and CS descriptor write completes before the pipeline is resteered to the correct linear address.

A load from a segment register is apparently not done by changing the RAT pointer; uops actually execute, suggesting that segment and integer registers have separate dedicated registers for rename. I would guess that they and control registers can't be renamed and have a single dedicated register that renames sources only.

Lewis Kelsey
  • `mov`-elimination is new in IvB, not first-gen SandyBridge. It also doesn't succeed 100% of the time, e.g. for back-to-back dependent mov instructions. [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342) has some more info. But yes, it's very cheap, and usually eliminated on modern Intel and AMD CPUs. – Peter Cordes Feb 01 '21 at 05:25
  • @PeterCordes I haven't looked into it but I would assume a move to a 32 bit register can't be eliminated if the destination architectural register currently points to a 64 bit register because it needs to be zeroed – Lewis Kelsey Feb 01 '21 at 11:13
  • Intel at least tracks when the upper bytes of a reg are known zero. It can even eliminate `movzx ecx, al`. (And I don't think that requires AL == RAX). So I guess it can update an upper-zero status for each RAT entry, or something like that. I haven't carefully tested this with registers that have non-zero upper halves, though. – Peter Cordes Feb 01 '21 at 18:30
  • @PeterCordes I read a patent about an unlamination decoder that tracked zeroing uops and then removes the zero operation from a fused zero+move in a following instruction i.e a write to `eax` if the register is already zeroed – Lewis Kelsey Feb 07 '21 at 05:55

As the question mentioned old x86 CPUs, we can go all the way back to 1985, with the original 80386. Its manual gives clock cycle counts for all instructions.

  • movl %reg, %reg: 2 clocks

  • movw %reg, %sreg in real mode: 2 clocks

  • movw %reg, %sreg in protected mode: 18 clocks

So yes, a lot slower.

I think it's conceivable that modifying a segment register could make the processor do extra work.

The manual gives pseudocode for all the checks done when loading a segment register in protected mode, which takes up about a full printed page. Extra work, definitely.

Nate Eldredge