2

I have been programming some stuff on 16-bit DOS recently for fun. I see a lot of people mentioning that far pointers are slower than near pointers and to avoid them.

Why?

From an assembly point of view, this makes sense. There are several extra instructions involved. You have to save the old value of a segment register, load the new segment value into a general-purpose register, then move that into DS or ES (mov with an immediate operand into a segment register isn't a valid encoding on the 8086). You can then do whatever you need to do in that segment. Afterwards, you have to restore the old value.

It sounds like a lot, but in reality this doesn't eat up a lot of cycles. I guess if every pointer you were using were in a different segment this could add up, but data is usually grouped. So unless you were bouncing all over the place, which is slow for other reasons, the penalty shouldn't be that bad. If you have to hit DRAM, that should dominate the cost, right?

I feel like there is more to this story and I am having a hard time tracking it down. Hoping an 8086 wizard is hanging around who remembers this stuff.

To clarify: I am interested in actual 16-bit processors like the 8086 and the 80286 in real mode.

Karl Strings
  • 3
    If you group your data, then the first pointer is far (the group) and the rest are based (relative to the group). So you avoided the slow far pointers by converting most of them to near. – Raymond Chen Jul 28 '22 at 04:17
  • 3
    *If you have to hit DRAM, that should dominate the cost, right?* - memory access on 8086 was the major factor in performance, but it didn't have a cache so locality is irrelevant, and the bus was only a word wide. (A byte on 8088, which was more common.) So loading a far pointer from memory cost 2 bus cycles (assuming it's aligned). And every instruction has to get fetched from memory, so optimizing for code-size was one of the most important things for speed. More / larger instructions cost speed. Re: 8086 performance, see [this Q&A](//stackoverflow.com/a/67403962) for insn tables + more – Peter Cordes Jul 28 '22 at 05:03
  • 2
    Your assessment is reasonable for far data that is grouped in the same segment. I will note that if you are programming in a higher-level language (C) and write programs that use HUGE pointers, there is a very significant cost when referencing memory. – Michael Petch Jul 28 '22 at 05:29
  • 2
    Are you asking about them being slow on CPUs relevant at the time, like 8086 up to maybe 486 or P5 Pentium? Or are you asking about them being slow on modern CPUs? Timing numbers may be harder to find on `mov Sreg, r/m` for real mode on modern CPUs, because the cost is probably different from protected mode (where it triggers the CPU to index the GDT). Also, some CPUs rename segment registers, others don't. In which case mov to Sreg has to at worst drain the ROB, or at least stall before issuing the next instruction that uses that Sreg (and thus all later). – Peter Cordes Jul 28 '22 at 07:49
  • 1
    [Is a mov to a segmentation register slower than a mov to a general purpose register?](https://stackoverflow.com/q/51163779) has some details from Agner Fog: Core2 and Nehalem (P6-family) rename segment registers. My testing shows evidence that Skylake doesn't. (That makes sense: when P6 was new and being designed, Pentium Pro in 1995, 16-bit code was still somewhat relevant. When Sandybridge was new, a decade and a half later, 16-bit was long obsolete.) – Peter Cordes Jul 28 '22 at 07:51
  • @Peter Cordes - I am more interested in actual 16-bit CPUs, like the 8086 and the 80286 – Karl Strings Jul 28 '22 at 13:46

1 Answer

3

Why are far pointers slow?

Segment register loads are too complex for the core's front-end to convert directly into micro-ops. Instead they're "emulated" by micro-ops stored in a small ROM. This sequence begins with some branches (which CPU mode is it?), and typically those branches can't benefit from the CPU's branch prediction, causing stalls.

To avoid repeated segment register loads (e.g. when the same far pointer is used multiple times), software tends to keep values in more segment registers (e.g. ES, FS and GS), which adds segment override prefixes to instructions. These extra prefixes can also slow down instruction decoding.

I guess if every pointer you were using were in a different segment this could add up, but data is usually grouped.

Compilers aren't that smart. If a small piece of code uses 4 far pointers that all happen to use the same segment, the compiler won't know that they're all in the same segment and will do expensive segment register loads regardless. To work around that you could describe the data as a structure (e.g. 1 pointer to a structure that has 4 fields instead of 4 different pointers); but that requires the programmer to write software differently.

For example, if you do something like `int foo(void) { int a; return bar(&a); }`, then a far pointer built from SS (plus the offset of `a`) will probably be passed on the stack, and the callee (`bar`) will load its segment half into another segment register, because `bar()` has to assume that the pointer could point anywhere.

The other problem is that sometimes data is larger than a segment (e.g. an array of 100000 bytes that doesn't fit in a 64 KiB segment), so someone (the programmer or the compiler) has to calculate and load a different segment to access (parts of) the same data. All pointer arithmetic may need to take this into account: a trivial-looking `pointer++;` may become something more like `offset++; segment += offset >> 4; offset &= 0x000F;`, which causes a segment register load.

If you have to hit DRAM, that should dominate the cost, right?

In real mode you're limited to about 640 KiB of RAM, and the caches in modern CPUs are typically much larger, so you can expect that every memory access is going to be a cache hit. In some cases (e.g. Cascade Lake CPUs with 1 MiB of L2 cache per core) you won't even use the L3 cache (it'll be all L2 hits).

You can also expect that a segment register load is more expensive than a cache hit.

Hoping an 8086 wizard is hanging around who remembers this stuff.

When people say "segmentation is awful and should be avoided" they're not thinking of a CPU that's been obsolete for 40 years (8086), they're thinking of CPUs that were relevant during this century. Some of them may also be thinking of more than just performance alone (especially for assembly language programmers, segmentation is an annoyance/extra burden).

Brendan