
I'm trying to optimize a dispatch branch (something like a switch...case) as far as it will go, to emulate an X CPU on an x86 CPU. My idea: in memory I'll load blocks of x86 opcodes with a fixed length of 0x100 bytes each, like this:

first block 
0
...[my code, jump at 0x10000, nop nop nop until 0x9F...]
0x9F
second block 
0x100
...
0x19F
third block
0x200
...
0x29F
...
etc, etc
...
0x10000

which will be finite, start at memory $0 (+ maybe some offset) and end at $0xFFFF (like a "ROM" of size 0x10000). Now, each time an X CPU opcode is fetched for emulation, I'll shift it left by 8 bits and jump to that location, execute the block, and then continue my program flow normally. My questions are: 1) Is it even possible to pack those opcode blocks this tightly? 2) Was this a common practice in the past?
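For concreteness, here is a minimal sketch of the scheme I have in mind, written in C for a POSIX system. `emit_stub` and the stub contents are hypothetical placeholders (real blocks would contain hand-written x86 emulation code), and hardened systems with W^X page policies would need extra `mprotect` handling:

```c
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define BLOCK_SIZE 0x100              /* one block per emulated opcode  */
#define ARENA_SIZE (256 * BLOCK_SIZE) /* 0x10000 bytes: the whole "ROM" */

static uint8_t *arena;

/* Hypothetical: fill one block with the machine code that emulates
 * `opcode`. Here it is just NOP padding ending in RET. */
static void emit_stub(uint8_t *dst, uint8_t opcode)
{
    (void)opcode;
    memset(dst, 0x90, BLOCK_SIZE);    /* 0x90 = x86 NOP                  */
    dst[BLOCK_SIZE - 1] = 0xC3;       /* 0xC3 = RET, back to dispatcher  */
}

int build_arena(void)
{
    arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE | PROT_EXEC,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED)
        return -1;
    for (int op = 0; op < 256; op++)
        emit_stub(arena + ((size_t)op << 8), (uint8_t)op);
    return 0;
}

/* "Shift left by 8 and jump there": call the block as a function. */
static inline void dispatch(uint8_t opcode)
{
    void (*block)(void) = (void (*)(void))(arena + ((size_t)opcode << 8));
    block();
}
```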

venge
  • _"Is this even possible to be so tight with those opCode blocks?"_ That's kind of hard for us to know. Anyway, a switch/case with cases going from 0..255 is likely to be optimized by the compiler to an indirect jump with the case number as an index into a jump table. You can study the assembly output from your compiler to see if you think it's worth trying to hand-optimize anything. – Michael Jun 03 '15 at 11:12
  • While this is certainly possible, a compiler would just keep a table of jump addresses, index into it with the switch statement, and jump indirectly. Your way *might* be somewhat faster, but it wastes memory (unless your stubs actually need the 0x100 bytes), and the *real* cost is the jump anyway, since it needs jump-target prediction, which tends to suck. – EOF Jun 03 '15 at 11:13
  • yeah, they surely need all of the 0x100-byte blocks from 0 to 0x10000, and it's merely 64 KB of fixed code, since it acts "like a ROM". But what prediction? It's all determined, since they are fixed blocks with the same length. But OK. Now I see I have misinformed everyone; I will edit the question. It's "LIKE" a switch...case. – venge Jun 03 '15 at 13:25
  • Also, I'm not even sure if a compiler is needed for this. The compiler would certainly optimize out stuff, and ruin the 0x100 length of the blocks, destroying the correct memory offsets. I think this is programming at its purest form. – venge Jun 03 '15 at 13:43
  • "Programming at its purest form?" That's one way to describe it. But most of us would call it silly, or worse. You're wasting a whole lot of memory for little to no performance gain, and you're going to get a very large number of page faults. You could do this instead with a lookup table of 256 entries that point to the functions to be executed (see the sketch just after these comments). That's going to require one more memory access than your shift-and-jump, but overall it will use much less memory and incur considerably fewer page faults. It's also much clearer code. – Jim Mischel Jun 03 '15 at 21:03
  • ah, I guess you are right... Thank all of you guys for clarifying this. – venge Jun 03 '15 at 22:17
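A minimal sketch of the 256-entry lookup table suggested in the comments above; the handler names and opcode values are hypothetical:

```c
#include <stdint.h>

typedef void (*op_handler)(void);

static void op_nop(void)  { /* emulate NOP           */ }
static void op_load(void) { /* emulate a load, etc.  */ }
static void op_bad(void)  { /* unknown/unused opcode */ }

/* Designated initializers; every entry left out stays NULL. */
static op_handler dispatch_table[256] = {
    [0x00] = op_nop,
    [0x3E] = op_load,   /* e.g., Z80 LD A,n */
    /* ... remaining opcodes ... */
};

static inline void step(uint8_t opcode)
{
    op_handler h = dispatch_table[opcode];
    /* One extra memory access compared to shift-and-jump, but
     * ~2 KB of table instead of 64 KB of padded code blocks. */
    (h ? h : op_bad)();
}
```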

1 Answer


If you are branching across 256 opcodes through a switch block, you're going to be doing an indirect jump, which the CPU cannot predict well, and that will get you a pipeline break on every opcode.

If the work to emulate an opcode is of fair size, then this pipeline break may not matter a lot. I suspect it does matter: a "load register" opcode is simulated by essentially just a memory read, which isn't a lot of work.

You might buy some visible improvement by adding special tests just before the switch block that check for the two or three most frequent opcodes (probably LOAD, CMP, conditional JMP): [if opcode == JMP then ...]. These tests the CPU can typically predict well. If you do this, measure, measure, measure.
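A hedged sketch of what these hoisted tests might look like; the opcode values and handlers here are hypothetical:

```c
#include <stdint.h>

enum { OP_LOAD = 0x3E, OP_CMP = 0xFE, OP_JCC = 0x20 }; /* hypothetical values */

static void do_load(void) { /* ... */ }
static void do_cmp(void)  { /* ... */ }
static void do_jcc(void)  { /* ... */ }
static void do_rest(uint8_t op) { switch (op) { default: break; /* ... */ } }

static void step(uint8_t op)
{
    /* Ordinary compare-and-branch tests, which the CPU predicts far
     * better than a single indirect jump through a 256-entry table. */
    if (op == OP_LOAD) { do_load(); return; }
    if (op == OP_CMP)  { do_cmp();  return; }
    if (op == OP_JCC)  { do_jcc();  return; }
    do_rest(op);  /* the big switch, now only for less common opcodes */
}
```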

A sleazier trick is to amortize the cost of the pipeline break across multiple instructions, if you can. If the machine has a lot of single-byte opcodes, you might consider doing a 65536-way branch across the next two opcode bytes. (Now you have to code a lot of switch cases, but many of them are essentially the same. Wonder if your compiler can handle it?) In the abstract, this cuts the pipeline-break cost by a factor of two. For a specific machine with a very regular instruction set, this may not buy you a lot.
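A sketch of that two-byte amortization, assuming mostly single-byte opcodes; the table would have to be populated for all 65536 pairs, most likely by generated code:

```c
#include <stdint.h>

typedef void (*pair_handler)(void);

/* One entry per two-opcode pair; with 64-bit pointers this table
 * alone is 512 KB, so it has to earn its keep. */
static pair_handler pair_table[65536];

static void step_pair(const uint8_t *pc)
{
    /* One unpredictable indirect jump now retires two emulated
     * instructions, halving the per-instruction misprediction cost. */
    uint16_t pair = (uint16_t)((pc[0] << 8) | pc[1]);
    pair_table[pair]();
}
```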

However, you may not have a lot of single-byte opcodes; instead you may need to decode one or more bytes for each instruction. The x86 is like this (prefix, opcode, ModRM, SIB, offset...). The big switch block should still work pretty well for this.

It is probably good to align each switch case on a cache-line boundary. If the instruction emulation is simple, the code will likely fit in one cache line, so memory sees only a single fetch for it. If you don't do this, your instruction emulations have a higher chance of crossing a cache-line boundary, raising the memory-fetch cost to two. This may not matter for frequently executed instructions, but code for rarely executed instructions may fall out of the cache, and the alignment will help when you actually encounter one of these.
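Standard C gives no way to align individual switch cases, but the same effect can be had with a table of handler functions that are explicitly aligned. A GCC/Clang-specific sketch (the handlers are hypothetical); `-falign-functions=64` does much the same thing globally:

```c
/* Ask the compiler to start each handler on a 64-byte cache line. */
#define CACHE_ALIGNED __attribute__((aligned(64)))

static void CACHE_ALIGNED op_nop(void)  { /* emulate NOP  */ }
static void CACHE_ALIGNED op_load(void) { /* emulate LOAD */ }

/* A short handler now occupies exactly one cache line instead of
 * possibly straddling two. */
```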

Last bit of advice: measure, measure, measure.

Ira Baxter
  • I'd actually try to keep the code as small as possible, to keep as much as possible in the cache. – Ross Ridge Jun 03 '15 at 14:14
  • If you have 65536 "cache lines" of opcode interpreter, at 64 bytes each, you use only about 4 MB. That's small enough that most of it will stay in the cache, leaving plenty of cache space on modern CPUs for other cachable items. – Ira Baxter Jun 03 '15 at 14:43
  • _the CPU cannot predict well_ - does the prediction problem come from the fact that I must find the first offset of the first block after compiling the whole emulator? Because even if the jump is indirect, in my mind it's fully determined (jump(opcode << 8 + offset)). I might be totally wrong though. – venge Jun 03 '15 at 14:43
  • @venge: The CPU can predict the location in the table, yes; it computes that directly from the jmp opcode. But it cannot predict the table entry, i.e., the effective target of the jmp, without fetching that entry, and that takes time. That's where the pipeline break comes from. – Ira Baxter Jun 03 '15 at 14:50
  • _You might buy some visible improvement by adding special tests just before the switch block that check for the two or three most frequent opcodes_ - I called an empty switch...case compiled with gcc 17000 times and it ran in 92 ns. Then, just adding a branch for a halt-emulation check, it went to 115 ns. Adding extra checks would certainly speed things up, as long as I stumble upon common instructions (add, ld, pop, etc.). Otherwise it can turn against me. – venge Jun 03 '15 at 14:51
  • @Ira I would try to keep the entire interpreter loop in the L1 cache. That should be easily achievable with a 64k L1 cache (which is as big as they get) if you use a simple jump table, but not with a blocking scheme. – Ross Ridge Jun 03 '15 at 14:56
  • Why can't the first jump entry just be an effective opcode? At least on the Zilog Z80, the next effective instruction was the target of the jump. – venge Jun 03 '15 at 14:58
  • I think many machines predict "next instruction" on a jmp indirect. If the next instruction was your most common case, that might be helpful. – Ira Baxter Jun 03 '15 at 15:21
  • IDK if the common-case branches would really be well predicted. I mean, unless load instructions come in a pattern, like every-3rd-insn-is-a-load, branch prediction is going to do poorly. `push`/`pop` tends to come in blocks, so it might be a good candidate for this treatment, though. You could also check for `jcc` as the next insn after `test` / `cmp`, as that would be the common case. (But only if you can do it in code that only runs for `test`/`cmp`. Don't add a branch to the main path to check whether the last insn was one of those.) – Peter Cordes Jul 02 '15 at 13:38
  • @PeterCordes: It turns out that when you examine instruction streams, that most instructions turn out to be LOAD/CMP/JC. So, every third instruction *is* a load :-} I like your idea of using the previously executed instruction to control the checks for the next common one (e.g., cmp is almost always followed by jc, as you observed). This code can be put into the switch block for the "previously" executed instruction. OP can measure instruction pair frequencies to find out what is common, using a baseline undecorated switch. – Ira Baxter Jul 02 '15 at 14:18
  • Not just 3rd-on-average, but literally every third, with enough regularity for a branch predictor to hit > 70% or so? Beware the danger of overdoing it, and using up branch-predictor entries if the benefit is low. Most `case` entries probably have branches. – Peter Cordes Jul 02 '15 at 14:26
  • And yeah, tacked on to the end of the switch block for `cmp` is what I was picturing; a sketch follows below. The difficulty with that is that you'd need to replicate any outer-loop code that sets up for the next instruction. (Or put it in a function that can be called from before the switch, or from the end of some cases.) – Peter Cordes Jul 02 '15 at 14:28
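A sketch of that pairing idea from the comments, with hypothetical opcode values and handlers: inside the `cmp` case, peek at the next opcode and handle a following `jcc` immediately, skipping one trip through the unpredictable indirect jump:

```c
#include <stdint.h>

enum { OP_CMP = 0xFE, OP_JCC = 0x20 };  /* hypothetical opcode values */

static uint8_t  code[0x10000];          /* emulated program memory */
static uint16_t pc;

static void do_cmp(void) { /* ... */ }
static void do_jcc(void) { /* ... */ }

static void step(void)
{
    uint8_t op = code[pc++];
    switch (op) {
    case OP_CMP:
        do_cmp();
        /* cmp is almost always followed by a conditional jump, so this
         * branch predicts well and retires a second instruction without
         * a second indirect jump. */
        if (code[pc] == OP_JCC) { pc++; do_jcc(); }
        break;
    /* ... other cases ... */
    default:
        break;
    }
}
```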