
I am developing a simple VM and I am at a crossroads.

My initial goal was to use byte-long instructions, and therefore a small dispatch loop and quick computed-goto dispatch.
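
For reference, this is the kind of dispatch I mean: a minimal sketch using GCC's "labels as values" extension, with two made-up opcodes rather than the VM's real set.

    // Minimal sketch of computed-goto dispatch (GCC "labels as values" extension);
    // the opcodes here are invented, not the VM's actual instruction set.
    #include <cstdint>

    int run(const std::uint8_t* code)
    {
        static void* dispatch[] = { &&op_halt, &&op_inc };   // one entry per opcode
        const std::uint8_t* ip = code;
        int acc = 0;

        goto *dispatch[*ip];            // jump straight to the first handler

    op_inc:
        ++acc;
        ++ip;
        goto *dispatch[*ip];            // each handler dispatches the next opcode itself
    op_halt:
        return acc;
    }

With opcode 0 meaning "halt" and 1 meaning "increment", running it over {1, 1, 1, 0} returns 3.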

However, it turns out reality could not be further from that: 256 opcodes are nowhere near enough to cover signed and unsigned 8-, 16-, 32- and 64-bit integers, floats and doubles, pointer operations, and the different combinations of addressing. One option was to not implement bytes and shorts, but the goal is to make a VM that supports the full C subset as well as vector operations, since those are pretty much everywhere anyway, albeit in different implementations.

So I switched to 16-bit opcodes, which also lets me add portable SIMD intrinsics and more compiled common routines that really save on performance by not being interpreted. There is also caching of global addresses: they are initially compiled as base-pointer offsets, and the first time such an instruction executes it overwrites the offset and the opcode, so that from then on dispatch goes straight to the direct-address version, at the cost of an extra instruction in the set for each way an instruction can use a global.
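
To make the caching idea concrete, here is a rough sketch of what the pair of handlers could look like in C++ (field and opcode names are invented, and a 32-bit build is assumed, as in the assembly below):

    // First execution: turn the gp-relative offset into an absolute address,
    // patch the opcode, then do the work; later executions take the short path.
    #include <cstdint>

    struct Instr {
        std::uint16_t opcode;
        std::uint16_t reg;    // destination offset into the register stack
        std::uint32_t addr;   // gp-relative offset at first, absolute address afterwards
    };

    extern std::uint8_t* gp;  // global data base pointer
    extern std::uint8_t* sp;  // register stack pointer
    extern Instr*        ip;  // current instruction

    enum : std::uint16_t { OP_ASSIGN_I8U_REG_GLOB_CACHE, OP_ASSIGN_I8U_REG_GLOB };

    void assign_i8u_reg_globCache()
    {
        ip->addr = static_cast<std::uint32_t>(
                       reinterpret_cast<std::uintptr_t>(gp + ip->addr)); // patch operand
        ip->opcode = OP_ASSIGN_I8U_REG_GLOB;                             // patch opcode
        sp[ip->reg] = *reinterpret_cast<std::uint8_t*>(ip->addr);
        ++ip;                                                            // fixed-size Instr assumed
    }

    void assign_i8u_reg_glob()   // what runs on every later visit
    {
        sp[ip->reg] = *reinterpret_cast<std::uint8_t*>(ip->addr);
        ++ip;
    }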

Since I am not yet at the profiling stage, I am in a dilemma: are the extra instructions worth the added flexibility? Will having more instructions, and therefore not needing instructions that copy values back and forth, make up for the increased size of the dispatch loop? Keep in mind that each instruction is just a few assembly instructions, e.g.:

    .globl  __Z20assign_i8u_reg8_imm8v
    .def    __Z20assign_i8u_reg8_imm8v; .scl    2;  .type   32; .endef
__Z20assign_i8u_reg8_imm8v:
LFB13:
    .cfi_startproc
    movl    _ip, %eax
    movb    3(%eax), %cl
    movzbl  2(%eax), %eax
    movl    _sp, %edx
    movb    %cl, (%edx,%eax)
    addl    $4, _ip
    ret
    .cfi_endproc
LFE13:
    .p2align 2,,3
    .globl  __Z18assign_i8u_reg_regv
    .def    __Z18assign_i8u_reg_regv;   .scl    2;  .type   32; .endef
__Z18assign_i8u_reg_regv:
LFB14:
    .cfi_startproc
    movl    _ip, %edx
    movl    _sp, %eax
    movzbl  3(%edx), %ecx
    movb    (%ecx,%eax), %cl
    movzbl  2(%edx), %edx
    movb    %cl, (%eax,%edx)
    addl    $4, _ip
    ret
    .cfi_endproc
LFE14:
    .p2align 2,,3
    .globl  __Z24assign_i8u_reg_globCachev
    .def    __Z24assign_i8u_reg_globCachev; .scl    2;  .type   32; .endef
__Z24assign_i8u_reg_globCachev:
LFB15:
    .cfi_startproc
    movl    _ip, %eax
    movl    _sp, %edx
    movl    4(%eax), %ecx
    addl    %edx, %ecx
    movl    %ecx, 4(%eax)
    movb    (%ecx), %cl
    movzwl  2(%eax), %eax
    movb    %cl, (%eax,%edx)
    addl    $8, _ip
    ret
    .cfi_endproc
LFE15:
    .p2align 2,,3
    .globl  __Z19assign_i8u_reg_globv
    .def    __Z19assign_i8u_reg_globv;  .scl    2;  .type   32; .endef
__Z19assign_i8u_reg_globv:
LFB16:
    .cfi_startproc
    movl    _ip, %eax
    movl    4(%eax), %edx
    movb    (%edx), %cl
    movzwl  2(%eax), %eax
    movl    _sp, %edx
    movb    %cl, (%edx,%eax)
    addl    $8, _ip
    ret
    .cfi_endproc

This example contains the instructions to:

  • assign unsigned byte from immediate value to register
  • assign unsigned byte from register to register
  • assign unsigned byte from global offset to register, cache the address and change to the direct instruction
  • assign unsigned byte from global offset to register (the now-cached version of the previous one)
  • ... and so on...

Naturally, when I produce a compiler for it, I will be able to test the instruction flow in production code and optimize the arrangement of the instructions in memory to pack together the frequently used ones and get more cache hits.

I just have a hard time figuring out whether such a strategy is a good idea: the bloat buys flexibility, but what about performance? Will more compiled routines make up for a larger dispatch loop? Is it worth caching global addresses?

I would also like someone decent in assembly to give an opinion on the quality of the code generated by GCC - are there any obvious inefficiencies and room for optimization? To make the situation clear: sp points to the stack that implements the registers (there is no other stack), ip is logically the current instruction pointer, and gp is the global pointer (not referenced directly, accessed as an offset).

EDIT: Also, this is the basic format I am implementing the instructions in:

    INSTRUCTION assign_i8u_reg16_glob() { // assign unsigned byte to reg from global offset
        FETCH(globalAddressCache);
        REG(quint8, i.d16_1) = GLOB(quint8);
        INC(globalAddressCache);
    }

FETCH returns a reference to the struct this instruction uses, selected by the opcode.

REG returns a reference to a register value of type T at the given offset.

GLOB returns a reference to a global value via a cached global offset (effectively an absolute address).

INC just increments the instruction pointer by the size of the instruction.

Some people will probably advise against the use of macros, but with templates it is much less readable. This way the code is pretty obvious.
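
Roughly, the macros boil down to something like this (a simplified sketch rather than the exact definitions; the struct layout is just an example):

    // Simplified sketch of the macro layer (not the exact definitions).
    #include <QtGlobal>   // quint8, quint16, quint32, quintptr

    extern quint8* ip;    // current instruction pointer
    extern quint8* sp;    // register stack pointer

    struct globalAddressCache {   // 16-bit opcode, 16-bit register offset, 32-bit address
        quint16 opcode;
        quint16 d16_1;
        quint32 addr;             // absolute address once cached
    };

    #define INSTRUCTION    void
    #define FETCH(T)       T& i = *reinterpret_cast<T*>(ip)
    #define REG(T, offset) (*reinterpret_cast<T*>(sp + (offset)))
    #define GLOB(T)        (*reinterpret_cast<T*>(quintptr(i.addr)))
    #define INC(T)         (ip += sizeof(T))

With these, the instruction above expands to plain pointer arithmetic on ip and sp, with no decoding beyond the initial dispatch on the opcode.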

EDIT: I would like to add a few points to the question:

  • I could go for a "register operations only" solution, which can only move data between registers and "memory", be that global or heap. In this case, every global and heap access has to copy the value, modify or use it, and move it back to update the original. This way I have a shorter dispatch loop, but a few extra interpreted instructions for every instruction that addresses non-register data. So the dilemma is: several times more native code with longer direct jumps, or several times more interpreted instructions with a shorter dispatch loop (see the sketch after this list). Will the short dispatch loop give me enough performance to make up for the extra, costly memory operations? Maybe the delta between the shorter and longer dispatch loop is not enough to make a real difference, in terms of cache hits and the cost of the assembly jumps?

  • I could go for additional decoding and only 8-bit-wide opcodes; however, this may add another jump: jump to wherever the instruction is handled, then waste time either jumping to the case where the particular addressing scheme is handled, or decoding the operands with a more complex execution method. In the first case the dispatch loop still grows, plus yet another jump is added. In the second, register operations can be used to decode the addressing, but a more complex instruction with more compile-time unknowns is needed to address anything. I am not really sure how this will stack up against a shorter dispatch loop; once again, I am uncertain how my "shorter and longer dispatch loop" relates to what is considered short or long in terms of assembly instructions, the memory they need and the speed of their execution.

  • I could go for the "many instructions" solution: the dispatch loop is a few times larger, but it still uses pre-computed direct jumps. Complex addressing is specific to and optimized for each instruction and compiled to native code, so the extra memory operations that would be needed by the "register only" approach are compiled in and mostly executed in machine registers, which is good for performance. Generally, the idea is to add more to the instruction set, but also to add to the amount of work that can be compiled in advance and done in a single "instruction". A longer instruction set also means a longer dispatch loop, longer jumps (although that can be optimized to minimize them) and fewer cache hits, but the question is BY HOW MUCH? Considering every "instruction" is just a few assembly instructions, is an assembly snippet of about 7-8k instructions considered normal, or too much? Considering the average instruction size is around 2-3 bytes, this should not be more than about 20 KB of memory, small enough to fit completely in most L1 caches. But this is not concrete math, just stuff I came up with while googling around, so maybe my "calculations" are off? Or maybe it doesn't work that way? I am not that experienced with caching mechanisms.
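
To illustrate the first option against the last one, the same operation, "add a 32-bit global to a register", would look roughly like this in the two designs (invented handler names, simplified operand layout):

    // Register-only ISA: two dispatched handlers and a copy through a scratch
    // register, versus a single fused handler in the "many instructions" ISA.
    #include <cstdint>

    struct Instr { std::uint16_t opcode, dst; std::uint32_t src; };  // simplified layout

    extern std::uint8_t* gp;   // global data
    extern std::uint8_t* sp;   // register stack
    extern Instr*        ip;

    // register-only: load the global into a scratch register, then add reg-to-reg
    void load_glob_i32()    { *(std::int32_t*)(sp + ip->dst) = *(std::int32_t*)(gp + ip->src); ++ip; }
    void add_reg_reg_i32()  { *(std::int32_t*)(sp + ip->dst) += *(std::int32_t*)(sp + ip->src); ++ip; }

    // "many instructions": the same work in one trip through the dispatch loop
    void add_reg_glob_i32() { *(std::int32_t*)(sp + ip->dst) += *(std::int32_t*)(gp + ip->src); ++ip; }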

To me, as I currently weigh the arguments, the "many instructions" approach appears to have the best chance of giving the best performance, provided of course my theory about fitting the "extended dispatch loop" in the L1 cache holds. So here is where your expertise and experience come into play. Now that the context is narrowed and a few supporting ideas are presented, maybe it will be easier to give a more concrete answer on whether the benefits of a larger instruction set outweigh the increase in native code size, by decreasing the amount of slower, interpreted code.

My instruction size data is based on those stats.

Seki

5 Answers


You might want to consider separating the VM's ISA from its implementation.

For instance, in a VM I wrote I had a "load value direct" instruction. The next value in the instruction stream wasn't decoded as an instruction, but loaded as a value into a register. You can consider this one macro instruction or two separate values.

Another instruction I implemented was a "load constant value", which loaded a constant from memory (using a base address for the table of constants plus an offset). A common pattern in the instruction stream was therefore load value direct (index); load constant value. Your VM implementation may recognize this pattern and handle the pair with a single optimized implementation.
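
A sketch of what that fusion could look like (names invented, single-register model for brevity):

    // If the loader (or dispatcher) spots LOAD_DIRECT immediately followed by
    // LOAD_CONST, it can substitute one fused handler and skip a dispatch.
    #include <cstdint>

    extern std::int64_t*  constants;  // table of constants
    extern std::int64_t   reg;        // single working register
    extern std::uint16_t* ip;

    void load_direct()       { reg = ip[1]; ip += 2; }             // next slot is a value, not an opcode
    void load_const()        { reg = constants[reg]; ip += 1; }    // use the register as an index
    void load_const_direct() { reg = constants[ip[1]]; ip += 2; }  // fused form of the pair above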

Obviously, if you have enough bits, you can use some of them to identify a register. With 8 bits it may be necessary to have a single register for all operations. But again, you could add a "with register X" instruction which modifies the next operation. In your C++ code, that instruction would merely set the currentRegister pointer which the other instructions use.
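
For instance (invented names):

    // The "with register X" instruction only retargets currentRegister; the
    // operations that follow keep their single-register encoding.
    #include <cstdint>

    extern std::int64_t  registers[16];
    extern std::int64_t* currentRegister;   // points at registers[0] by default
    extern std::uint8_t* ip;

    void with_register() { currentRegister = &registers[ip[1]]; ip += 2; }                 // modifies the next op
    void add_imm()       { *currentRegister += static_cast<std::int8_t>(ip[1]); ip += 2; } // implicitly uses it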

MSalters

Will more compiled routines make up for a larger dispatch loop?

I take it you didn't fancy having single-byte instructions with a second byte of extra opcode for certain instructions? I think a decode for 16-bit opcodes may be less efficient than 8-bit + extra byte(s), assuming the extra byte(s) aren't too common or too difficult to decode in themselves.

If it were me, I'd work on getting the compiler (not necessarily a full-fledged compiler with "everything", but a basic model) going with a fairly limited set of "instructions". Keep the code generation part fairly flexible so that it'll be easy to alter the actual encoding later. Once you have that working, you can experiment with various encodings and see what the result is in performance, and other aspects.

A lot of your minor question points are very hard to answer for anyone who hasn't tried both choices. I have never written a VM in this sense, but I have worked on several disassemblers, instruction set simulators and such things, and I have implemented a couple of interpreted languages of different kinds.

You probably also want to consider a JIT approach, where instead of interpreting the loaded bytecode, you translate it into direct machine code for the architecture in question.

The GCC code doesn't look terrible, but there are several places where an instruction depends on the result of the immediately preceding one, which is not great on modern processors. Unfortunately, I don't see any solution to that - it's a "the code is too short to shuffle things around" problem - and adding more instructions obviously won't work.

I do see one little problem: loading a 32-bit constant will require that it's 32-bit aligned for best performance. I have no idea how (or if) Java VMs deal with that.

Mats Petersson
  • I don't do any decoding right now; I figured out that is the best way to improve performance. Previously I experimented with bit-packed instructions, but the overhead from "decoding" is too much. Also, with an instruction size of 8 bits, I need to include too much padding. Since the goal is to implement a subset of C, instead of JIT-ing my own code with my own compiler, I just generate C code, compile it and dynamically link it, making native execution "plug-in" compatible with the VM. The VM can be used for prototyping and scripting; critical parts can be compiled to native. –  Aug 12 '13 at 09:05
  • As for alignment, the way I envision it everything will be properly aligned, with the exception of an occasional 64-bit value that I have no way of aligning and that will sometimes sit on a 32-bit boundary. It will probably have a performance impact, but solving it would cost more than reading a 64-bit value that is only 32-bit aligned. –  Aug 12 '13 at 09:08
  • By decoding, I mean what I think you call "dispatch" - that is, figuring out which "instruction" to execute next. Having a 64K table or a 64K (presumably somewhat sparse) switch statement will be less efficient than a 256-entry one... – Mats Petersson Aug 12 '13 at 09:08
  • So, you won't have the "load imm 32-bit value" followed by "value"? How are you going to guarantee alignment there? – Mats Petersson Aug 12 '13 at 09:10
  • Just because I have a 16-bit opcode doesn't mean I will have 2^16 instructions. I suspect I will be able to do it in fewer than 1000, the way things are going right now. Naturally, I will only use a lookup table as large as I need. I don't mean all instructions are 16 bits; I mean I am using a 16-bit number for the instruction index. The shortest instruction is 32 bits, the longest is 128 bits - the one that holds a 64-bit literal. –  Aug 12 '13 at 09:12
  • Well, the instruction will be 16 bits of opcode, 16 bits for the register offset and a 32-bit immediate value - 64 bits total. –  Aug 12 '13 at 09:13
  • Yes, but if you have an odd number of "simple" 16-bit instructions (I take it that there are some that are just one opcode and nothing else), your next 16+16+32 bit set will be unaligned, unless you put padding in. – Mats Petersson Aug 12 '13 at 09:16
  • I planned on an instruction that is only an opcode, but decided not to use one; there is not much use for it and it would either completely throw off alignment or require more complexity to pad. Right now I am using structs with portable fixed-width fields to "address" the instruction members and increment to the next instruction, so the compiler pads everything; the only downside is the occasional 64-bit value aligned as 32-bit, which is not worth addressing as an issue. So the smallest instruction I have is 32 bits: a 16-bit opcode and either a 16-bit or 2x8-bit literal. –  Aug 12 '13 at 09:21
  • That seems quite space inefficient to me. Probably doesn't make a huge amount of difference, but space is still an issue for caches and such. – Mats Petersson Aug 12 '13 at 09:45
  • Cache is important for performance, but the way my current dispatch mechanism is implemented, in order to pad I would need to add an extra instruction that does nothing but increment ip. This will probably take more time than reading an occasional 64-bit value in two passes. This applies only to the instructions; all user data is properly aligned to be used with SIMD vector units. Also, the VM is intended for prototyping and scripting; the code format may not be the most efficient size-wise, but for glue code the overhead will be small, and the rest of the static stuff can be pre-compiled and linked so it runs natively. –  Aug 12 '13 at 10:14
  • If it's strictly for prototyping, why not use Python or some such language? And it's A LOT of work for something that is of "no real use" from what you describe. – Mats Petersson Aug 12 '13 at 10:16
  • Because I want to solve a problem modern languages have introduced - a different language/toolchain and troublesome integration with native binaries. I am planning a language where the compiled and dynamic paradigms are identical - same syntax, and it is just a matter of personal decision whether a block of code is compiled or interpreted - with the possibility of using the same API in both static and dynamic code. It can also natively link to C and C++ code without awkward interfaces like JNI or ActiveX or whatever. Basically I am creating a C-style programming language with "emulated" OOP and other higher-order constructs. –  Aug 12 '13 at 10:21
  • But unless you can convince EVERYONE to use YOUR language, AND implement all the existing C and C++ functionality as units that plug in to your language, you will still have to deal with the fact that "We have some function here that we need to use, but it's in language X and we have language Y". But by all means, please go ahead. I'm often a pessimist when it comes to changing things quickly. – Mats Petersson Aug 12 '13 at 10:24
  • I've used Python, but as a C guy the syntax outrages me, plus performance is pretty bad, and I am aiming for much better performance. I've also used JS, but its OOP paradigm is very illogical to me; and while JIT is good for performance code, for bindings it is too slow - interpretation has a lower initial latency penalty. I've also used JS with QML and Qt, and I really hate the fact that I cannot use the Qt APIs from QML, nor operator overloads and stuff like that. –  Aug 12 '13 at 10:24
  • I am creating the language for myself; if others are willing to use it, I have no problem with that. And since it generates standard C code, it is pretty flexible. Basically, I want to create a tool that allows regular people to program without being programmers (you do have to give up other stuff to become proficient with current tools and technologies). Also, since it will be entirely visual, without any typing (except literals and identifiers), I expect it will be more digestible to kids and newbies, and with visual nodes there is very little that can go wrong in terms of typos. –  Aug 12 '13 at 10:27

I think you are asking the wrong question - not because it is a bad question; on the contrary, it is an interesting subject and I suspect many people are interested in the results, just as I am.

However, so far no one is sharing similar experience, so I guess you may have to do some pioneering. Instead of wondering which approach to use and wasting time on the implementation of boilerplate code, focus on creating a "reflection" component that describes the structure and properties of the language. Create a nice polymorphic structure with virtual methods, without worrying about performance, and create modular components you can assemble at runtime; there is even the option to use a declarative language once you have established the object hierarchy. Since you appear to use Qt, you have half the work cut out for you. You can then use the tree structure to analyze and generate a variety of different code: C code to compile, or bytecode for a specific VM implementation, of which you can create several; you can even use it to programmatically generate the C code for your VM instead of typing it all by hand.

I think this advice will be more beneficial in case you end up pioneering on the subject without a concrete answer in advance; it will allow you to easily test out all the scenarios and make up your mind based on actual performance rather than your own assumptions and those of others. Then maybe you can share the results and answer your own question with performance data.

dtech

Instruction length in bytes has been handled the same way for quite a while. Obviously, being limited to 256 instructions isn't a good thing when there are so many types of operations you wish to perform.

This is why there are prefix values. Back in the Game Boy architecture, there wasn't enough room to include the needed 256 bit-manipulation instructions, so one opcode was used as a prefix. This kept the original 256 opcodes as well as 256 more that start with that prefix byte.

For example: One operation might look like this: D6 FF = SUB A, 0xFF

But a prefixed instruction would be presented as: CB D6 = SET 2, (HL)

If the processor read CB it'd immediately start looking in another instruction set of 256 opcodes.

The same goes for the x86 architecture today, where any instruction prefixed with 0F is essentially part of another instruction set.

With the sort of execution you're using for your emulator, this is the best way of extending your instruction set. 16-bit opcodes would take up way more space than necessary, and the prefix doesn't make the lookup much longer.
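
In an interpreter, that amounts to a second 256-entry table behind one reserved opcode, rather than one huge 16-bit table; a rough sketch:

    // Two 256-entry tables: a primary one, plus one reached through a reserved
    // prefix byte (0xCB here, as on the Game Boy; the value itself is arbitrary).
    #include <cstdint>

    typedef void (*Handler)();
    extern Handler primary[256];
    extern Handler prefixed[256];
    extern std::uint8_t* ip;

    const std::uint8_t PREFIX = 0xCB;

    void dispatch_one()
    {
        std::uint8_t op = *ip++;
        if (op == PREFIX)
            prefixed[*ip++]();   // second lookup only for prefixed instructions
        else
            primary[op]();
    }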

Radnyx

One thing you should decide is what balance you wish to strike between code-file size efficiency, cache efficiency, and raw execution-speed efficiency. Depending upon the coding patterns of the code you're interpreting, it may be helpful to have each instruction, regardless of its length in the code file, get translated into a structure containing a pointer and an integer. The pointer would point to a function that takes a pointer to the instruction-info structure as well as a pointer to the execution context. The main execution loop would thus be something like:

    do
    {
      pc = pc->func(pc, &context);
    } while(pc);

the function associated with an "add short immediate" instruction would be something like:

    INSTRUCTION *add_instruction(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
    {
      context->op_stack[0].asInt64 += pc->operand;
      return pc+1;
    }

while "add long immediate" would be:

    INSTRUCTION *add_instruction(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
    {
      context->op_stack[0].asInt64 += (uint32_t)pc->operand + ((int64_t)(pc[1].operand) << 32);
      return pc+2;
    }

and the function associated with an "add local" instruction would be:

    INSTRUCTION *add_instruction(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
    {
      CONTEXT_ITEM *op_stack = context->op_stack;
      op_stack[0].asInt64 += op_stack[pc->operand].asInt64;
      return pc+1;
    }

Your "executables" would consist of a compressed bytecode format, but they would then get translated into a table of such instruction structures, eliminating a level of indirection when decoding instructions at run time.
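
For these snippets to hang together, the declarations would need to look something like this (just one possible layout):

    /* One possible layout for the types used above (just an example). */
    #include <stdint.h>

    typedef struct INSTRUCTION INSTRUCTION;
    typedef struct EXECUTION_CONTEXT EXECUTION_CONTEXT;

    typedef union CONTEXT_ITEM {
        int64_t  asInt64;
        uint64_t asUInt64;
        double   asDouble;
    } CONTEXT_ITEM;

    struct INSTRUCTION {
        INSTRUCTION *(*func)(INSTRUCTION *pc, EXECUTION_CONTEXT *context);
        int32_t operand;            /* immediate value, local index, etc. */
    };

    struct EXECUTION_CONTEXT {
        CONTEXT_ITEM op_stack[256]; /* op_stack[0] acts as the working/top slot */
    };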

supercat