I am developing a simple VM and I am in the middle of a crossroad.
My initial goal was to use byte long instruction, and therefore a small loop and a quick computed goto dispatch.
However, turns out reality could not be further from it - 256 is nowhere near enough to cover signed and unsigned 8, 16, 32 and 64bit integers, floats and doubles, pointer operations, the different combinations of addressing. One option was to not implement byte and shorts but the goal is to make a VM that supports the full C subset as well as vector operations, since they are pretty much everywhere anyway, albeit in different implementations.
So I switched to 16bit instruction, so now I am also able to add portable SIMD intrinsics and more compiled common routines that really save on performance by not being interpreted. There is also caching of global addresses, initially compiled as base pointer offsets, the first time an address is compiled it simply overwrites the offset and instruction so that next time it is a direct jump, at the cost of and extra instruction in the set for each use of a global by an instruction.
Since I am not in the stage of profiling, I am in a dilemma, are the extra instructions worth the more flexibility, will the presence of more instructions and therefore the absence of copying back and forth instructions make up for the increased dispatch loop size? Keeping in mind the instructions are just a few assembly instructions each, e.g:
.globl __Z20assign_i8u_reg8_imm8v
.def __Z20assign_i8u_reg8_imm8v; .scl 2; .type 32; .endef
__Z20assign_i8u_reg8_imm8v:
LFB13:
.cfi_startproc
movl _ip, %eax
movb 3(%eax), %cl
movzbl 2(%eax), %eax
movl _sp, %edx
movb %cl, (%edx,%eax)
addl $4, _ip
ret
.cfi_endproc
LFE13:
.p2align 2,,3
.globl __Z18assign_i8u_reg_regv
.def __Z18assign_i8u_reg_regv; .scl 2; .type 32; .endef
__Z18assign_i8u_reg_regv:
LFB14:
.cfi_startproc
movl _ip, %edx
movl _sp, %eax
movzbl 3(%edx), %ecx
movb (%ecx,%eax), %cl
movzbl 2(%edx), %edx
movb %cl, (%eax,%edx)
addl $4, _ip
ret
.cfi_endproc
LFE14:
.p2align 2,,3
.globl __Z24assign_i8u_reg_globCachev
.def __Z24assign_i8u_reg_globCachev; .scl 2; .type 32; .endef
__Z24assign_i8u_reg_globCachev:
LFB15:
.cfi_startproc
movl _ip, %eax
movl _sp, %edx
movl 4(%eax), %ecx
addl %edx, %ecx
movl %ecx, 4(%eax)
movb (%ecx), %cl
movzwl 2(%eax), %eax
movb %cl, (%eax,%edx)
addl $8, _ip
ret
.cfi_endproc
LFE15:
.p2align 2,,3
.globl __Z19assign_i8u_reg_globv
.def __Z19assign_i8u_reg_globv; .scl 2; .type 32; .endef
__Z19assign_i8u_reg_globv:
LFB16:
.cfi_startproc
movl _ip, %eax
movl 4(%eax), %edx
movb (%edx), %cl
movzwl 2(%eax), %eax
movl _sp, %edx
movb %cl, (%edx,%eax)
addl $8, _ip
ret
.cfi_endproc
This example contains the instructions to:
- assign unsigned byte from immediate value to register
- assign unsigned byte from register to register
- assign unsigned byte from global offset to register and, cache and change to direct instruction
- assign unsigned byte from global offset to register (the now cached previous version)
- ... and so on...
Naturally, when I produce a compiler for it, I will be able to test the instruction flow in production code and optimize the arrangement of the instructions in memory to pack together the frequently used ones and get more cache hits.
I just have a hard time figuring if such a strategy is a good idea, the bloat will make up for flexibility, but what about performance? Will more compiled routines make up for a larger dispatch loop? Is it worth caching global addresses?
I would also like for someone, decent in assembly to express an opinion on the quality of the code that is generated by GCC - are there any obvious inefficiencies and room for optimization? To make the situation clear, there is a sp
pointer, which points to the stack that implements the registers (there is no other stack), ip
is logically the current instruction pointer, and gp
is the global pointer (not referenced, accessed as an offset).
EDIT: Also, this is the basic format I am implementing the instructions in:
INSTRUCTION assign_i8u_reg16_glob() { // assign unsigned byte to reg from global offset
FETCH(globallAddressCache);
REG(quint8, i.d16_1) = GLOB(quint8);
INC(globallAddressCache);
}
FETCH returns a reference to the struct, which the instruction is using based on the opcode
REG returns a reference to register value T from offset
GLOB retursn a reference to global value from a cached global offset (effectively absolute address)
INC just increments the instruction pointer by the size of the instruction.
Some people will probably suggest against the usage of macroses, but with templates it is much less readable. This way the code is pretty obvious.
EDIT: I would like to add a few points to the question:
I could go for a "register operations only" solution which can only move data between registers and "memory" - be that global or heap. In this case, every "global" and heap access will have to copy the value, modify or use it, and move it back to update. This way I have a shorter dispatch loop, but a few extra instructions for each instruction that addresses non-register data. So the dilemma is a few times more native code with longer direct jumps, or a few times more interpreted instructions with shorter dispatch loop. Will a short dispatch loop give me enough performance to make up for the extra and costly memory operations? Maybe the delta between the shorter and longer dispatch loop is not enough to make a real difference? In terms of cache hits, in terms of the cost of assembly jumps.
I could go for additional decoding and only 8bit wide instructions, however, this may add another jump - jump to wherever this instruction is handled, then waste time on either jumping to the case the particular addressing scheme is handled or decoding operations and a more complex execution method. And in the first case, the dispatch loop still grows, plus adding yet another jump. The second option - register operations can be used to decode the addressing, but a more complex instruction with more compile time unknown will be needed in order to address anything. I am not really sure how will this stack up with a shorter dispatch loop, once again, uncertain how my "shorter and longer dispatch loop" relates to what is considered short or long in terms of assembly instructions, the memory they need and the speed of their execution.
I could go for the "many instructions" solution - the dispatch loop is a few times larger, but it still uses pre-computed direct jumping. Complex addressing is specific and optimized for each instruction and compiled to native, so the extra memory operations that would be needed by the "register only" approach will be compiled and mostly executed on the registers, which is good for performance. Generally, the idea is add more to the instruction set but also add to the amount of work that can be compiled in advance and done in a single "instruction". The loner instruction set also means longer dispatch loop, longer jumps (although that can be optimized to minimize), less cache hits, but the question is BY HOW MUCH? Considering every "instruction" is just a few assembly instructions, is an assembly snippet of about 7-8k instructions considered normal, or too much? Considering the average instruction size varies around 2-3b, this should not be more than 20k of memory, enough to completely fit in most L1 caches. But this is not concrete math, just stuff I came at googling around, so maybe my "calculations" are off? Or maybe it doesn't work that way? I am not that experienced in caching mechanisms.
To me, as I currently weight the arguments, the "many instructions" approach appears to have the biggest chances for best performance, provided of course, my theory about fitting the "extended dispatch loop" in the L1 cache holds. So here is where your expertise and experience comes into play. Now that the context is narrowed and a few support ideas presented, maybe it will be easier to give a more concrete answer whether the benefits of a larger instruction set prevail over the size increase of native code by decreasing the amount of the slower, interpreted code.
My instruciton size data is based on those stats.