How to use LLVM to convert stack-based virtual machine bytecode to SSA form

Question

There are a number of questions on how to convert SSA representation to stack machines, but I'm interested in the inverse.

Consider a stack-based VM with conditional/unconditional jumps, where each opcode has a fixed number of stack elements it consumes and produces.

Are there tools or approaches in the LLVM framework to reconstruct an SSA form from the bytecode output? This would essentially be a form of disassembly.

arnt · Accepted Answer · 2019-05-02T08:54:52.740

There are no tools in LLVM itself, but it's just a SMoP (small matter of programming). I've done it. Parts of it were difficult, but so is anything. I'll answer instead of commenting so I can ramble a bit about the most difficult part.

Stacks are typically typeless; the value that is on the top of the stack has a type, but "top of stack" does not. An LLVM Value always has a type, and those two systems collide when the code contains loops. Consider this code:

int a = b();
while(a<10)
    a++;

a has a type and all values of it will be int (perhaps i32 in LLVM IR). When the first line pushes the return value from b() onto the stack, the top of stack acquires type int. You can probably envision how those lines look on your stack machine. It ought to be translated to IR rather like this:

entry:
  %a1 = call i32 @b()
  br label %b1
b1:
  %stack.0.b1 = phi i32 [ %a1, %entry ], [ %a2, %b1 ]
  %a2 = add i32 %stack.0.b1, 1
  %cond = icmp slt i32 %stack.0.b1, 10
  br i1 %cond, label %b1, label %b2
b2:

(Sorry about any syntax errors; I haven't written much IR by hand.)

You probably see that each instruction except the phi can be generated from a single instruction in your stack language. Perhaps an instruction in your stack language leads to more than one IR instruction, or leads to no IR instructions, e.g. dup or push-constant-zero, which merely modify the stack.
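As a rough illustration of that one-to-one (or one-to-none) mapping, here is a minimal Python sketch of translating a single basic block with a symbolic stack. The opcode names (`push_const`, `dup`, `call`, `add`), the `%tN` naming scheme, and the assumption that everything is `i32` are all made up for the example; it emits IR-like text rather than using the real LLVM API.

```python
from itertools import count

def translate_block(ops):
    """Translate the stack ops of one basic block into IR-like text lines.

    The symbolic stack holds operand strings (SSA names or constants),
    not runtime values. Opcode names are hypothetical.
    """
    stack, lines, fresh = [], [], count(1)
    for op, *args in ops:
        if op == "push_const":   # no IR emitted: the constant becomes an operand
            stack.append(str(args[0]))
        elif op == "dup":        # no IR emitted: the same SSA value is reused
            stack.append(stack[-1])
        elif op == "call":       # one IR instruction, result pushed
            name = f"%t{next(fresh)}"
            lines.append(f"{name} = call i32 @{args[0]}()")
            stack.append(name)
        elif op == "add":        # pops two operands, emits one instruction
            rhs, lhs = stack.pop(), stack.pop()
            name = f"%t{next(fresh)}"
            lines.append(f"{name} = add i32 {lhs}, {rhs}")
            stack.append(name)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return lines, stack

lines, stack = translate_block(
    [("call", "b"), ("dup",), ("push_const", 1), ("add",)]
)
```

Four stack instructions produce only two IR lines here; `dup` and `push_const` change the symbolic stack and nothing else. The stack returned at the end of the block is what the successor blocks' phi nodes would take as incoming values.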

The phi is different: it does not come from any single stack instruction, but represents a value on the stack at that point.

The stack on entry to block b1 is computed from the stack at the end of each of entry and b1. You can generate a phi node for each value on the stack at the start of each basic block; the challenge is that the type of each phi node depends on the types on the stack at the end of the preceding blocks. In this case the stack at the end of entry has one entry, a1, and at the end of b1 it has one, a2. Therefore the type of stack.0.b1 depends on that of a2, which in turn depends on stack.0.b1. You'll need to think hard about that, particularly if your language includes implicit type promotion or casting (i32 to i64, string to object, etc.).
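One possible way to break that circular dependency is a fixed-point iteration: start every phi as "unknown" and keep widening until nothing changes. The Python sketch below assumes a toy two-type lattice where i32 promotes to i64; the `defs` encoding and the function names are invented for illustration and are not part of LLVM.

```python
RANK = {None: 0, "i32": 1, "i64": 2}  # None means "not yet known"

def join(a, b):
    """Return the wider of two types (assumed promotion rule: i32 -> i64)."""
    return a if RANK[a] >= RANK[b] else b

def infer_types(defs):
    """defs maps a value name to ("known", type) or ("derived", [operands]).

    A phi and an arithmetic result are both "derived": their type is the
    join of their operands' types. Iterate until nothing changes.
    """
    types = {n: (d[1] if d[0] == "known" else None) for n, d in defs.items()}
    changed = True
    while changed:
        changed = False
        for name, (kind, payload) in defs.items():
            if kind != "derived":
                continue
            new = types[name]
            for operand in payload:
                new = join(new, types[operand])
            if new != types[name]:
                types[name] = new
                changed = True
    return types

defs = {
    "a1": ("known", "i32"),                     # result of the call to b
    "one": ("known", "i32"),                    # the constant 1
    "stack.0.b1": ("derived", ["a1", "a2"]),    # phi at entry to b1
    "a2": ("derived", ["one", "stack.0.b1"]),   # the add
}
types = infer_types(defs)
```

The loop terminates because a type can only move up the lattice. Once the types are settled, any incoming value narrower than its phi would still need an explicit conversion (for example a `sext`) at the end of the predecessor block; that step is left out here.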

(I could have started with a Ruby-like type system and code instead of C-like; I think the final problem would be the same, only your solution would be different.)

Thanks a lot for explaining. I had one specific issue: I have a working untyped implementation, but it's unsatisfying. The challenge is that the virtual machine had jump locations that depended on the path into that node (i.e., the CFG had loops that only run once). I don't know how to represent that in LLVM. Is there a way for phi conditions to effectively "look ahead" beyond the immediate nodes? - Peteris, May 02 '19 at 09:25
I'm not sure I understand the question... if I do, then you could sort of do that using one or more i1 phi nodes. I'd probably have a problem with it and would try to change the problem to avoid the solution, but it could be done: `phi i1 [ 1, %b1 ], [ 0, %b2 ]`. Now looking at the value of that phi will tell you whether control arrived from block b1. If you have to propagate it using a chain of several phis, you could also fall back to writing to alloca'd memory, which LLVM will turn into the necessary phis. God how ugly this is. - arnt, May 02 '19 at 09:44
Oh I see what you mean. The problem here is that sometimes the path could be b0 -> b1 -> b2 -> b1 -> b3. Now the jump from b1 is conditional on either b0 being the previous node or b2 being the previous node, but b0 will always be visited in either case! So it cannot be a simple phi condition between b0 and b2. Maybe my best bet is to tell the compiler authors who are outputting this bytecode to remove these path-dependent jump destinations (it's a relatively new programming language). They are hardly better than dynamic jumps and unnecessary. - Peteris, May 02 '19 at 09:51
`%p1 = phi i1 [ 1, %b1 ], [ 0, %b2 ]` in block b3, then `%p2 = phi i1 [ %p1, %b3 ], [ 0, %b4 ]` propagates the information out of b3 to its successor... but I admit it's an ugly-enough hack that if you read this before lunch it might spoil your appetite. - arnt, May 02 '19 at 11:14
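The alloca fallback mentioned in the comments can be sketched roughly as follows: keep every stack slot in alloca'd memory, emit plain loads and stores, and leave phi construction to LLVM's mem2reg pass. This Python sketch only produces IR-like text; the opcodes (`push_const`, `add`), the all-i32 assumption, and the `%slotN`/`%tN` names are hypothetical, and the `ptr` syntax assumes a recent LLVM with opaque pointers.

```python
def translate_with_allocas(ops, max_depth):
    """Emit IR-like text that keeps each stack slot in alloca'd memory.

    Only the stack depth is tracked at translation time, which works
    because every opcode consumes and produces a fixed number of values.
    """
    lines = [f"%slot{i} = alloca i32" for i in range(max_depth)]
    depth, n = 0, 0
    for op, *args in ops:
        if op == "push_const":
            # Write the constant into the next free slot.
            lines.append(f"store i32 {args[0]}, ptr %slot{depth}")
            depth += 1
        elif op == "add":
            # Load the top two slots, add, store the result one slot down.
            lines.append(f"%t{n} = load i32, ptr %slot{depth - 2}")
            lines.append(f"%t{n + 1} = load i32, ptr %slot{depth - 1}")
            lines.append(f"%t{n + 2} = add i32 %t{n}, %t{n + 1}")
            lines.append(f"store i32 %t{n + 2}, ptr %slot{depth - 2}")
            depth -= 1
            n += 3
        else:
            raise ValueError(f"unknown opcode: {op}")
    return lines

ir = translate_with_allocas(
    [("push_const", 2), ("push_const", 3), ("add",)], max_depth=2
)
```

No phi nodes are written by hand in this scheme, so the type cycle has to be resolved some other way first (here by assuming every slot is i32). The same trick would work for a path flag: store to an `i1` slot in the blocks of interest and load it where the jump is decided.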
