Decompiler - how to structure loops

Question

I'm writing a decompiler for a simple scripting language. Here is what I've done:

Basic blocks

Created a collection of basic blocks as described here:

http://www.backerstreet.com/decompiler/basic_blocks.php

Control flow graph, dominator tree and loop sets

From this I could create the control flow graph.

http://www.backerstreet.com/decompiler/control_flow_graph.php

From the CFG I created the dominator tree, which in turn allowed me to find the loop sets in the CFG as described here:

http://www.backerstreet.com/decompiler/loop_analysis.php

Here is an image containing all of my data so far:

Control flow diagram

Structuring loops

My next step should be:

http://www.backerstreet.com/decompiler/creating_statements.php

And this is where my question comes in because I'm totally stuck. In my given data how would the structuring loops algorithm apply? I don't understand why it starts by trying to structure everything as a do while loop - in my example this means that "temp5_14 = temp5_14 + 16" in block 3 would always get executed at least once which is not what the original code does at all.

How could that work and how would the next pass of converting it from a do-while to a while loop actually work? For loop 3 that ends in block 6 this looks like it should be a while(true) - but how would this work with the algorithm when its head block is an if statement?

TL;DR - someone please explain with examples how the "structuring loops" algorithm actually works.

Antimony · Accepted Answer · 2014-12-02T07:04:23.377

Structuring is the hardest part of decompiler development (at least for high level languages). That's a fairly simplistic algorithm, so it's a good starting point, but you'll likely want to use a better algorithm or make your own if you're working on a real decompiler.

With that out of the way, the answer to your actual question about how it can use do-while loops instead of while loops is already answered on the page you linked to.

Every loop can be described with a "do-while" statement.

The "while" loop (pre-tested loop) is a special case of a "do-while" loop, where the bottom condition is always true, and the first statement of the loop is an "if" that jumps out of the loop.

Say you had something like

beforeloop
while(foo) {
stmt1
stmt2
}
afterloop

It would be compiled to something along the lines of

beforeloop

LOOPBEGIN:
if !foo goto LOOPEND
stmt1
stmt2
goto LOOPBEGIN

LOOPEND:
afterloop

The decompiler algorithm converts this to

beforeloop
do {
if (!foo) {break}
stmt1
stmt2
} while (true)
afterloop

I hope that cleared that up. If not, feel free to ask about any other questions.

Edit: Example 2, showing how multiple loops with the same entry point can be collapsed.

for(;;) { while(foo) {} while(bar){} }

First off, for(;;) is equivalent to while(true), so I'll use the following (pseudo)code instead

while(true) { while(foo) {stmt1} while(bar){stmt2} }

Let the outer loop be loop A and the inner loops be loop B and C. This compiles to something like the following pseudo assembly.

LOOP_A_BEGIN:
LOOP_B_BEGIN:
if !foo goto LOOP_B_END
stmt1
goto LOOP_B_BEGIN
LOOP_B_END:
LOOP_C_BEGIN:
if !bar goto LOOP_C_END
stmt2
goto LOOP_C_BEGIN
LOOP_C_END:
goto LOOP_A_BEGIN

But of course labels don't take up any space. So with identical labels collapsed, it becomes

POINT1:
if !foo goto POINT2
stmt1
goto POINT1
POINT2:
if !bar goto POINT3
stmt2
goto POINT2
POINT3
goto POINT1

Now, there are two points with backedges - point 1 and point 2. We can create one loop for each node, using labeled breaks for clarity. The transform isn't quite as straightforward since you have to mess with the if statements a bit, but it's still pretty easy.

LOOP1: while(true) {
    IF1: if (!foo) {
        break IF1;
    }
    else {
        stmt1;
        continue LOOP1;
    }

    LOOP2: while(true) {
        if (!bar) {
            break LOOP2;
        }
        else {
            stmt2;
            continue LOOP2;
        }
    }

    continue LOOP1;
}

Now, the same code with unnecessary labels simplified out

while(true) {
    if (!foo) {
    }
    else {
        stmt1;
        continue;
    }

    while(true) {
        if (!bar) {
            break;
        }
        else {
            stmt2;
        }
    }
}

Now with if statements simplified

while(true) {
    if (foo) {
        stmt1;
        continue;
    }

    while(true) {
        if (!bar) {
            break;
        }
        stmt2;
    }
}

And finally, you can apply the while(true) if(!x) transform to the inner loop. The outer loop can't be transformed like this, since it's not a simple while(cond) loop due to being the result of merged loops.

while(true) {
    if (foo) {
        stmt1;
        continue;
    }

    while(bar) {
        stmt2;
    }
}

So hopefully, this demonstrates how you can always handle the case of multiple loops with the same entry point by merging them into a single loop, at the possible expense of having to rearrange some if statements too.

I think this makes sense, but then how would the do {} while(true) become while(foo) { } in the next step? This do { } while(true) step is added lots more basic blocks to the CFG? — paulm, Nov 27 '14 at 09:53
@paulm Their method doesn't make much sense to me. But one way to do it is A) convert everything to while(true) and then B) look for the pattern while(true) {if(foo) {break} ...) and convert that to while(foo) {...} — Antimony, Nov 27 '14 at 21:34
Now I'm more confused :) the logic my CFG isn't if(foo) break; it would have to be converted to if(!foo) break? Also how would this work for B3 to B2 and B6 to B2 and both loop headers are the same block yet one is actually a nested loop? — paulm, Nov 28 '14 at 07:56
Oops, I messed up. You correct that it should be if(!foo) break. You should never have nested loops with the same start block, because if they both have conditions, then they both start with distinct if statements. If they don't have conditions, then you can simplify it by generating only one loop. — Antimony, Nov 28 '14 at 17:34
But in my CFG diagram it seems to me that it must be a for(;;) { while(cond) {} while(cond){} } structure? those last two braces are what cause both of the loops to have the same header? — paulm, Nov 28 '14 at 18:02
@paulm for(;;) is equivalent to while(true). So you have a loop with a condition nested inside a loop with no condition. But since the outer loop has no condition, it can be simplified and merged with the first inner while loop. I can post a demonstration of that example if you're still confused. — Antimony, Nov 28 '14 at 20:10
A demonstration would be nice, I honestly can't see how those 2 loops become 1 — paulm, Nov 28 '14 at 20:18
@paulm Sorry I took so long to respond. I've added an example of loop merging. — Antimony, Dec 02 '14 at 07:06

zneak · Answer 2 · 2015-08-10T19:12:20.273

There's a paper called No More Gotos that suggests an algorithm for pattern-independent control flow structuring. It's a good chunk of work to understand and implement, but I'm using it for my senior project and it looks like it works.

My implementation is available on GitHub at zneak/fcd. The project uses LLVM IR as the intermediate representation.

EDIT At a very high level, this is how the algorithm works:

Domination

Just to be sure that these concepts are understood: a node A dominates a node Z if any path from the entry of the graph to Z has to go through A. Similarly, a node Z post-dominates a node A if any path from A to the end of the graph has to go through Z.

Regions

The algorithm structures regions. A region is a part of the graph that has a single entry edge and a single exit edge. I have personally extended this definition to basic blocks (regions have a single entry block and a single exit block), and made it so the exit block is excluded from the region.

The definition of a region that I use is:

The entry node dominates the exit node;
The exit node post-dominates the entry node;
Once you reach the exit, any path that takes you back into the region passes through the entry node. (This last bit is very important and solves problems related to cycles.)

The definition implies that the entry dominates every node through the exit node, and the exit node post-dominates every node from the entry.

Normalize loops

A loop is a region with a "back edge" (a node with an edge going back to an already-visited node if you traversed your graph depth-first).

Make sure that loops are represented as single-entry single-exit regions with a single back-edge. That is, they should have only one entry node (to which the back edge also points) and a single successor. When it's not the case, you can introduce a new entry block and make all the edges point to it, then use a Φ-node to forward execution from there (in other words, introduce a variable that you set at the end of each incoming block, and do if (var == 0) { first entry } else if (var == 1) { second entry } from the new block).

In my implementation, this happens in the StructurizeCFG pass, in the master branch, as of writing. However, it produces poor results because it works much harder than it needs to. I only need it to structurize loops, but it also structurizes if-else constructs, and while it doesn't break the algorithm, it introduces waaaay to many Φ-nodes to make for a pretty output. As of writing too, there's a branch called seseloop with a custom pass to ensure that loops are single-entry single-exit. This pass doesn't touch if-else constructs if it doesn't need to.

Structurize your functions

Traverse the basic block graph in post-order. Identify regions starting at this block. You can use the post-dominator tree to speed this up a bit, as a region has to end with a block's post-dominator (so for each block, only check with the block's post-dominators).

If the entry block has a back-edge pointing to it, structurize it as a loop. If not, structurize it as a region. When a region has been structurized, put it back in your graph as a collapsed single node that can be included in another larger structurized region.

This happens in ast/ast_backend.cpp.

Structurizing regions

Use depth-first traversal (skipping cycles) on the region to identify the condition leading to the execution of any block. For instance:

Graph describing an if-then-else sequence. Node A is the entry node; node B is the true node; node C is the false node, and node D is the exit node.

Node A has no condition. Node B is reached if the condition at the end of node A is true. Node C is reached if it's false. Node D is reached if it was either true or false.

(A);
if (a_cond) (B);
if (!a_cond) (C);
if (a_cond || !a_cond) (D);

You then have to simplify these conditions, which is unfortunately a NP-complete problem. In general, it shouldn't be too hard to get back to something like A; if (a_cond) B; else C; D; by collapsing a_cond || !a_cond and comparing conditional terms in order.

Structurizing loops

You basically do the same thing as if you were structurizing a region without caring that it's a loop, but after that you add break statements (with conditions when relevant) at the end of blocks that can exit the loop and you wrap the region in a while true block. Then, the authors have identified 6 patterns that can be replaced with more human-readable patterns (for instance, a while true starting with an if (cond) break can be transformed into a while !cond).

And that's about it.

Will you be writing this is a front end agnostic way? I.e so any sort of byte code can be plugged in? — paulm, Jul 20 '15 at 06:43
I don't think so. Quite honestly, I've got enough problems on my hands that I'll focus on solving mine first! I'm using LLVM IR. — zneak, Jul 20 '15 at 13:48
A new link to the paper: https://www.ndss-symposium.org/wp-content/uploads/2017/09/11_4_2.pdf — Antonio Noack, Sep 06 '22 at 07:52