Rellic [1] implements an algorithm that generates goto-free control flows (citat...

dataflow · on July 13, 2022

> that generates goto-free control flows

...note: from LLVM bitcode.

lmz · on July 13, 2022

https://github.com/lifting-bits/remill is linked from there, which I guess is where you get your bitcode from.

dataflow · on July 13, 2022

Oh that's cool, thanks!

FrozenVoid · on July 13, 2022

What happens with code that uses lots of gotos(incl. computed gotos)?

jcranmer · on July 13, 2022

From reading the paper, it basically does jump unthreading. Basically, if you imagine code like this:

  bool found = false;
  for (...) {
    if (...) {
      found = true;
      break;
    }
  }
  if (found) {
    // A
  } else {
    // B
  }

Jump threading is an optimization pass that replace the break statement with a goto A. After that replacement, found is always false, so the boolean variable and the if statement is deleted. The resulting code would look something like this [1]:

  for (...) {
    if (...) {
      // A
      goto end;
    }
  }
  // B
  end:;

What the lifting is doing here is essentially running this pass in reverse. If you find a branch pattern that doesn't meet any preordained schema (such as a loop with multiple exits), just synthesize a variable that tells you which target you're going to jump to. Were the compiler to optimize the resulting code, jump threading would convert it back into the gotos present in the compiled binary.

[1] This kind of optimization pass runs at a stage when the code is basically treated entirely as a CFG and there's no such thing as if statements or jumps or gotos, just conditional and unconditional branches terminating basic blocks. Any reflection of the code outside of this form is therefore somewhat imprecise.

mdp2021 · on July 13, 2022

Can you think of an example of "goto"-based code that cannot be translated into conventionally structure code?

gjm11 · on July 13, 2022

[EDITED to say explicitly:] You can translate any goto-laden code into "conventionally structured" code mechanically, if you don't care about having the structure of the resulting code actually indicate what it does. Here's an example of the sort of code for which that might be the best you can do.

Suppose you implement a state machine with gotos. So, for a simple (and contrived) example, suppose you have something that absorbs the decimal digits of a number and keeps track of the value of the number modulo 3 by having three states. Something like this (pseudocode):

    def mod3():
        state0:
            d = getdigit()
            if d == FINISHED: return 0
            if d is 0, 3, 6, 9: goto state0
            if d is 1, 4, 7: goto state1
            if d is 2, 5, 8: goto state2
            return ERROR
        state1: 
            d = getdigit()
            if d == FINISHED: return 0
            if d is 0, 3, 6, 9: goto state1

(etc.) You've got three stateN labels each of which can jump to any of the stateN labels (as well as being able to return from the function).

If you have tail-call optimization you can turn this into conventionally structured code, more or less:

    def state0():
        d = getdigit()
        if d == FINISHED: return 0
        if d is 0, 3, 6, 9: return state0()
        if d is 1, 4, 7: return state1()
        if d is 2, 5, 8: return state2()
        return ERROR

with similar definitions for state1() and state2(), and then the top-level function just calls state0. But this depends on knowing that all those tail calls will get optimized, or else on never having enough digits to overflow your stack.

Or else you can have an explicit state variable:

    def mod3():
        state = 0
        loop:
            if state == 0:
                d = getdigit()
                if d == FINISHED: return 0
                if d is 0, 3, 6, 9: state = 0
                else if d is 1, 4, 7: state = 1
                else if d is 2, 5, 8: state = 2
                else: return ERROR
            else if state == 1:
                ...
            else:
                ...

which works pretty well for the special case of state machines but badly for most other things a goto might be used for. (Though obviously you can translate literally any goto-using code into this sort of thing. You might want to call the analogue of the "state" variable here "program_counter" in that case.)

not2b · on July 13, 2022

The translation can always be done, but for dense spaghetti control structures duplication of code may be required. One can construct artificial cases where the size increase is exponential, but that's unlikely to be an issue in even the worst real code.

psifertex · on July 14, 2022

This is actually why we chose _not_ to implement no-more-gotos for Binary Ninja's HLIL! Code is actually more readable with gotos in some situations and trying to force their elimination hurts readability.

zozbot234 · on July 13, 2022

> Rellic [1] implements an algorithm that generates goto-free control flows

Doesn't WebAssembly implement that already, via Relooper?