6

I'm thinking of doing some bytecode manipulation (think genetic programming) in Python.

I came across a test case in the crashers test section of the Python source tree that states:

Broken bytecode objects can easily crash the interpreter. This is not going to be fixed.

Thus the question: how can I validate a given piece of tweaked bytecode to make sure it will not crash the interpreter? Is that even possible?

Test source, after http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html

cc = (lambda fc=(
    lambda n: [
        c for c in
            ().__class__.__bases__[0].__subclasses__()
            if c.__name__ == n
        ][0]
    ):
    fc("function")(
        fc("code")(
            0, 0, 0, 0, "KABOOM", (), (), (), "", "", 0, ""
        ), {}
    )()
)

This module defines `cc`; calling `mymod.cc()` crashes the interpreter. Granted, this is a very tricky example: it creates a new code object with the custom bytecode "KABOOM" and then runs it.

I'd accept something that verifies predefined bytecode, e.g. from a .pyc file.
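
To make the `.pyc` case concrete, here is a minimal sketch of pulling the code object out of a `.pyc` file and running one cheap structural check on it. It assumes CPython 3.7 or later (16-byte `.pyc` header); the names `load_pyc_code`, `iter_code_objects` and `find_bad_jumps` are made up for illustration. Passing this check is necessary but nowhere near sufficient: it says nothing about stack depth or operand indices.

```python
import dis
import importlib.util
import marshal
import types

PYC_HEADER_SIZE = 16  # assumption: CPython 3.7+ header layout


def load_pyc_code(path):
    """Return the top-level code object stored in a .pyc file."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("bytecode was compiled by a different Python version")
    # marshal.loads is itself unsafe on hostile input; only use it on
    # files whose origin you trust.
    code = marshal.loads(data[PYC_HEADER_SIZE:])
    if not isinstance(code, types.CodeType):
        raise ValueError("payload is not a code object")
    return code


def iter_code_objects(code):
    """Yield `code` and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


def find_bad_jumps(code):
    """Return (code name, offset) pairs for jumps that miss an instruction start."""
    jump_ops = set(dis.hasjrel) | set(dis.hasjabs)
    problems = []
    for co in iter_code_objects(code):
        instructions = list(dis.get_instructions(co))
        valid_offsets = {ins.offset for ins in instructions}
        for ins in instructions:
            if ins.opcode in jump_ops and ins.argval not in valid_offsets:
                problems.append((co.co_name, ins.offset))
    return problems
```

An empty list from `find_bad_jumps(load_pyc_code("mymod.pyc"))` only means that no jump lands outside the instruction stream; the file name here is a placeholder.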

Dima Tisnek
  • 4
    I know of no method that'll validate bytecode, no. This is a hard task; better just produce valid bytecode. – Martijn Pieters Apr 24 '14 at 11:47
  • 3
    I think this may be undecidable. Suppose you have bytecode equivalent to: `if method_that_may_loop_forever(): crash()`. you would have to solve the [Halting Problem](http://en.wikipedia.org/wiki/Halting_problem) to determine whether it will crash or not. – Kevin Apr 24 '14 at 11:53
  • 3
    @Kevin I surely don't want to solve halting problem. I only want to determine if a particular bytecode sequence is guaranteed safe or is potentially unsafe. Similar to what JVM does. – Dima Tisnek Apr 24 '14 at 13:57
  • oh, ok, that's possible, then :-) I don't personally know of any method to do it, however. – Kevin Apr 24 '14 at 13:59
  • 1
    Why would you want to generate bytecode directly, if one can generate python source code and execute it instead? First approach is not well documented, lacks tools, etc... Are there serious disadvantages of the source code generation for your case? – Tim Sep 27 '14 at 07:37
  • 2
    In genetic programming a quality or fitness is being optimized. If the ratio of invalid candidates is too high, genetic algorithms are ineffective. Better ensure candidates are correct by construction, so that a fitness can be calculated. Difficult, though! – cfi Sep 27 '14 at 20:12
  • Adding to what Timur said; bytecode generation is less portable. There's very little with regards to stability guarantees between versions of any Python interpreter, never mind Python the language. – Veedrac Sep 28 '14 at 20:27

3 Answers

3

A bytecode assembler tracks the stack level across jumps, globally verifies that its stack-level predictions are consistent, and automatically rejects attempts to generate dead code. With one, it is virtually impossible to accidentally generate bytecode that can crash the interpreter.

This Link might help you.
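
The stack tracking described above can be sketched on a toy instruction set. Everything here is hypothetical: the opcodes in `EFFECTS` are not CPython's, and `verify` is not the API of any assembler library. It only illustrates the idea of propagating stack depth along every control-flow edge and rejecting underflow, inconsistent depths at a jump target, and unreachable code.

```python
# Toy instruction set (NOT CPython opcodes): name -> (values popped, values pushed)
EFFECTS = {
    "PUSH": (0, 1),
    "POP": (1, 0),
    "ADD": (2, 1),
    "JUMP": (0, 0),           # unconditional jump to arg
    "JUMP_IF_FALSE": (1, 0),  # pops a condition; jumps to arg or falls through
    "RETURN": (1, 0),         # pops the result and stops
}


class VerificationError(Exception):
    pass


def verify(program):
    """Check a list of (opname, arg) pairs; return the maximum stack depth."""
    if not program:
        raise VerificationError("empty program")
    depth_at = {0: 0}   # stack depth on entry to each reachable instruction
    worklist = [0]
    max_depth = 0
    while worklist:
        pc = worklist.pop()
        op, arg = program[pc]
        if op not in EFFECTS:
            raise VerificationError(f"unknown opcode {op!r} at {pc}")
        pops, pushes = EFFECTS[op]
        depth = depth_at[pc]
        if depth < pops:
            raise VerificationError(f"stack underflow at {pc}")
        depth = depth - pops + pushes
        max_depth = max(max_depth, depth)
        if op == "RETURN":
            successors = []
        elif op == "JUMP":
            successors = [arg]
        elif op == "JUMP_IF_FALSE":
            successors = [pc + 1, arg]
        else:
            successors = [pc + 1]
        for target in successors:
            if not isinstance(target, int) or not 0 <= target < len(program):
                raise VerificationError(f"bad target {target!r} from {pc}")
            if target not in depth_at:
                depth_at[target] = depth
                worklist.append(target)
            elif depth_at[target] != depth:
                raise VerificationError(f"inconsistent stack depth at {target}")
    unreachable = sorted(set(range(len(program))) - set(depth_at))
    if unreachable:
        raise VerificationError(f"dead code at {unreachable}")
    return max_depth
```

Because each instruction is visited once and the check is purely about depths, this always terminates; it errs on the side of caution by rejecting programs that might be fine at run time (for example, dead code).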

devst3r
  • 1
    Good library/link, I have to verify it does indeed validate byte code to the degree I want. I suspect that validation of arbitrary code jumps is technically undecidable, thus the question is whether byte code assembler errs on the side of caution (rejects potentially bad bytecode) or on the side of user (allows potentially good code). – Dima Tisnek Sep 23 '14 at 12:51
  • If you use the BytecodeAssembler module (http://pypi.python.org/pypi/BytecodeAssembler), you won't need to figure this stuff out. For that matter, it has lots of support for labels, block handling, etc. The full manual for it is at http://peak.telecommunity.com/DevCenter/BytecodeAssembler – devst3r Sep 29 '14 at 06:24
  • 1
    I'm afraid this statement only applies to generating code using `BytecodeAssembler` and not when parsing existing byte code. It may prove possible to map existing byte code to sequences of API calls though, I'm trying to hack something up... – Dima Tisnek Sep 30 '14 at 08:23
1

Python might not be an ideal language for such tasks, for the reasons stated in the question.

One approach: don't create or accept raw bytecode; accept only Python source code and compile it yourself.
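
A minimal sketch of that approach, using only the built-in `compile` and `exec`. The helper names `compile_candidate` and `run_candidate` are made up for illustration. Note that this only guarantees the bytecode came from the compiler; it does not sandbox what the source does.

```python
def compile_candidate(source, filename="<candidate>"):
    """Compile source text; return a code object, or None if it is not valid Python."""
    try:
        return compile(source, filename, "exec")
    except (SyntaxError, ValueError):
        # Invalid candidates are simply discarded instead of being executed.
        return None


def run_candidate(source):
    """Compile and run a candidate in a fresh namespace; return that namespace."""
    code = compile_candidate(source)
    if code is None:
        return None
    namespace = {}
    exec(code, namespace)  # still runs arbitrary Python; not a sandbox
    return namespace
```

In a genetic-programming loop, a candidate for which `compile_candidate` returns `None` could be given the worst fitness rather than being run.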

Further, there are libraries (e.g. RestrictedPython) that manipulate Python at the AST level to provide some security guarantees, for example to prevent sandbox escapes.
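
To give an idea of what an AST-level check looks like, here is a small sketch built on the standard `ast` module. It is not RestrictedPython's API or policy; `UnderscoreGuard`, `PolicyViolation` and `compile_checked` are hypothetical names, and the single rule (no names or attributes starting with an underscore) is chosen only because it blocks the `().__class__` trick from the question.

```python
import ast


class PolicyViolation(Exception):
    pass


class UnderscoreGuard(ast.NodeVisitor):
    """Reject any name or attribute that starts with an underscore."""

    def visit_Attribute(self, node):
        if node.attr.startswith("_"):
            raise PolicyViolation(
                f"line {node.lineno}: attribute {node.attr!r} is not allowed"
            )
        self.generic_visit(node)

    def visit_Name(self, node):
        if node.id.startswith("_"):
            raise PolicyViolation(
                f"line {node.lineno}: name {node.id!r} is not allowed"
            )


def compile_checked(source, filename="<checked>"):
    """Parse, apply the policy to the AST, then compile the same tree."""
    tree = ast.parse(source, filename=filename)
    UnderscoreGuard().visit(tree)
    return compile(tree, filename, "exec")
```

A real policy needs far more rules than this one; the point is only that the check runs on the tree before any bytecode exists.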

Mikko Ohtamaa
  • Please correct me if I'm wrong, but `RestrictedPython` requires python source as input, does it not? – Dima Tisnek Sep 30 '14 at 07:17
  • Yes. That's AST - Abstract Syntax Tree. If you want to be pedantic, that's not the source code itself. Thus the disclaimer (it might not suit your approach). – Mikko Ohtamaa Sep 30 '14 at 07:34
1

Both are outdated, and the first comes without code (at least I can't find any), but they may be useful to give an idea of what can be done, how, and what the limitations are.

Perfectly valid bytecode can still do horrible things.

Alex
  • At least for a simple `"KABOOM"` is noticed by `Python-Bytecode-Verifier` with `verifier.VerificationError: Unverifiable code: Stack underflow. Offset: 0 Stack: 0 Boundary: 0 Required: 2`; of course the package is hopelessly outdated. – Dima Tisnek Sep 30 '14 at 11:56