I am currently constructing a Java decompiler.

To help with pattern recognition, I am building a simple grammar with ANTLR and debugging it with the ANTLRWorks interpreter.

Below is the preliminary grammar so far. In going down this route, I am assuming that I can simplify certain JVM bytecode sequences into expressions that the grammar can detect.

What problems do you see in this approach?

Update: the grammar below has been revised in response to Ira's comments of June 29, 2:36 GMT.

grammar JVM;

options {k=3;}

WS  :   (' '|'\r'|'\n'|'\t')+ {$channel=HIDDEN;}
    ;
INT :   ('0'..'9')+ ;
UINT    :   ('_' INT)?;
IFEQ    :   'ifeq';
IFGE    :   'ifge';
IFGT    :   'ifgt';
IFLE    :   'ifle';
IFLT    :   'iflt';
IFNE    :   'ifne';
IFACMP_CONDTYPE :   'if_acmp' ('eq'|'ne'|'lt'|'ge'|'gt'|'le');
// THIS :   'aload_0';
LDC :   'ldc2_w'|'ldc_w'|'ldc';
LOADREFERENCE
//  :   THIS
    : 'aload' UINT;
//  | 'aload_2'
//  | 'aload_3';
DLOAD   :   'dload' UINT;
LOADINT :   'iload_0'
    |   'iload_1'
    | 'iload_2'
    | 'iload_3'
    ;
DCONST  :   'dconst' UINT;  
ICONST  :   'iconst' UINT;

goal    :   jvmStatement2+ ;

//fragment
//jvmStatement1
//  :   returnStatement
//  | newArrayStatement
//  | storeStatement
//  | assignmentStatement
//  | assertStatement
//  | invokeStatement
//  | ifStatement
//  | gotoStatement
//  ;

fragment // to test assert
jvmStatement2
    : returnStatement     // 2
    | newArrayStatement   // 3
    | storeStatement      // 4
    | invokeStatement     // 5
    | assignmentStatement // 6
    | assertStatement     // 7
    | ifStatement         // 8  
    | gotoStatement
    ;

fragment
setAssertionStatus
    :   ifStatement pushIntegerConstant
    gotoStatement pushIntegerConstant setStaticFieldInClass;

fragment
fetchFieldFromObject
    :   LOADREFERENCE 'getfield' INT;

fragment
loadDoubleFromLocalVariable
    :   DLOAD;

fragment
loadFloatFromLocalVariable
    :   'fload' UINT;

fragment
loadIntFromLocalVariable
    :   LOADINT;

fragment
loadLongFromLocalVariable
    :   'lload' UINT;   

fragment
loadReferenceFromLocalVariable
    :   'aload' UINT;

fragment
loadReferenceFromArray
    :   'aaload';

fragment
storeReference
    : storeIntoByteOrBooleanArray;  

fragment
storeReferenceIntoLocalVariable
    :   'astore' UINT;

fragment
storeDoubleIntoLocalVariable
    :   'dstore' INT;

fragment
storeFloatIntoLocalVariable
    :   'fstore' UINT;

fragment
storeIntIntoLocalVariable
    :   'istore' (INT|UINT);

fragment
storeLongIntoLocalVariable
    :   'lstore' UINT;  

fragment
storeIntoByteOrBooleanArray
    :   'bastore';

fragment
storeIntoReferenceArray
    :   'aastore';

fragment
pushNull:   'aconst_null';

fragment
pushByte:   'bipush' INT;

fragment
pushIntegerConstant
    :   ICONST;

fragment
pushDoubleConstant
    :   DCONST;

fragment
pushLongConstant
    :   'lconst' UINT;

fragment
pushFloatConstant
    :   'fconst' UINT;

fragment
pushItemFromRuntimeConstantPool
    :   LDC INT;


fragment invokeStatementArgument: constantExpr
    | createAnonymousClass;

fragment createAnonymousClass
    :   createNewObject dup thisInstance;

fragment invokeStatementArguments: invokeStatementArgument*;

fragment invokeStatement: getStaticField? invokeStatementArguments invokeMethod;    

fragment
invokeMethod
    : invokeInstanceMethod
    | invokeVirtualMethod
    | invokeStaticMethod
    ;

fragment
invokeInstanceMethod
    :   'invokespecial' INT;

fragment
invokeVirtualMethod
    :   'invokevirtual' INT;    

fragment
invokeStaticMethod
    :   'invokestatic' INT;

fragment
newArrayStatement
    :   'newarray' simpleType;

fragment
setFieldInObject
    :   'putfield' INT;

fragment setStaticFieldInClass
    :   'putstatic' INT;

fragment
simpleType
    :   ('boolean'|'byte'|'char'|'double'|'float'|'int'|'long'|'short');

fragment
returnVoid
    :    'return';
fragment
returnSimpleType
    :   returnReference
    | returnDouble
    | returnFloat
    | returnInteger
    | returnLong;

fragment
returnReference
    :    'areturn';
fragment
returnDouble
    :   'dreturn';
fragment returnFloat
    :   'freturn';
fragment returnInteger
    :   'ireturn';
fragment returnLong
    :   'lreturn';  

fragment
returnStatement
    :   returnVoid 
    | constantExpr returnSimpleType;    

fragment
dupX1
    :   'dup_x1';

fragment
dup
    :   'dup';  

fragment
storeStatement
    : storeReferenceIntoLocalVariable 
    | storeIntIntoLocalVariable
    | setStaticFieldInClass
    | storeIntoReferenceArray
    | setFieldInObject;

fragment
convertDouble
    :   convertDoubleToFloat | convertDoubleToInt | convertDoubleToLong;

fragment
convertDoubleToFloat
    :   'd2f';

fragment
convertDoubleToInt
    :   'd2i';

fragment
convertDoubleToLong
    :   'd2l';

fragment
convertFloat
    :   convertFloatToDouble|convertFloatToInt|convertFloatToLong;

fragment
convertFloatToDouble
    :   'f2d';
fragment
convertFloatToInt
    :   'f2i';
fragment
convertFloatToLong
    :   'f2l';  

fragment
convertInt
    :   convertIntToByte
    |convertIntToChar
    |convertIntToDouble
    |convertIntToFloat
    |convertIntToLong
    |convertIntToShort;

fragment
convertIntToByte
    :   'i2b';

fragment
convertIntToChar
    :   'i2c';

fragment
convertIntToDouble
    :   'i2d';

fragment
convertIntToFloat
    :   'i2f';

fragment
convertIntToLong
    :   'i2l';

fragment
convertIntToShort
    :   'i2s';

fragment
branchComparison
    :branchIfReferenceComparison
    |branchIfIntComparison
    |branchIfIntComparisonWithZero
    |branchIfReferenceNotNull
    |branchIfReferenceNull; 

fragment
branchIfReferenceComparison
    :   'if_acmp' condType;

fragment
branchIfIntComparison
    :   'if_icmp' condType INT;

fragment
branchIfIntComparisonWithZero
    :   (IFEQ|IFGE|IFGT|IFLE|IFLT|IFNE) INT;

fragment
gotoStatement
    :   'goto' INT;

fragment
ifStatementCompare
    :   (IFEQ INT)
    |   (IFNE INT);

fragment
ifStatement
    :   booleanExpression ifStatementCompare;

fragment
ifType  : 'ifeq'
 |'ifne'
 |'iflt'
 |'ifge'
 |'ifgt'
 |'ifle';

fragment
branchIfReferenceNotNull
    :   'ifnonnull' ;

fragment
branchIfReferenceNull
    :   'ifnull';

fragment
condType:   'eq'
 |'ne'
 |'lt'
 |'ge'
 |'gt'
 |'le';

fragment
checkCast
    :   'checkcast' INT;

fragment
createNewArrayOfReference
    :   constantExpr 'anewarray' INT;

fragment
createNewObject
    :   'new' INT;

fragment
assignmentStatement
//  : pushItemFromRuntimeConstantPool storeStatement
    : (constantExpr)+ storeStatement
    | invokeInheritedConstructor
    | expressionStatement
//  | setAssertionStatus
    ;

fragment
invokeInheritedConstructor
    :   loadReferenceFromLocalVariable invokeInstanceMethod;

fragment
throwExceptionOrError
    :   'athrow';

fragment
getStaticField
    :   'getstatic' INT;

fragment
newInstance
    :   'new' INT;

fragment // this needs to be extended to recognize more patterns
booleanExpression
    :   integerComparison
    | loadIntFromLocalVariable
    | invokeMethod;

fragment
integerComparison
    : loadIntFromLocalVariable loadIntFromLocalVariable branchIfIntComparison;  

fragment assertIfAssertEnabled: getStaticField branchIfIntComparisonWithZero;

fragment assertCondition:booleanExpression branchIfIntComparisonWithZero;

fragment assertThrow:createNewObject dup assertMessage throwExceptionOrError;

fragment assertMessage:pushItemFromRuntimeConstantPool invokeMethod;

fragment assertStatement:assertIfAssertEnabled assertCondition assertThrow;


fragment
stringPlusNumber
    :pushItemFromRuntimeConstantPool invokeMethod 
 loadReferenceFromLocalVariable invokeMethod invokeMethod invokeMethod;

fragment expressionStatement:   statementExpression;

fragment
statementExpression 
    :   preIncrementExpression
    | preDecrementExpression
//  | postIncrementExpression
//  | postDecrementExpression
    | newByteArray
    | ternaryExpression
    | createAndStoreObject // assignment expression
    | createNewArrayStatement
    | fetchFieldFromObject
    ;

fragment
createNewArrayStatement // with elements
    :   createNewArrayOfReference createNewArrayInitElement+;

createNewArrayInitElement
    : (dup constantExpr getStaticField storeStatement);

fragment
createAndStoreObject
    :   createNewObject dup invokeStatement storeStatement;

fragment ternaryExpression // doesn't cover all situations yet
    : loadIntFromLocalVariable ifStatementCompare loadIntFromLocalVariable gotoStatement
    loadIntFromLocalVariable storeStatement;    

fragment preIncrementExpression: preIncrementInteger;

fragment preDecrementExpression: preDecrementFloat|preDecrementLong|preDecrementDouble; 

fragment doubleExpression: pushDoubleConstant;

fragment integerExpression: pushIntegerConstant;

fragment longExpression: pushLongConstant;

fragment floatExpression: pushFloatConstant;

fragment preIncrementInteger: loadReferenceFromLocalVariable dup fetchFieldFromObject integerExpression 
    iAdd dupX1? setFieldInObject;

fragment preDecrementDouble: loadDoubleFromLocalVariable doubleExpression dSub storeDoubleIntoLocalVariable;

fragment preDecrementLong: loadLongFromLocalVariable longExpression lSub storeLongIntoLocalVariable;

fragment preDecrementFloat: loadFloatFromLocalVariable floatExpression fSub storeFloatIntoLocalVariable;

fragment newByteArray: newByteArrayWithNull|newByteArrayWithData;

// byte[] b = {'c', 'h', 'u', 'a'};
fragment newByteArrayWithData:  constantExpr newArrayStatement byteArrayElements;

fragment byteArrayElements: constantExpr constantExpr storeIntoByteOrBooleanArray;  

fragment constantExpr: 
    //loadReferenceFromLocalVariable
    LOADREFERENCE
    |loadDoubleFromLocalVariable
    |loadFloatFromLocalVariable
    |loadIntFromLocalVariable
    |loadLongFromLocalVariable
    |pushByte
    |pushDoubleConstant
    |pushFloatConstant
    |pushIntegerConstant
    |pushItemFromRuntimeConstantPool
    |pushLongConstant
    |pushNull
    |fetchFieldFromObject
    ;

// byte[] c = null;
// String s = null;
fragment newByteArrayWithNull: pushNull (checkCast)? storeReference;

fragment thisInstance:  LOADREFERENCE invokeMethod;

fragment ternaryOperator
    :   ifStatementCompare pushIntegerConstant gotoStatement pushIntegerConstant setStaticFieldInClass;

fragment floatMultiply
    :   constantExpr constantExpr dMul;

fragment iAdd: 'iadd';      
fragment dSub: 'dsub';
fragment fSub: 'fsub';
fragment lSub: 'lsub';
fragment lAdd: 'ladd';  
fragment dMul: 'dmul';

For example, the current grammar (a further evolution of the above) can turn

getstatic 25
ifne 25
iload_1
iload_2
if_icmpgt 25
new 25
dup
invokespecial 44
athrow
return

into

[image not available]

chuacw
  • Minor suggestion. What questions on SO are related to this one? – r4. Jun 20 '12 at 10:07
  • This is an interesting question. Will this be an open source project? You might want to have a look at jreversepro. – carlspring Jun 25 '12 at 08:20
  • What's with the "fragment" business? I'm not an ANTLR expert, but I'm expecting rules to look like "lhs : rhs1 | rhs2 | ... | rhsn ;" not "fragment lhs : ... " – Ira Baxter Jun 29 '12 at 03:00
  • Ira, fragments are subrules - http://www.antlr.org/wiki/display/ANTLR3/1.+Lexer You can just ignore the "fragment" keyword, and pretend that fragment XXX is the same as XXX. – chuacw Jun 29 '12 at 03:05
  • carlspring, too early to talk about whether it'll be open source or not. – chuacw Jul 03 '12 at 15:18

2 Answers

If all you want to recognize are the individual JVM instructions, then a grammar might be OK, though you'll probably spend time fiddling with it to get the details right. It might simply be overkill. A byte-opcode-driven finite state automaton (FSA) implemented as a giant case statement might be easier; after all, the JVM instructions are supposed to be easy to decode so that a semi-fast interpreter can execute them.

Based on vague recall, there are other sections (tables, e.g., literals) in the class file. You can probably recognize them with the parser too, but that is also likely overkill.
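As a rough illustration of the case-statement idea, here is a minimal sketch that decodes only a handful of opcodes and prints a trace line per instruction. The class name `OpcodeDecoder` and the method `decode` are made up for this example; the opcode values are the standard JVM ones, and anything not listed simply throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration: one switch, a few opcodes only.
class OpcodeDecoder {

    /** Decodes a method body into one mnemonic string per instruction. */
    static List<String> decode(byte[] code) {
        List<String> out = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF;           // Java bytes are signed
            String text;
            int length;                             // instruction length in bytes
            switch (opcode) {
                case 0x03: case 0x04: case 0x05:
                case 0x06: case 0x07: case 0x08:    // iconst_0 .. iconst_5
                    text = "iconst_" + (opcode - 0x03);
                    length = 1;
                    break;
                case 0x10:                          // bipush <signed byte>
                    text = "bipush " + code[pc + 1];
                    length = 2;
                    break;
                case 0x2A: text = "aload_0";  length = 1; break;
                case 0x4C: text = "astore_1"; length = 1; break;
                case 0xB1: text = "return";   length = 1; break;
                case 0xB6:                          // invokevirtual <u2 index>
                    text = "invokevirtual " + u2(code, pc + 1);
                    length = 3;
                    break;
                case 0xBD:                          // anewarray <u2 index>
                    text = "anewarray " + u2(code, pc + 1);
                    length = 3;
                    break;
                default:
                    throw new IllegalArgumentException(
                        "opcode 0x" + Integer.toHexString(opcode) + " not handled in this sketch");
            }
            System.out.println("at offset " + pc + ": " + text);  // tracing
            out.add(text);
            pc += length;
        }
        return out;
    }

    /** Reads an unsigned big-endian 16-bit operand. */
    private static int u2(byte[] code, int at) {
        return ((code[at] & 0xFF) << 8) | (code[at + 1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] code = {0x2A, (byte) 0xB6, 0x00, 0x53, (byte) 0xBD, 0x00, 0x51, 0x4C, (byte) 0xB1};
        decode(code);
    }
}
```

The sample bytes in `main` are hand-assembled for the example; a real decoder would need an arm for every opcode, including the variable-length ones.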

You have the second problem of collecting the instruction/table information after you recognize them; parser generators tend to want to help you build some kind of AST. The instructions aren't an AST; they're at least a linear chain and if you include the jump targets, they form a graph with references to the tables. So I suspect you'll end up struggling to get the semantic actions to collect the data the way you want.

And it's the graph you likely want to capture. To the extent that the graph has some kind of hierarchical structure (being derived from a structured programming language), you might want to discover that hierarchy. The parser approach contributes nothing here.
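To make the "graph" point concrete, here is a small sketch that derives successor edges from already-decoded instructions. The names `FlowGraph`, `Insn`, and `successors` are invented for the example, and it assumes branch targets have already been resolved to absolute offsets. It ignores switches, `jsr`, and exception handlers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: instructions become nodes, control transfers become edges.
class FlowGraph {

    /** A decoded instruction; target is an absolute offset, or -1 if not a branch. */
    static final class Insn {
        final int offset;
        final String mnemonic;
        final int target;

        Insn(int offset, String mnemonic, int target) {
            this.offset = offset;
            this.mnemonic = mnemonic;
            this.target = target;
        }
    }

    /** Maps each instruction offset to the offsets that may execute next. */
    static Map<Integer, List<Integer>> successors(List<Insn> insns) {
        Map<Integer, List<Integer>> edges = new TreeMap<>();
        for (int i = 0; i < insns.size(); i++) {
            Insn in = insns.get(i);
            List<Integer> next = new ArrayList<>();
            boolean isGoto = in.mnemonic.equals("goto");
            // return, ireturn, areturn, ... and athrow never fall through
            boolean endsFlow = in.mnemonic.endsWith("return") || in.mnemonic.equals("athrow");
            if (!isGoto && !endsFlow && i + 1 < insns.size()) {
                next.add(insns.get(i + 1).offset);   // fall-through edge
            }
            if (in.target >= 0) {
                next.add(in.target);                 // branch edge
            }
            edges.put(in.offset, next);
        }
        return edges;
    }

    public static void main(String[] args) {
        // Roughly what "return flag ? 1 : 0;" compiles to (hand-written example).
        List<Insn> insns = Arrays.asList(
            new Insn(0, "iload_1", -1),
            new Insn(1, "ifeq", 8),
            new Insn(4, "iconst_1", -1),
            new Insn(5, "goto", 9),
            new Insn(8, "iconst_0", -1),
            new Insn(9, "ireturn", -1));
        System.out.println(successors(insns));
    }
}
```

The conditional branch at offset 1 has two successors, and both arms rejoin at offset 9. A linear parse of the instruction list does not expose that join directly.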

Ira Baxter
  • Ira, can you expand FSA? Do you mean Finite State Automata? – chuacw Jun 20 '12 at 13:54
  • @chuacw: yes, that's what I meant. Revised answer accordingly. – Ira Baxter Jun 20 '12 at 13:55
  • The parser approach is actually for me to quickly test out byte code recognition (using text), instead of writing high level code that recognizes bytecodes by themselves. Once I get the grammar correct, I will rewrite it as high level code instead. – chuacw Jun 20 '12 at 13:56
  • @chuacw: I think it's more work than it's worth. I suspect you need just one big case statement, and it's pretty hard to have a simpler code skeleton than that. If you put tracing information at the top of the case statement ("at offset N I'm decoding byte code M") and in each arm ("I've decoded a PUSH LITERAL with literal value 17") I think your debugging process will go extremely fast. – Ira Baxter Jun 20 '12 at 14:00
  • Thanks, Ira. The problem I have with the case statement approach is that I have no idea how to organize the bytecodes into recognizable patterns. With the grammar approach, on the other hand, when I turn bytecodes (e.g., 0x62, 0x30) into their textual equivalents (e.g., fadd, faload), the grammar can tell me which patterns it has recognized. – chuacw Jun 20 '12 at 14:10
  • I think you're overthinking the problem. Sketch the loop/switch statement. Code up a few opcodes (e.g., fadd) to see what it looks like. I think you'll find this actually pretty easy; after all, you end up with 256 special cases, each of which should be easy, one for each opcode :-} – Ira Baxter Jun 20 '12 at 14:14
  • Ira, I already have a switch statement (covering all opcodes). The problem I currently face is aggregating several bytecodes into a recognized pattern. The grammar helps with that. – chuacw Jun 20 '12 at 14:23
  • So, "it's the graph you likely want to capture". You'll perhaps pick up little subexpressions, but a tree-oriented device (like ANTLR) will not be able to track the long-range use of values on the stack. If you're happy with what you're getting (you are certainly pushing this line hard), then your approach is fine. I just don't see it going the distance; it's the wrong way to recognize a graph. – Ira Baxter Jun 20 '12 at 14:30
  • I agree with Ira about over-thinking. A decompiler is a recursive pattern matcher. Patterns correspond to source level statement types. A pattern matches a sequence of items, where an item is either an instruction or another pattern (this last is the recursion). The decompiler is a heuristic bottom-up search for a set of patterns that exactly covers the entire instruction stream. Tokenizing the instruction stream is step zero and a very simple step compared to designing the patterns and the search. – Gene Jun 26 '12 at 18:14

This approach has the problem of recognizing the nesting of parameters.

For example, given the declarations

int func1(int x, int y, int z) {
    return 0;
}

int func0() {
    return 0;
}

and the calls

Object[] x = new Object[func1(2, 3, 4)];
x = new Object[func0()];
x = new Object[func1(func1(func1(0, 1, 2), 3, 4), 5, 6)];

which generates the following bytecode:

Offset  Instruction       Comments (Method: none)
0       aload_0           (cheewee.helloworld.test000031_applet this)
1       iconst_2
2       iconst_3
3       iconst_4
4       invokevirtual 79  (cheewee.helloworld.test000031_applet.func1)
7       anewarray 81      (java.lang.Object)
10      astore_1          (java.lang.Object[] x)
11      aload_0           (cheewee.helloworld.test000031_applet this)
12      invokevirtual 83  (cheewee.helloworld.test000031_applet.func0)
15      anewarray 81      (java.lang.Object)
18      astore_1          (java.lang.Object[] x)
19      aload_0           (cheewee.helloworld.test000031_applet this)
20      aload_0           (cheewee.helloworld.test000031_applet this)
21      aload_0           (cheewee.helloworld.test000031_applet this)
22      iconst_0
23      iconst_1
24      iconst_2
25      invokevirtual 79  (cheewee.helloworld.test000031_applet.func1)
28      iconst_3
29      iconst_4
30      invokevirtual 79  (cheewee.helloworld.test000031_applet.func1)
33      iconst_5
34      bipush 6
36      invokevirtual 79  (cheewee.helloworld.test000031_applet.func1)
39      anewarray 81      (java.lang.Object)
42      astore_1          (java.lang.Object[] x)
43      return

The grammar is unable to detect that nesting is involved. I'm not sure whether this is a limitation of ANTLR or of my understanding of how to write an ANTLR grammar.

The next step would be to use a hybrid method: first simplify groups of bytecodes into tokens (so that they can be recognized as simpler patterns), then pass those tokens to a parser that detects higher-level patterns.

chuacw
  • The grammar you've proposed in your question appears to be focused on recognizing individual opcodes, in spite of the "expression"-like nonterminals. If you only want the individual opcodes, I think ANTLR is complete overkill; stick with the opcode FSA. But if you are going to recognize "expressions", then you need grammar rules that say an expression can be a series of expressions followed by an invokevirtual opcode. Abstractly, I would expect that to pick up the parameters, but you'll probably need infinite lookahead to ensure that the parser can "guess" the right number of parameters per call. – Ira Baxter Jun 28 '12 at 10:37
  • Ira, in the later versions of the grammar I have, there are expressions followed by an invokexxx opcode. Unfortunately, the same problem occurs: expressions can be nested. – chuacw Jun 29 '12 at 02:08
  • Right. Nesting of expressions occurs in grammars for conventional languages, too, and it can be recognized by conventional parsers. *If* JVM code is always generated in structured blocks, then you can probably recognize the structures with the grammar; as I observed earlier, if it is organized as an arbitrary graph (like assembler code), a context-free grammar is going to have a hard time. Sticking with the former case, you need productions like: stmt = exp; stmt = exp jlt_opcode target ; stmt = exp store_instruction ; as well as exp = push_literal literal; exp = push_variable var_name – Ira Baxter Jun 29 '12 at 02:32
  • exp = exp exp add_opcode; exp = expression_list invokevirtual ; expression_list = ; expression_list = expression_list expression ; This last should handle nested expressions. Is your current grammar organized something like this? – Ira Baxter Jun 29 '12 at 02:36
  • Ira, I updated the grammar in response to your comment above. InvokeStatements. The grammar has problems with the code at Offset 19 to Offset 42, which is x = new Object[func1(func1(func1(0, 1, 2), 3, 4), 5, 6)] – chuacw Jun 29 '12 at 02:50
  • Your (revised) grammar layout doesn't match what I suggested. You need a nonterminal ("exp") that can recognize a sequence of instructions that push exactly *one* value on the stack. Your grammar has arguments for invokevirtual, sort of as I have suggested, but the individual arguments have to be arbitrary expressions (including sequences that push arguments and invokevirtual); you've limited them to just constant_exprs. – Ira Baxter Jun 29 '12 at 03:06
  • ... having said that, I can now see a potential problem: does invokevirtual push its return value on the stack, replacing the arguments? Or can an invokevirtual call a method that returns "void"? (Hmm, if it does, that call won't be in the middle of an RPN expression, so maybe there isn't a problem.) – Ira Baxter Jun 29 '12 at 03:08
  • You had two questions in your comments. Answer to Q1) invokevirtual can return void (nothing is pushed onto the stack), and it can also return any simple type, or any reference (something is pushed onto the stack). And invokevirtual (and other invokexxx calls) can take any number of expressions (depending on how it's declared at the higher level). Answer to Q2) invokevirtual can call a method that returns void (or returns anything else). – chuacw Jun 29 '12 at 03:12
  • invokevirtual on a function that returns void can only be a statement, so you need a grammar rule "stmt = expressionlist invokevirtual;", and invokevirtual on a function that returns a non-void value can only be in the middle of an expression (much like "push"), thus the need for a grammar rule "exp = expressionlist invokevirtual". – Ira Baxter Jun 29 '12 at 03:46
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/13203/discussion-between-chuacw-and-ira-baxter) – chuacw Jun 29 '12 at 05:11
  • I won't do that. I've had bad experiences with chat, accepting a long response and then throwing it away. Writing a long response is difficult enough; to be rewarded by losing it is a show stopper and I won't do that again. I'd rather suffer from short interactions here; these don't get thrown away like that. – Ira Baxter Jun 29 '12 at 10:31
  • Sure, I've no problems with that. I just don't know whether it's bad etiquette not to do so or not, since there was a comment that said "Please avoid extended discussions in comments". – chuacw Jun 29 '12 at 16:03
  • The SO engineers don't like this for some reason, but their chat solution isn't one. So, Mexican standoff. In the meantime... do you want me to edit your grammar to show you what I think you need? (I can modify your question text). – Ira Baxter Jun 30 '12 at 03:56
  • ... maybe better, check out my bio and send me an email. – Ira Baxter Jun 30 '12 at 03:59
  • Bounty to @chuacw. Seems to me like some points would do you good. (This is a "special occasion". Have absolutely no intentions to misuse bounties in the future). Thanks Ira for interesting input! – r4. Jun 30 '12 at 14:12
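
The productions discussed in the comments above (exp = push..., exp = expression_list invokevirtual) describe postfix code, so one alternative to a grammar is to simulate the operand stack directly. The sketch below does that for the textual instructions at offsets 19 to 43 of the listing. The class `StackSketch`, the two lookup tables, and the hard-coded `Object` element type and variable name `x` are all assumptions made for this example. In a real decompiler the method name and argument count would come from the method descriptor in the constant pool, which is exactly the information a context-free grammar does not have.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rebuilds nested calls by simulating the operand stack.
class StackSketch {

    // Assumed stand-ins for constant pool lookups (indices taken from the listing).
    static final Map<Integer, String> METHOD_NAMES = new HashMap<>();
    static final Map<Integer, Integer> ARG_COUNTS = new HashMap<>();
    static {
        METHOD_NAMES.put(79, "func1"); ARG_COUNTS.put(79, 3);
        METHOD_NAMES.put(83, "func0"); ARG_COUNTS.put(83, 0);
    }

    /** Turns straight-line textual bytecode into source-like statements. */
    static List<String> rebuild(List<String> insns) {
        Deque<String> stack = new ArrayDeque<>();
        List<String> statements = new ArrayList<>();
        for (String insn : insns) {
            String[] parts = insn.split(" ");
            String op = parts[0];
            if (op.equals("aload_0")) {
                stack.push("this");
            } else if (op.startsWith("iconst_")) {       // iconst_0 .. iconst_5 only
                stack.push(op.substring("iconst_".length()));
            } else if (op.equals("bipush")) {
                stack.push(parts[1]);
            } else if (op.equals("invokevirtual")) {
                int index = Integer.parseInt(parts[1]);
                int argc = ARG_COUNTS.get(index);
                String[] args = new String[argc];
                for (int i = argc - 1; i >= 0; i--) {     // last argument is on top
                    args[i] = stack.pop();
                }
                String receiver = stack.pop();
                String call = METHOD_NAMES.get(index) + "(" + String.join(", ", args) + ")";
                // Assumes a non-void method: the result goes back on the stack.
                stack.push(receiver.equals("this") ? call : receiver + "." + call);
            } else if (op.equals("anewarray")) {          // element type assumed Object
                stack.push("new Object[" + stack.pop() + "]");
            } else if (op.equals("astore_1")) {           // local 1 assumed to be x
                statements.add("x = " + stack.pop() + ";");
            } else if (op.equals("return")) {
                statements.add("return;");
            } else {
                throw new IllegalArgumentException("unhandled: " + insn);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        List<String> insns = Arrays.asList(
            "aload_0", "aload_0", "aload_0", "iconst_0", "iconst_1", "iconst_2",
            "invokevirtual 79", "iconst_3", "iconst_4", "invokevirtual 79",
            "iconst_5", "bipush 6", "invokevirtual 79", "anewarray 81",
            "astore_1", "return");
        // prints: x = new Object[func1(func1(func1(0, 1, 2), 3, 4), 5, 6)];
        //         return;
        rebuild(insns).forEach(System.out::println);
    }
}
```

Because each `invokevirtual` pops exactly as many values as its descriptor says, the nesting falls out of the stack discipline without any lookahead. This only works for straight-line code; branches need the control-flow graph first.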