I need to write a lexer for a Java source code plagiarism detector. Here is an example of what I want to achieve.

//Java code                                   Tokens:
public class Count {                          Begin Class
    public static void main(String[] args)    Var Def, Begin Method
        throws java.io.IOException {
      int count = 0;                          Var Def, Assign
      while (System.in.read() != -1)          Apply, Begin While
        count++;                              Assign, End While
      System.out.println(count+" chars.");    Apply

    }                                         End Method
}                                             End Class

I think JFlex is the right tool to generate the lexer. However, after looking through some examples, I cannot find a way to distinguish class brackets from method brackets. Most tokenizers I find just recognize them as the same token. Also, how do I distinguish a method apply from a variable identifier?

F. Zhao

1 Answer

I cannot find a way to distinguish class brackets and method brackets.

There is nothing lexically different about them. "{".equals("{"). The way you distinguish them is by context in the parser. The lexer can't make that distinction, nor should it.

Also how do I distinguish a method apply from a variable identifier?

In the lexer, you don't. An identifier is an identifier. The token stream generated from "f(x)" should be Identifier, OpeningParenthesis, Identifier, ClosingParenthesis.
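To make the token stream above concrete, here is a minimal hand-written sketch in Java. It is not JFlex output and does not use any JFlex API; the class name `TinyLexer` and the `Kind` enum are hypothetical names chosen for illustration, and anything that is not an identifier or a parenthesis is lumped into `OTHER`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal lexer; the names are illustrative, not part of JFlex.
class TinyLexer {
    enum Kind { IDENTIFIER, OPENING_PARENTHESIS, CLOSING_PARENTHESIS, OTHER }

    // Turns source text into a flat list of token kinds, using no context at all.
    static List<Kind> tokenize(String src) {
        List<Kind> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace produces no token
            } else if (Character.isJavaIdentifierStart(c)) {
                // Consume the whole identifier. The lexer cannot know whether
                // it names a method or a variable.
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) {
                    i++;
                }
                out.add(Kind.IDENTIFIER);
            } else {
                if (c == '(') {
                    out.add(Kind.OPENING_PARENTHESIS);
                } else if (c == ')') {
                    out.add(Kind.CLOSING_PARENTHESIS);
                } else {
                    out.add(Kind.OTHER);
                }
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [IDENTIFIER, OPENING_PARENTHESIS, IDENTIFIER, CLOSING_PARENTHESIS]
        System.out.println(tokenize("f(x)"));
    }
}
```

Note that `f` and `x` come out as the same kind of token; nothing at this level marks `f` as a method.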

Now in the parser you'll recognize a function name by the fact that it's followed by an opening parenthesis, but again that's the parser's, not the lexer's, job.
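As a rough illustration of what "context" means here, the sketch below walks an already-lexed token list and labels braces and calls using the token names from the question. It is a deliberately simplified stand-in for a real parser, not one: the class name `ContextPass` is hypothetical, tokens are plain non-empty strings, and it assumes no initializer blocks, lambdas, anonymous classes, or constructor calls, all of which it would mislabel.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Hypothetical context pass over a token list; a sketch, not a Java parser.
class ContextPass {
    // Keywords that may be followed by "(" without being a method call.
    static final Set<String> KEYWORDS =
            Set.of("if", "while", "for", "switch", "catch", "synchronized");

    static List<String> classify(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // kinds of the currently open braces
        boolean classPending = false;            // saw "class", waiting for its "{"
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("class")) {
                classPending = true;
            } else if (t.equals("{")) {
                String kind;
                if (classPending) {
                    kind = "Class";
                    classPending = false;
                } else if ("Class".equals(open.peek())) {
                    kind = "Method"; // assumption: a brace directly in a class body opens a method
                } else {
                    kind = "Block";
                }
                open.push(kind);
                out.add("Begin " + kind);
            } else if (t.equals("}")) {
                if (!open.isEmpty()) {
                    out.add("End " + open.pop());
                }
            } else if (Character.isJavaIdentifierStart(t.charAt(0))
                    && !KEYWORDS.contains(t)
                    && i + 1 < tokens.size() && tokens.get(i + 1).equals("(")
                    && !open.isEmpty() && !"Class".equals(open.peek())) {
                // Identifier followed by "(" inside a method body: treat as a call.
                out.add("Apply");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("class", "C", "{", "void", "m", "(", ")", "{",
                "f", "(", "x", ")", ";", "}", "}");
        // prints [Begin Class, Begin Method, Apply, End Method, End Class]
        System.out.println(classify(tokens));
    }
}
```

The same `{` token is labeled differently depending on the stack of enclosing braces, and `m` is not reported as `Apply` because it appears directly in the class body. A grammar-based parser does this more robustly.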

sepp2k
  • Thank you for your clarification. Is there any existing example code or tool that can be modified to parse the code in my way? – F. Zhao Nov 24 '16 at 00:54
  • @Y.Zhao There are example Java grammars for various parser generators, but I couldn't find a current one for one that you'd use together with JFlex. I don't think JFlex+Cup or JFlex+BYaccJ are popular combinations anymore (if they ever were). If you're not married to JFlex, you should have an easy time finding a current Java grammar for Antlr. – sepp2k Nov 24 '16 at 18:27