I'm making a simple parser for some Java-like language (just for learning purposes). I'm having trouble determining whether a statement is a variable declaration. This may be a problem with my lexer (which is really sloppy). If the lexer sees some text, it simply labels it an identifier, even if that text is a keyword or a type. The job of telling those apart is given to the parser.

This has worked so far, but now I'm trying to parse variable declarations, like this one here:

int x = 3;

The problem is I don't know how to determine whether this is a variable declaration. If I just look at the first token and find that it's an "identifier", that doesn't tell me anything, since this line of code also starts with an identifier:

System.out.print("hi");

And statements like this are handled by another part of the parser.

Another solution I thought about was checking to see if the first token is a type. For example, I could have a method that looks something like:

boolean isType(String t) {
    // Recognizes only a fixed, hard-coded set of type names.
    return t.equals("int")
        || t.equals("long")
        || t.equals("char"); /* et cetera */
}

The problem with this is that it only allows for a certain set of types. Since my little language is compiled to Java bytecode, I need it to recognize arbitrary classes as types.

So my question is: is it possible to determine whether a statement is a variable declaration or not, without knowing all the possible variable types?

  • You could be optimistic and _hope_ that the Java code conforms to style guides, in which case class names would all have their first letter capitalized. This is the simplest case at least. – Hunter McMillen Aug 08 '12 at 19:33
  • possible? yes, what else would look like a word followed by a space followed by another word followed by an equality sign? Needed? probably not, java itself knows all types on compile time, it gives you errors if you try to compile a file with an unknown type – Benjamin Gruenbaum Aug 08 '12 at 19:34
  • if you have two identifiers following each other, what else can it be? – Qnan Aug 08 '12 at 19:36

4 Answers

Another solution is to have the parser and the lexer collaborate using a symbol table. Once the parser has determined that a new type name has been declared, it would insert that name in the symbol table as a type name. The lexer, in turn, would consult the symbol table to see if the new identifier-like word is a type name or not, and choose the correct token type accordingly.

There are complications, however.

  • If the language allows for an inner scope to redefine a type name as the name of another type or as a non-type identifier, the symbol table must understand scoping and the parser must inform the symbol table when a scope ends.
  • If the language allows a type name to be an ordinary identifier in some contexts, the parser must be able to cope with that.
  • If the parser backtracks, it must remember to undo symbol table changes as well.

It's not quite as clean as having the lexer be oblivious to context, but in return it (in some cases) allows the parser to avoid excessive lookahead and backtracking; though I think a Java parser doesn't necessarily need that sort of help.
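As a rough illustration of the collaboration described above, here is a minimal sketch of a scoped symbol table that a lexer could consult. The class name `TypeTable`, its methods, and the token labels `TYPE_NAME` and `IDENTIFIER` are all hypothetical names chosen for this example, not part of any real library. It covers the scoping complication only; undoing changes on backtracking is not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical scoped symbol table shared by the parser and the lexer.
final class TypeTable {
    // Each scope maps a name to true (type name) or false (ordinary identifier).
    // The innermost scope sits at the head of the deque.
    private final Deque<Map<String, Boolean>> scopes = new ArrayDeque<>();

    TypeTable() {
        enterScope(); // global scope
    }

    // The parser calls these when a block begins and ends.
    void enterScope() { scopes.push(new HashMap<>()); }
    void exitScope()  { scopes.pop(); }

    // The parser calls these once it has recognized a declaration.
    void declareType(String name)     { scopes.peek().put(name, true); }
    void declareVariable(String name) { scopes.peek().put(name, false); }

    // Searches from the innermost scope outward; the first match wins,
    // so an inner variable can shadow an outer type name.
    boolean isTypeName(String word) {
        for (Map<String, Boolean> scope : scopes) {
            Boolean isType = scope.get(word);
            if (isType != null) {
                return isType;
            }
        }
        return false;
    }

    // The lexer calls this for every identifier-like word it reads.
    String classify(String word) {
        return isTypeName(word) ? "TYPE_NAME" : "IDENTIFIER";
    }
}
```

With this in place, the parser would call `declareType("Point")` after parsing a class declaration, and the lexer would then label later occurrences of `Point` as `TYPE_NAME` until a scope that shadows it is entered.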

ibid

When you read the first word, you don't know if it's a declaration or not, but you don't need to.

Once you reach the next separator, you know what the first word was for.

Peter Lawrey

I had to do something of this flavor for a class about 4 years ago, although I don't remember all the details of the "official" way to do it.

What I would do is look ahead at the upcoming symbols to determine whether or not it's a variable declaration. As Benjamin Gruenbaum said, if you see a legal identifier (at the beginning of a line) followed by another legal identifier, then the first one is probably a type and the statement is probably a variable declaration.
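The lookahead idea can be sketched roughly as follows. This assumes a token list in which every word is labeled `IDENT`, as in the question; the names `Kind`, `Token`, and `DeclLookahead` are made up for this example. It also skips over dotted names, on the assumption that the language allows qualified class names as types.

```java
import java.util.List;

// Hypothetical token kinds; the lexer labels every word IDENT.
enum Kind { IDENT, DOT, ASSIGN, LPAREN, RPAREN, SEMI, NUMBER, STRING }

// Hypothetical token type: a kind plus the matched text.
final class Token {
    final Kind kind;
    final String text;

    Token(Kind kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

final class DeclLookahead {
    /**
     * Returns true if the tokens starting at pos look like a variable
     * declaration: a (possibly dotted) name, then another identifier,
     * then '=' or ';'. No tokens are consumed.
     */
    static boolean looksLikeDeclaration(List<Token> tokens, int pos) {
        int i = pos;
        if (!is(tokens, i, Kind.IDENT)) {
            return false;
        }
        i++;
        // Skip the rest of a qualified name such as java.util.Scanner.
        while (is(tokens, i, Kind.DOT) && is(tokens, i + 1, Kind.IDENT)) {
            i += 2;
        }
        // Two names in a row: the second would be the variable name.
        if (!is(tokens, i, Kind.IDENT)) {
            return false;
        }
        i++;
        // '(' here would indicate a function declaration instead.
        return is(tokens, i, Kind.ASSIGN) || is(tokens, i, Kind.SEMI);
    }

    // Bounds-checked test of the token kind at index i.
    private static boolean is(List<Token> tokens, int i, Kind kind) {
        return i < tokens.size() && tokens.get(i).kind == kind;
    }
}
```

For the tokens of `int x = 3;` this returns true, while for `System.out.print("hi");` it returns false, because the dotted name is followed by `(` rather than a second identifier.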

  • I just tried this and it works fine. I can't think of any case where two identifiers would be anything but a variable declaration either (except a function declaration, but that's handled elsewhere, so there shouldn't be any trouble). Thanks! –  Aug 08 '12 at 19:56
  • @Hassan A function declaration will also have a list of parameters after it, or at the very least a `()` after it, which is just one more lookahead if you need to distinguish them from fields or something. – Sam I am says Reinstate Monica Aug 08 '12 at 19:59
  • Yes, but there are usually some modifiers before it, like "public", "private", "static", etc. Of course this is somewhat Java-specific. –  Aug 08 '12 at 20:02

You should probably read a book on compiler design, and probably look at the lex and yacc code before trying this. Or you could google "writing a compiler".

IIRC, and it's been a while, first you break your source file into a parse tree, then you walk the parse tree to generate the object code. When you're breaking the source file down, you check each token against your list of keyword tokens.

In your example your lexer would see 'int' and process it, looking for the variable declarations that must follow the keyword (or precede it, depending on your language definition).

This makes it seem easy, but there is a reason why most people use tools like lex or flex to generate the lexer, and yacc to generate the parser.
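The keyword check mentioned above could look something like this minimal sketch. The class name `KeywordLexer`, the labels `KEYWORD` and `IDENTIFIER`, and the particular keyword list are assumptions made for illustration. Note that a fixed list like this only covers built-in words, so user-defined class names would still come out as plain identifiers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical lexer helper that separates keywords from identifiers.
final class KeywordLexer {
    // Illustrative subset of reserved words; a real language would list all of them.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "int", "long", "char", "boolean",
        "if", "else", "while", "return", "class"));

    // Labels a word as soon as the lexer has read it.
    static String classify(String word) {
        return KEYWORDS.contains(word) ? "KEYWORD" : "IDENTIFIER";
    }
}
```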

Jim Barrows
  • Thanks for the answer. I will probably have to do more reading to better understand compilers, but I've already done the steps you mentioned (parse tree and code generation), and I can compile some simple code now. And I'm doing this for my own education, so I'd like to avoid lex or yacc. –  Aug 08 '12 at 19:46