
What approach would allow me to get the most out of reporting lexing errors?

As a simple example, I would like to write a grammar for the following text

(whitespace is ignored, and for simplicity string constants cannot contain an escaped `\"`):

myvariable = 2
myvariable = "hello world"

Group myvariablegroup {
    myvariable = 3
    anothervariable = 4
}

Catching errors with a lexer

How can you maximize the error reporting potential of a lexer?

After reading this post: Where should I draw the line between lexer and parser?

I understand that the lexer should match as much as it can with regard to the parser grammar, but what about lexical error reporting strategies?

What are the ordinary strategies for catching lexing errors?

I am imagining a grammar which would have the following "error" tokens:

GROUP_OPEN: 'Group' WS ID WS '{';
EMPTY_GROUP: 'Group' WS ID WS '{' WS '}';
EQUALS: '=';
STRING_CONSTANT: '"' ~["]+ '"';
GROUP_CLOSE: '}';
GROUP_ERROR: 'Group' .; // the . character is an invalid token
                        // you probably meant '{'
GROUP_ERROR2: .'roup' ; // Did you mean 'group'?
STRING_CONSTANT_ERROR: '"' .+; // Unterminated string constant
ID: [a-z][a-z0-9]+;
WS: [ \n\r\t]+ -> skip;
SINGLE_TOKEN_ERRORS: .+?;
asked by Har
  • possible duplicate of [Where should I draw the line between lexer and parser?](http://stackoverflow.com/questions/5362078/where-should-i-draw-the-line-between-lexer-and-parser) – Terence Parr Jul 14 '14 at 00:46
  • You have a fairly broad question here. Can you narrow things down? Perhaps you can look at the section in the book called "Drawing the Line Between Lexer and Parser" on page 79. I also suggest that you do a search before asking questions. This question is a duplicate of http://stackoverflow.com/questions/5362078/where-should-i-draw-the-line-between-lexer-and-parser – Terence Parr Jul 14 '14 at 00:47

1 Answer


There are clearly some problems with your approach:

  • You are skipping WS (which is good), yet you're using it in your other rules. But you're in the lexer, which leads us to...

  • Your groups are being recognized by the lexer. I don't think you want them to become a single token. Your groups belong in the parser.

  • Your grammar, as written, will create specific token types for things ending in roup, so croup for instance may never match an ID. That's not good.

  • STRING_CONSTANT_ERROR is much too broad. It's able to glob the entire input. See my UNTERMINATED_STRING below.

  • I'm not quite sure what happens with SINGLE_TOKEN_ERRORS... See below for an alternative.

Now, here are some examples of error tokens I use, and this works very well for error reporting:

UNTERMINATED_STRING
    :   '"' ('\\' ["\\] | ~["\\\r\n])*
    ;

UNTERMINATED_COMMENT_INLINE
    :   '/*' ('*' ~'/' | ~'*')*? EOF -> channel(HIDDEN)
    ;

// This should be the LAST lexer rule in your grammar
UNKNOWN_CHAR
    :   .
    ;

Note that these unterminated tokens represent single atomic values; they don't span logical structures.

Also, UNKNOWN_CHAR will be a single char no matter what. If you define it as .+?, it will still match exactly one char, since it will try to match as few chars as possible, and that minimum is one char.
Non-greedy quantifiers make sense when something follows them. For instance in the expression .+? '#', the .+? will be forced to consume characters until it encounters a # sign. If the .+? expression is alone, it won't have to consume more than a single character to match, and therefore will be equivalent to ..
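The same non-greedy semantics exist in ordinary regex engines, so the difference is easy to see outside of ANTLR. A quick Java illustration (the strings are made up for the demo; this is not ANTLR-specific):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonGreedyDemo {
    public static void main(String[] args) {
        // A lone non-greedy .+? matches as little as possible: one char.
        Matcher lone = Pattern.compile(".+?").matcher("abcdef");
        if (lone.find()) {
            System.out.println(lone.group()); // "a"
        }

        // Followed by '#', the same .+? is forced to consume
        // everything up to (and here including) the '#'.
        Matcher anchored = Pattern.compile(".+?#").matcher("abcdef#");
        if (anchored.find()) {
            System.out.println(anchored.group()); // "abcdef#"
        }
    }
}
```

This mirrors the lexer behavior described above: without a following expression to push it along, `.+?` is equivalent to `.`.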

I use the following code in the lexer (.NET ANTLR):

partial class MyLexer
{
    public override IToken Emit()
    {
        CommonToken token;
        RecognitionException ex;

        switch (Type)
        {
            case UNTERMINATED_STRING:
                Type = STRING;
                token = (CommonToken)base.Emit();
                ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
                ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_STRING, Line, Column, "Unterminated string: " + GetTokenTextForDisplay(token), ex);
                return token;

            case UNTERMINATED_COMMENT_INLINE:
                Type = COMMENT_INLINE;
                token = (CommonToken)base.Emit();
                ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
                ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_COMMENT_INLINE, Line, Column, "Unterminated comment: " + GetTokenTextForDisplay(token), ex);
                return token;

            default:
                return base.Emit();
        }
    }

    // ...
 }

Notice that when the lexer encounters a bad token type, it explicitly changes it to a valid token type, so the parser can actually make sense of it.

Now, it is the job of the parser to identify bad structure. ANTLR is smart enough to perform single-token deletion and single-token insertion while trying to resynchronize itself with an invalid input. This is also the reason why I'm letting UNKNOWN_CHAR slip through to the parser, so it can discard it with an error message.

Just take the errors it generates and alter them in order to present something nicer to the user.

So, just make your groups into a parser rule.
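A sketch of what that split could look like for the text in the question. Rule names like `statement` and `assignment` are illustrative choices, not something from the original grammars:

```antlr
// Parser rules: structure (including groups) belongs here
file       : statement* EOF ;
statement  : assignment | group ;
group      : GROUP ID OPEN_BRACE statement* CLOSE_BRACE ;
assignment : ID EQUALS (INT | STRING) ;

// Lexer rules: one token per atomic item
GROUP       : 'Group' ;
OPEN_BRACE  : '{' ;
CLOSE_BRACE : '}' ;
EQUALS      : '=' ;
STRING      : '"' ~["]* '"' ;  // no escapes, per the question's simplification
INT         : [0-9]+ ;
ID          : [a-z][a-z0-9]* ;
WS          : [ \t\r\n]+ -> skip ;

// Error tokens go last
UNTERMINATED_STRING : '"' ~["\r\n]* ;
UNKNOWN_CHAR        : . ;
```

With this layout, a malformed group is a parser-level error with full recovery support, while the lexer only flags atomic problems like an unterminated string or a stray character.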


An example:

Consider the following input:

Group ,ygroup {

Here, the , is clearly a typo (user pressed , instead of m).

If you use UNKNOWN_CHAR: .; you will get the following tokens:

  • Group of type GROUP
  • , of type UNKNOWN_CHAR
  • ygroup of type ID
  • { of type '{'

The parser will be able to figure out that the UNKNOWN_CHAR token needs to be deleted and will correctly match a group (defined as GROUP ID '{' ...).

ANTLR will insert so-called error nodes at the points where it finds unexpected tokens (in this case between GROUP and ID). These nodes are then ignored for the purposes of parsing, but you can retrieve them with your visitors/listeners to handle them (you can use a visitor's VisitErrorNode method for instance).

answered by Lucas Trzesniewski
  • I agree with you on making the groups into parser rules, I think that is clearer. The example you give above, where you know that it is meant to be an unterminated comment, seems a good way to tell the parser about expected lexer errors. Okay, I understand the GROUP_ERR and ID problem, but then how can ANTLR report which token it has replaced/inserted instead of the current token? – Har Jul 23 '14 at 09:37
  • Could you also elaborate as to why "." would be more ideal than ".+" (maybe there isn't a big difference between the two)? My logic behind it is: say your input is mostly wrong; that would result in many characters, whereas grouping those characters could provide you a single token which you could analyze further, rather than stitching it together manually and then analyzing it. – Har Jul 23 '14 at 10:32
  • I've updated my answer. I gave you a bad explanation for the `.+?` and corrected it. I also added an example at the end. – Lucas Trzesniewski Jul 23 '14 at 12:29
  • I see, I misunderstood the greedy operator, I thought it meant non-greedy in terms of the rest of the rules i.e. get more characters as long as no other rule matches. – Har Jul 23 '14 at 12:54
  • Could the unterminated string also be the other way around? UNTERMINATED_STRING_2: ('\\' ["\\] | ~["\\\r\n])* '"' ; – Har Aug 08 '14 at 11:15
    @Har Not really since the lexer wouldn't know where to begin the token and it would interfere with normal tokens. Tokens of this type would be really long and would create a mess (I suppose they could win over real `STRING` tokens). Lexing is done left-to-right. – Lucas Trzesniewski Aug 08 '14 at 11:35