
I am new to ANTLR and I am currently writing a lexer for the Cool language in ANTLR 4. For more about the Cool language, please refer to http://theory.stanford.edu/~aiken/software/cool/cool-manual.pdf.

One rule of the Cool language that I was trying to implement was detecting EOF inside comments (which may be nested) or string constants and reporting it as an error.

This is the rule that I wrote:

ERROR :  '(*' (COMMENT|~['(*'|'*)'])*? (~['*)']) EOF {reportError("EOF in comment");} 
        |'"' (~[\n"])* EOF {reportError("EOF in string");};
fragment COMMENT     : '(*' (COMMENT|~['(*'|'*)'])*? '*)';

Here the fragment COMMENT is a recursive rule that I used.

The reportError function used above, which reports the error, is given below:

public void reportError(String errorString){
        setText(errorString);
        setType(ERROR);
}

But when I run it on the test file given below:

"Test String

It gives the following output:

line 1:0 token recognition error at: '"Test String\n'
#name "helloworld.cl"

Clearly, the string containing EOF was not recognised and the ERROR token was not produced.

Can someone help me point out where I am going wrong, as the EOF (and hence the error rule) is somehow not being detected by the lexer?

If something is not clear please do mention it.

2 Answers

'"' (~[\n"])* EOF

Here the (~[\n"])* part will stop at the first \n or ", or at the end of the file.

If it stops at a ", the rule does not match because the EOF does not match and that's what we want because the string literal is properly terminated.

If it stops at the end of file, then the subsequent EOF will match and you'll get an ERROR token. So that's also what you want.

But if it stops at a \n, the EOF will not match and you won't get an error token, even though you'd want one in this case. And since your input ends with a \n, that's exactly the scenario you're running into here. So in addition to EOF, you should also allow erroneous string literals to end in \n:

'"' (~[\n"])* ('\n' | EOF)
sepp2k

You don't need a dedicated ERROR rule. You can handle that specific situation with an unfinished string directly in your error listener. However, your comment rule shouldn't be a fragment, as it has to recognize a lexeme on its own that must be handled (fragment rules are meant to be used only within other lexer rules).

When the lexer reaches a string but cannot finish it due to the end of the input, you can get the offending input from the current lexer state in your error listener. You can then check that to see what exactly wasn't finished, like I do here for 3 quoted text types in MySQL:

void LexerErrorListener::syntaxError(Recognizer *recognizer, Token *, size_t line,
                                     size_t charPositionInLine, const std::string &, std::exception_ptr ep) {
  // The passed in string is the ANTLR generated error message which we want to improve here.
  // The token reference is always null in a lexer error.
  std::string message;
  try {
    std::rethrow_exception(ep);
  } catch (LexerNoViableAltException &) {
    Lexer *lexer = dynamic_cast<Lexer *>(recognizer);
    CharStream *input = lexer->getInputStream();
    std::string text = lexer->getErrorDisplay(input->getText(misc::Interval(lexer->tokenStartCharIndex, input->index())));
    if (text.empty())
      text = " "; // Should never happen.

    switch (text[0]) {
      case '/':
        message = "Unfinished multiline comment";
        break;
      case '"':
        message = "Unfinished double quoted string literal";
        break;
      case '\'':
        message = "Unfinished single quoted string literal";
        break;
      case '`':
        message = "Unfinished back tick quoted string literal";
        break;

      default:
        // Hex or bin string?
        if (text.size() > 1 && text[1] == '\'' && (text[0] == 'x' || text[0] == 'b')) {
          message = std::string("Unfinished ") + (text[0] == 'x' ? "hex" : "binary") + " string literal";
          break;
        }

        // Something else the lexer couldn't make sense of (likely there is no rule that accepts this input).
        message = "\"" + text + "\" is no valid input at all";
        break;
    }
    owner->addError(message, 0, lexer->tokenStartCharIndex, line, charPositionInLine,
                    input->index() - lexer->tokenStartCharIndex);
  }
}

This code was taken from the parser module in MySQL Workbench.
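
Since the question uses the Java runtime, a rough Java equivalent of the same idea could look like the sketch below. The class name CoolLexerErrorListener and the way the message is reported are made up for illustration; only the lexer-state inspection mirrors the C++ code above:

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.Interval;

public class CoolLexerErrorListener extends BaseErrorListener {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                            int charPositionInLine, String msg, RecognitionException e) {
        // For lexer errors the offending symbol is always null and the
        // exception is a LexerNoViableAltException.
        if (e instanceof LexerNoViableAltException) {
            Lexer lexer = (Lexer) recognizer;
            CharStream input = lexer.getInputStream();
            // Text from the start of the unfinished token up to the current position.
            String text = lexer.getErrorDisplay(
                    input.getText(Interval.of(lexer._tokenStartCharIndex, input.index())));

            String message;
            if (text.startsWith("(*")) {
                message = "Unfinished comment";
            } else if (text.startsWith("\"")) {
                message = "Unfinished string literal";
            } else {
                message = "\"" + text + "\" is not valid input";
            }
            System.err.println("line " + line + ":" + charPositionInLine + " " + message);
        }
    }
}

You would install it on the lexer before tokenizing, e.g. with lexer.removeErrorListeners(); lexer.addErrorListener(new CoolLexerErrorListener());.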

Mike Lischke