Underscores seen as white spaces. Is it normal?

Question

In my grammar, I have this for white spaces:

WS:
    (' '|'\r'|'\t'|'\n') -> skip
;

However, the parser does not choke if I put an underscore instead of a space.

My-first-module_DEFINITIONS_::=

is recognized as

My-first-module DEFINITIONS ::=

Is there an option I have to set somewhere in the lexer?

Thanks

Here is the reduced grammar that helps reproduce what I see

grammar ASN;

/*--------------------- Module definition -------------------------------------------*/

/* ModuleDefinition (see 13 in ITU-T X.680 (08/2015)) */
moduleDefinition:  
    moduleIdentifier
    DEFINITIONS_LITERAL
    ASSIGN
    BEGIN_LITERAL
    END_LITERAL
;

moduleIdentifier: 
    UCASE_ID 
;



/*--------------------- LITERAL -----------------------------------------------------*/

DEFINITIONS_LITERAL:
    'DEFINITIONS'
;

BEGIN_LITERAL:
    'BEGIN'
;

END_LITERAL:
    'END'
;

ASSIGN:
    '::='
;

UCASE_ID:
    ('A'..'Z') ('-'('a'..'z'|'A'..'Z'|'0'..'9')|('a'..'z'|'A'..'Z'|'0'..'9'))* 
;


/* white-space (see 12.1.6 in ITU-T X.680 (08/2015)) */
WS:
    (' '|'\r'|'\t'|'\n') -> skip
;

and the example that should not be accepted by the parser:

My-first-module_DEFINITIONS_::= 
BEGIN 

END

EDIT: I realize my problem is due to the fact I am using JUnit to run my test and I just check the syntax errors found by the parser. Here is the code, including Bart's answer, that makes the test fail if the lexer has issues ...

// load test data
InputStream inStream = getClass().getClassLoader().getResourceAsStream(resourceName);

if (inStream == null) {
    throw new RuntimeException("Resource not found: " + resourceName);
}

// create a CharStream that reads from the test resource
CharStream input = CharStreams.fromStream(inStream);

// create a lexer that feeds off of input CharStream
ASNLexer lexer = new ASNLexer(input);
// make lexer errors fail the test instead of only being printed
lexer.addErrorListener(new BaseErrorListener() {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
        throw new RuntimeException(e);
    }
});
// create a buffer of tokens pulled from the lexer
TokenStream tokens = new CommonTokenStream(lexer);
// create a parser that feeds off the tokens buffer
ASNParser parser = new ASNParser(tokens);
parser.moduleDefinition(); // begin parsing at moduleDefinition rule
assert(0 == parser.getNumberOfSyntaxErrors());

Could be that the lexer or parser recovers from it, could be something else. Impossible to say without seeing a "Minimal, Complete, and Verifiable example" (see: https://stackoverflow.com/help/mcve) — Bart Kiers, Feb 26 '18 at 15:46
I'll put my stuff online. From your answer, I gather this is not normal? — YaFred, Feb 26 '18 at 15:49
"I gather this is not normal?" - no, most likely ANTLR is performing as expected. — Bart Kiers, Feb 26 '18 at 16:34
"I'll put my stuff online" - no need to post hundreds of LOC, just enough to reproduce the problem. And please add the code to your question, not some off-site location. — Bart Kiers, Feb 26 '18 at 16:35
BTW, there's an existing ASN grammar here: https://github.com/antlr/grammars-v4/blob/master/asn/ASN.g4 (no idea how accurate it is though...) — Bart Kiers, Feb 26 '18 at 17:08
I tried it ... it does show that antlr has evolved so much that you can nearly copy paste the productions and have a parser (a revolution compared to the first versions of antlr I used against the ambiguous ASN.1 grammar). However, as soon as you try to parse fairly simple ASN.1 specifications, you find the limits of this grammar. The only way (in my opinion) is to start the grammar from scratch and create a set of unit tests to make sure you don't break it when you enrich ... — YaFred, Feb 26 '18 at 18:00
"The only way (in my opinion) is to start the grammar from scratch and create a set of unit tests to make sure you don't break it [...]" oh-so true. This is probably the case in many of the user-contributed grammars. Good luck! — Bart Kiers, Feb 26 '18 at 18:03

score 1 · Accepted Answer · answered Feb 26 '18 at 17:01

The lexer recovers from the unexpected input. You can see this by running this class:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class Main {

  public static void main(String[] args) {

    String source = "My-first-module_DEFINITIONS_::= \n" +
        "BEGIN \n" +
        "\n" +
        "END";

    ASNLexer lexer = new ASNLexer(CharStreams.fromString(source));
    ASNParser parser = new ASNParser(new CommonTokenStream(lexer));
    parser.moduleDefinition();
  }
}

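which prints the following to stderr. The lexer then drops each offending character and carries on, so the parser never sees the underscores and reports no syntax errors of its own: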

line 1:15 token recognition error at: '_'
line 1:27 token recognition error at: '_'
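
To make the recovery behaviour concrete, here is a rough plain-Java imitation of what the lexer does with that input. It is only a sketch and does not use ANTLR at all: the class name LexerRecoverySketch and the method scan are made up for illustration, the token rules are re-expressed as java.util.regex patterns, and the only behaviours it assumes are "longest match wins, earlier rule wins ties, an unmatched character is reported and skipped".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the generated ASNLexer; NOT ANTLR code.
class LexerRecoverySketch {

    // Token rules of the reduced grammar, kept in grammar order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("DEFINITIONS_LITERAL", Pattern.compile("DEFINITIONS"));
        RULES.put("BEGIN_LITERAL", Pattern.compile("BEGIN"));
        RULES.put("END_LITERAL", Pattern.compile("END"));
        RULES.put("ASSIGN", Pattern.compile("::="));
        RULES.put("UCASE_ID", Pattern.compile("[A-Z](-?[a-zA-Z0-9])*"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]"));
    }

    final List<String> tokens = new ArrayList<>();
    final List<String> errors = new ArrayList<>();

    void scan(String input) {
        int pos = 0, line = 1, col = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            // Longest match wins; on a tie the rule defined first is kept.
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestRule == null) {
                // No rule matches: report the character, then skip it and continue.
                errors.add("line " + line + ":" + col
                        + " token recognition error at: '" + input.charAt(pos) + "'");
                bestLen = 1;
            } else if (!bestRule.equals("WS")) {
                tokens.add(bestRule); // WS is skipped, as in the grammar
            }
            // Advance the line/column counters over the consumed text.
            for (int i = pos; i < pos + bestLen; i++) {
                if (input.charAt(i) == '\n') { line++; col = 0; } else { col++; }
            }
            pos += bestLen;
        }
    }

    public static void main(String[] args) {
        LexerRecoverySketch sketch = new LexerRecoverySketch();
        sketch.scan("My-first-module_DEFINITIONS_::= \nBEGIN \n\nEND");
        sketch.errors.forEach(System.err::println);
        System.out.println(sketch.tokens);
    }
}
```

Run on the example input, the sketch records the same two messages as above, yet the token list is still UCASE_ID, DEFINITIONS_LITERAL, ASSIGN, BEGIN_LITERAL, END_LITERAL. That is exactly the sequence moduleDefinition expects, which is why checking only the parser's error count lets the input through.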

There are a couple of options here:

1. add a catch-all rule

Add such a rule at the end of your grammar:

Other
 : .
 ;

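and then handle Other in your parser as you see fit. If no parser rule references Other, the parser reports a syntax error whenever such a token shows up, so getNumberOfSyntaxErrors() is no longer 0.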

2. add custom `ErrorListener`

Do something like this:

lexer.addErrorListener(new BaseErrorListener(){
  @Override
  public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
    throw new RuntimeException(e);
  }
});

that will cause any errors in the lexer to throw a RuntimeException.

Note that ANTLR4 supports a more compact notation for defining character sets:

UCASE_ID:
    [A-Z] ( '-'? [a-zA-Z0-9] )*
;

WS:
    [ \t\r\n] -> skip
;
