How to make parser decide on which alternative to use, based on the rule in the previous step

Question

I'm using ANTLR 4 to parse a protocol's messages; let's call the protocol 'X'. Before extracting a message's information, I have to check that it complies with X's rules.

Suppose we have to parse X's 'FOO' message, which obeys the following rules:

The message starts with the 'messageIdentifier', which consists of the 3-letter reserved word FOO.
The message contains 5 fields, of which the first 2 are mandatory (must be included) and the remaining 3 are optional (may be omitted).
The message's fields are separated by the character '/'. If a field carries no information (that is, the field is optional and is omitted), its '/' character must be preserved. Optional fields and their associated field separators '/' at the end of the message may be omitted when no further information is reported within the message.
A message can span multiple lines. Each line must have at least one non-empty field (mandatory or optional). Moreover, each line must start with a '/' character and end with a non-empty field followed by a '\n' character. The exception is the first line, which always starts with the reserved word FOO.
Each of the message's fields also has its own rules regarding the accepted tokens, which are shown in the grammar below.

Sample examples of valid FOO messages:

1. FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
2. FOO/MANDATORY_1/MANDATORY2\n
   /OPT 1\n
   /HELLO\n
   /100\n
3. FOO/MANDATORY_1/MANDATORY2\n
4. FOO/MANDATORY_1/MANDATORY2//HELLO/100\n
5. FOO/MANDATORY_1/MANDATORY2///100\n
6. FOO/MANDATORY_1/MANDATORY2/OPT 1\n
7. FOO/MANDATORY_1/MANDATORY2 ///100\n

Sample examples of non-valid FOO messages:

1. FOO\n
   /MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
2. FOO/MANDATORY_1/\n
   MANDATORY2/OPT 1/HELLO/100\n
3. FOO/MANDATORY_1/MANDATORY2/OPT 1//\n
4. FOO/MANDATORY_1/MANDATORY2/OPT 1/\n
   /100\n
5. FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
6. FOO/MANDATORY_1/MANDATORY2/\n
7. FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100
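
For reference, the structural rules above (separators, line breaks, field count) can be restated as a small hand-written checker that is independent of ANTLR. This is only a sketch under assumptions: the function name `is_structurally_valid_foo` is made up for illustration, per-field character rules are deliberately ignored, and a field containing only spaces is treated as empty. Something like it could serve as a test oracle while experimenting with a grammar.

```python
def is_structurally_valid_foo(message):
    """Check only the structural FOO rules, not the per-field token rules."""
    # The message must be terminated by a newline.
    if not message.endswith("\n"):
        return False
    fields = []
    for index, raw_line in enumerate(message[:-1].split("\n")):
        line = raw_line.rstrip("\r")  # tolerate Windows line endings
        # The first line starts with FOO, every other line with '/'.
        prefix = "FOO/" if index == 0 else "/"
        if not line.startswith(prefix):
            return False
        parts = line[len(prefix):].split("/")
        # Every line must end with a non-empty field.
        if parts[-1].strip() == "":
            return False
        fields.extend(parts)
    # Two mandatory fields, at most five fields in total.
    if not 2 <= len(fields) <= 5:
        return False
    return all(field.strip() != "" for field in fields[:2])
```

Note that a message identical to the first valid example is accepted by this checker wherever it appears, since the rules give no reason to reject it.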

Below follows the grammar for the above message:

grammar Foo_Message;


/* Parser Rules */

startRule : 'FOO' mandatoryField_1 ;

mandatoryField_1 : '/' field_1 NL? mandatoryField_2 ;

mandatoryField_2 : '/' field_2 NL? optionalField_3 ;

optionalField_3 : '/' field_3 NL? optionalField_4
                | '/' optionalField_4
                | optionalField_4
                ;

optionalField_4 : '/' field_4 NL? optionalField_5
                | '/' optionalField_5
                | optionalField_5
                ;

optionalField_5 : '/' field_5 NL?
                | NL
                ;

field_1 : (A | N | B | S)+ ;

field_2 : (A | N)+ ;

field_3 : (A | N | B)+ ;

field_4 : A+ ;

field_5 : N+ ;

/* Lexer Rules */

A : [A-Z]+ ;

N : [0-9]+ ;

B : ' ' -> skip ;

S : [*&@#\-_<>?!]+ ;

NL : '\r'? '\n' ;

The above grammar correctly parses any input that complies with the FOO message's rules. The problem is that it also accepts a line that ends with the '/' character, which is invalid input according to the FOO message's rules. I understand that the second alternatives of rules 'optionalField_3', 'optionalField_4' and 'optionalField_5' lead to this behavior, but I can't figure out how to write a rule that prevents it. Somehow the parser needs to remember that it reached the 'optionalField_5' rule after seeing a non-omitted field in the previous rule. If I am not mistaken, this can't be done in ANTLR, as I can't check from which alternative of the previous rule I reached the current rule.

Is there a way to make the parser 'remember' this by some explicit option-rule? Or does my grammar need to be rearranged and if yes how?

Your "NL?" are in the wrong place. They should be before the literal '/'. I would rewrite the rules as one rule with three alternatives corresponding to one to three optionals. At the end of each alt, add 'NL'. You can then fold refactor into multiple rules and/or group (use parentheses) refactor. E.g., `s : FOO '/' f1 '/' f2 NL | FOO '/' f1 '/' f2 NL? '/' f3 NL | FOO '/' f1 '/' f2 NL? '/' f3 NL? '/' f4 NL | .....`. — kaby76, Feb 09 '22 at 12:26
Rewriting the 3 optional rules to one rule isn't correct according to the protocol as this allows the optional fields of the FOO message to come in any order. On the other hand writing an alternative for each possible combination (if I understand your example correctly) isn't feasible as the real FOO message and other messages have many more fields. — Jos, Feb 09 '22 at 12:45
Could you explain why "#5" in the bad examples is wrong? (I have a grammar that parses all other examples in the good and flags all in bad list completely correctly.) — kaby76, Feb 09 '22 at 13:31
Also, you should not reference `B` on the RHS of a parser rule. `B` is marked as skip, so the token will never be generated by the lexer. — kaby76, Feb 09 '22 at 14:48
You are trying to apply semantic in the syntax handling step. Instead you should just focus on the correct syntax when parsing the input, without complicating the grammar to follow a certain semantic and check semantic in a second step where you can also show descriptive error messages. — Mike Lischke, Feb 10 '22 at 07:51
First of all, thanks for the reply and your time! You are right, #5 in the list of bad examples isn't wrong; I accidentally copy-pasted it and forgot to remove it (I'll edit it to avoid confusion). As for the 'B' token rule, I mark it as skip because I don't want it to be generated. However, I need to know whether the creator of the message correctly or incorrectly included a space character in the corresponding field. — Jos, Feb 10 '22 at 08:06
@MikeLischke I think that the optional fields check isn't entirely semantic, as the real semantic checks come later where I want to see if the variable names, number ranges and codes used have a meaning taking into account the rest of the system. Also, the solution I wrote handles correctly the above optional fields problem in the parsing step — Jos, Feb 10 '22 at 09:11

score 1 · Answer 1 · answered Feb 09 '22 at 17:22

This grammar accepts all the valid examples, copied and pasted character for character from your post, and flags a parse error for all the "non-valid FOO messages".

grammar X;
file_ : s* EOF ;
s : FOO '/' f1 '/' f2 (
    | NL? '/' f3
    | NL? ('/' f3 NL? | '/' ) '/' f4
    | NL? ('/' f3 NL? | '/' ) ('/' f4 NL? | '/') '/' f5
 ) NL;
f1 : (A | N | B | S)+ ;
f2 : (A | N | B)+ ;
f3 : (A | N | B)+ ;
f4 : A+ ;
f5 : N+ ;
FOO: 'FOO';
A : [A-Z]+ ;
N : [0-9]+ ;
B : ' ';
S : [*&@#\-_<>?!]+ ;
NL : '\r'? '\n' ;

One can easily refactor this with folds and groupings.

In your previous grammar, lexer symbol B was marked as "skip". Skipped symbols do not appear on any token stream, and they should not be used directly on the right-hand side of a parser rule (see field_1 from your original grammar). It is innocuous because it is alted with other symbols, i.e. field_3:(A|N|B)+; will operate the same as field_3:(A|N)+;, but the rule field_3:(A|N|B)+; may be misleading to others because B will never appear in the parse tree. I felt that you wanted to include spaces in the fields, because perhaps you would want to compute the text for a field. Therefore, I changed the rule for B to appear as a token.
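
To illustrate the point about skipped tokens, here is a toy tokenizer in Python. It is not ANTLR and only loosely mimics the lexer rules A, N and B; the names `TOKEN_SPEC` and `tokenize` are invented for this sketch. It shows that once B is skipped, the token stream for "OPT 1" is indistinguishable from the one for "OPT1", so a parser rule cannot tell whether a space was present.

```python
import re

# Toy token definitions loosely modelled on the lexer rules A, N and B.
TOKEN_SPEC = [("A", r"[A-Z]+"), ("N", r"[0-9]+"), ("B", r" ")]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text, skipped=frozenset()):
    """Return (token type, text) pairs, dropping token types listed in `skipped`."""
    tokens = []
    position = 0
    while position < len(text):
        match = TOKEN_RE.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character at {position}: {text[position]!r}")
        # A skipped token is consumed but never handed to the parser.
        if match.lastgroup not in skipped:
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens
```

With `skipped={"B"}` the space is consumed silently; without it, a `("B", " ")` pair appears between the two other tokens.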

#5 from "non-valid FOO messages" is character for character identical to #1 from "valid FOO messages", as you can see here:

#1: FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n
#5: FOO/MANDATORY_1/MANDATORY2/OPT 1/HELLO/100\n

I don't understand your comment "this allows the optional fields of the FOO message to come in any order". The grammar here and the previous grammar I mentioned in the comments force field3 to occur before field4, which occurs before field5. There is no way that field5 could occur before a field3: the requisite number of '/' must appear before field5. Fields can be empty (see #4 of "valid FOO messages"). To handle that, the field specified is a grouping, e.g., ('/' f3 NL? | '/' ). For this grouping, the only sentential forms are "/", "/f3", "/f3\n". Note, this grouping can only occur with a succeeding field, so it is impossible for two "\n" to be next to each other.

The other way to approach this is to use semantic predicates or evaluate the semantic equations after the entire parse.

If there are many more fields, then you will likely not want to add alts for f6, f7, ...., f10000. In that case, I would suggest that you allow an arbitrary type for each field in the parse:

s : FOO '/' f1 '/' f2 (
    | NL? ('/' f NL? | '/' )* '/' f
 ) NL;

and validate the semantics afterwards.
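
A minimal Python sketch of such a validating step, assuming the parse result has already been flattened into a list of field strings with an empty string standing for an omitted field. The names `FIELD_PATTERNS` and `validate_fields` are illustrative and not part of ANTLR; the patterns are transcribed from rules f1 to f5 above.

```python
import re

# One pattern per field position, transcribed from f1..f5 in the grammar above.
FIELD_PATTERNS = [
    re.compile(r"[A-Z0-9 *&@#\-_<>?!]+"),  # f1 : (A | N | B | S)+
    re.compile(r"[A-Z0-9 ]+"),             # f2 : (A | N | B)+
    re.compile(r"[A-Z0-9 ]+"),             # f3 : (A | N | B)+
    re.compile(r"[A-Z]+"),                 # f4 : A+
    re.compile(r"[0-9]+"),                 # f5 : N+
]
MANDATORY_COUNT = 2


def validate_fields(fields):
    """Return a list of error messages; an empty list means the fields are fine."""
    errors = []
    if len(fields) < MANDATORY_COUNT:
        errors.append("missing mandatory field")
    if len(fields) > len(FIELD_PATTERNS):
        errors.append("too many fields")
    if fields and fields[-1] == "":
        errors.append("last field must not be empty")
    for number, (text, pattern) in enumerate(zip(fields, FIELD_PATTERNS), start=1):
        if text == "":
            # Only optional fields may be left empty.
            if number <= MANDATORY_COUNT:
                errors.append(f"field {number} is mandatory")
        elif not pattern.fullmatch(text):
            errors.append(f"field {number} has invalid content: {text!r}")
    return errors
```

Because each check is tied to a field position, the order of the fields stays fixed even though the grammar itself treats every field generically, and each failure can be reported with a descriptive message.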

I tried removing the 'B' token from the lexer rules as you said (because I mark it as -> skip), but when I supply input with spaces in fields that accept spaces, the parsing fails. As I said in the above comment, I don't want the 'B' token to appear in the parse tree; I only need to confirm whether a field can or cannot accept space characters. — Jos, Feb 10 '22 at 08:15
For the quote "this allows the optional fields ...", I thought you were suggesting to put in grammar all the unique mandatory and optional fields as a 'generic field' and separate the alternatives with '|' , which could lead to any rule being placed in any place. Moreover, the rules I have for the accepted tokens of a field aren't 'semantic' and need to be done on this level, as later in the semantic analysis I am going to check the logical validity of each correctly parsed field (e.g. the names of the fields, the numbers and everything else that makes sense with the rest of the system). — Jos, Feb 10 '22 at 08:34
As for the grammar you wrote above, I think the following example, which is a valid FOO message, doesn't pass the parse: 1) FOO/field_1/field_2. Finally, I would like to thank you, as I found the solution thanks to your comments and remarks. I am going to post it here for everyone who has a similar problem. I'll mark your post as an answer, as it helped me revisit my grammar. — Jos, Feb 10 '22 at 08:37

score 0 · Accepted Answer · edited Feb 10 '22 at 09:34

The solution was to refactor my grammar to include separate rules for a filled field (filledOptionalField) and an empty field (emptyOptionalField).

kaby76's post is marked as an answer as it helped towards the solution.

The refactored grammar:

grammar Foo_Message;


/* Parser Rules */

startRule : 'FOO' mandatoryField_1 endRule ;

mandatoryField_1 : '/' field_1 NL? mandatoryField_2 ;

mandatoryField_2 : '/' field_2 NL? (filledOptionalField_3 | emptyOptionalField_3 )? ;

filledOptionalField_3 : '/' field_3 NL? (filledOptionalField_4 | emptyOptionalField_4)? ;
emptyOptionalField_3 : '/' (filledOptionalField_4 | emptyOptionalField_4) ;

filledOptionalField_4 : '/' field_4 NL? filledOptionalField_5? ;
emptyOptionalField_4 : '/' filledOptionalField_5 ;

filledOptionalField_5 : '/' field_5 ;

endRule : NL;

field_1 : (A | N | B | S)+ ;

field_2 : (A | N)+ ;

field_3 : (A | N | B)+ ;

field_4 : A+ ;

field_5 : N+ ;

/* Lexer Rules */

A : [A-Z]+ ;

N : [0-9]+ ;

B : ' ' -> skip ;

S : [*&@#\-_<>?!]+ ;

NL : '\r'? '\n' ;
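
The effect of the filled/empty split can be checked outside ANTLR with a rough Python transliteration of the parser rules into one regular expression. This is a sketch under assumptions, not the grammar itself: spaces are removed up front to imitate `B -> skip`, the hyphen in S is taken as a literal character, token-level details (such as 'FOO' being its own token) are ignored, and all names are invented for illustration.

```python
import re

NL = r"(?:\r?\n)"
F1 = r"[A-Z0-9*&@#\-_<>?!]+"  # field_1 (spaces are removed before matching)
F2 = r"[A-Z0-9]+"             # field_2
F3 = r"[A-Z0-9]+"             # field_3 (spaces are removed before matching)
F4 = r"[A-Z]+"                # field_4
F5 = r"[0-9]+"                # field_5

# Each pattern mirrors the parser rule of the same name, built bottom-up.
FILLED_5 = rf"/{F5}"
EMPTY_4 = rf"/{FILLED_5}"
FILLED_4 = rf"/{F4}{NL}?(?:{FILLED_5})?"
EMPTY_3 = rf"/(?:{FILLED_4}|{EMPTY_4})"
FILLED_3 = rf"/{F3}{NL}?(?:{FILLED_4}|{EMPTY_4})?"
MANDATORY_2 = rf"/{F2}{NL}?(?:{FILLED_3}|{EMPTY_3})?"
MANDATORY_1 = rf"/{F1}{NL}?{MANDATORY_2}"
START = re.compile(rf"FOO{MANDATORY_1}{NL}")


def matches_refactored_grammar(message):
    """Return True if the message fits the filled/empty rule structure."""
    # Dropping spaces imitates the skipped B token.
    return START.fullmatch(message.replace(" ", "")) is not None
```

The key property is visible in the patterns: an empty field is always followed by another field, so a line or message can only end right after a filled one.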
