Lexing token ambiguity in ANTLR4

Question

I have an interesting problem parsing the following format (Conventional Commits), which is a convention for how git commit messages should be formatted.

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

The body is simply multi-line text where anything goes.
The footer consists of key-value pairs in the format `foobar: this is value`, separated by newlines.

Now, regarding my dilemma: what would be the best way to differentiate the body part from the footer part? According to the spec, they should be separated by two newline characters, so at first I thought this would be a good fit for ANTLR4 island grammars. I came up with something like what I posted here, but after some testing I discovered it is not flexible: it won't work if the body is absent (the body section is optional) but the footer is present.

I can think of a couple of ways to tie the grammar to a specific target language and implement this differentiation with semantic predicates, but ideally I would like to avoid that.

Now, I think the problem boils down to how to differentiate properly between the KEY and SINGLE_LINE tokens, which conflict (in the next iteration of my implementation):

mode Text;
KEY: [a-z][a-z_-]+;
SINGLE_LINE: ~[\n]+;

MULTI_LINE: SINGLE_LINE (NEWLINE SINGLE_LINE)*;

NEXT: NEWLINE NEWLINE;

What would be the best way to differentiate between KEY and SINGLE_LINE?
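
For illustration, the conflict can be reproduced outside ANTLR. ANTLR's lexer prefers the rule that matches the most input and, on a tie, the rule listed first. The Python sketch below imitates only that selection for the two rules above; `RULES` and `next_token` are hypothetical names, and the regexes are stand-ins for the lexer rules, not ANTLR itself.

```python
import re

# Python stand-ins for the two lexer rules from the question, in grammar order.
RULES = [
    ("KEY", re.compile(r"[a-z][a-z_-]+")),
    ("SINGLE_LINE", re.compile(r"[^\n]+")),
]

def next_token(text, pos=0):
    """Pick a token the way an ANTLR lexer would: longest match, then rule order."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so on equal length the earlier rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("signed-off: me"))  # SINGLE_LINE swallows the whole line
print(next_token("signed-off"))      # tie on length, so KEY wins
```

In this model KEY is only ever produced when the key is the entire line, which is why a footer line such as `signed-off: me` never yields a KEY token.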

The specification is ambiguous. A commit that ends with "\n\na: b" could interpret the `a: b` either as the last line of the body or the first line of the footer. — Raymond Chen, Feb 04 '23 at 16:16
Using ANTLR (or some other parser generator) is overkill for this IMO. — Bart Kiers, Feb 04 '23 at 16:29
@BartKiers I know, this can be solved by uber regex, for example. Or it shouldn't be too hard to parse it manually. In part, I am doing this as a kind of "programming kata" :) — Michael, Feb 04 '23 at 17:12

Bart Kiers · Accepted Answer · 2023-02-05T12:27:41.403

I'd do something like this:

ConventionalCommitsLexer.g4

lexer grammar ConventionalCommitsLexer;

options {
  caseInsensitive=true;
}

TYPE : [a-z]+;
LPAR : '(' -> pushMode(Scope);
COL  : ':' -> pushMode(Text);

fragment SPACE : [ \t];

mode Scope;

 SCOPE : ~[)]+;
 RPAR  : ')' SPACE* -> popMode;

mode Text;

 COL2   : ':' -> type(COL);
 SPACES : SPACE+ -> skip;
 WORD   : ~[: \t\r\n]+;
 NL     : SPACE* '\r'? '\n' SPACE*;

ConventionalCommitsParser.g4

parser grammar ConventionalCommitsParser;

options {
  tokenVocab=ConventionalCommitsLexer;
}

commit
 : TYPE scope? COL description ( NL NL body )? ( NL NL footer )? EOF
 ;

scope
 : LPAR SCOPE RPAR
 ;

description
 : word+
 ;

// A 'body' cannot start with `WORD COL`, hence: `WORD WORD`
body
 : WORD WORD word* ( NL word+ )*
 ;

footer
 : key_value ( NL key_value )* NL?
 ;

key_value
 : WORD COL word+
 ;

word
 : WORD
 | COL
 ;
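
The idea behind the `body` and `footer` rules (a footer line looks like `WORD COL word+`, a body does not start that way) can also be sketched by hand. The Python below is a rough approximation under stated assumptions, not a translation of the grammar: `split_commit` and `KEY_VALUE` are hypothetical names, sections are assumed to be separated by blank lines, and unlike the parser it checks every line of the last block instead of failing on a mixed block.

```python
import re

# Approximates `key_value : WORD COL word+`: a key without colons or
# whitespace, a colon, then at least one non-space character of value.
KEY_VALUE = re.compile(r"^[^:\s]+[ \t]*:[ \t]*\S")

def split_commit(message):
    """Split a commit message into header, optional body, optional footer."""
    text = message.replace("\r\n", "\n").strip()
    # A blank line (possibly holding only spaces or tabs) separates sections.
    blocks = re.split(r"\n(?:[ \t]*\n)+", text)
    header, rest = blocks[0], blocks[1:]
    footer = None
    # The last block is a footer only if each of its lines is a key-value pair.
    if rest and all(KEY_VALUE.match(line.strip()) for line in rest[-1].split("\n")):
        footer = rest.pop()
    body = "\n\n".join(rest) if rest else None
    return {"header": header, "body": body, "footer": footer}

only_footer = (
    "fix(some_module): this is a commit description\n"
    "\n"
    "Signed-off: john.doe@some.domain.com\n"
    "Another-Key: another value with : (colon)\n"
)
print(split_commit(only_footer))
```

Because the first body line in the examples contains `fixed: this` only after several words, it does not match `KEY_VALUE`, which corresponds to the `WORD WORD` start of the `body` rule.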

Parsing the input (body + footer):

fix(some_module): this is a commit description
    
Some more in-depth description of what was fixed: this
can be a multi-line text, not only a one-liner.

Signed-off: john.doe@some.domain.com
Another-Key: another value with : (colon)
Some-Other-Key: some other value

result: [image not preserved]

Parsing the input (only body):

fix(some_module): this is a commit description
    
Some more in-depth description of what was fixed: this
can be a multi-line text, not only a one-liner.

result: [image not preserved]

Parsing the input (only footer):

fix(some_module): this is a commit description

Signed-off: john.doe@some.domain.com
Another-Key: another value with : (colon)
Some-Other-Key: some other value

result: [image not preserved]

Oh wow, short and elegant! I ended up implementing a complex grammar with semantic predicates and conditional mode switching with "actions" lol. — Michael, Feb 10 '23 at 20:32
You used "WORD WORD" at the beginning of the "body" rule to differentiate between the body and footer sections, am I correct? With the exception of a single edge case, your solution works wonderfully! Thank you so much for such a detailed answer! (If "body" is a single word, the grammar won't recognize it correctly, right?) — Michael, Feb 10 '23 at 20:41
Edit: after adjusting body to be "WORD | WORD WORD | WORD WORD word* ( NL word+ )*", it seems to cover the edge-case I mentioned. Awesome! — Michael, Feb 10 '23 at 22:22
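
The effect of that adjustment can be checked with a small model of the `body` rule in isolation. This is only a sketch: the two rule variants are rewritten as Python regexes over token-type names, `token_types`, `ORIGINAL`, `ADJUSTED` and `matches` are hypothetical helpers, and the tokenizer is a crude imitation of the lexer's Text mode rather than ANTLR's real behaviour.

```python
import re

def token_types(text):
    """Crude imitation of Text mode: ':' is COL, a newline is NL, other runs are WORD."""
    kinds = []
    for piece in re.findall(r":|\n|[^: \t\r\n]+", text):
        kinds.append("COL" if piece == ":" else "NL" if piece == "\n" else "WORD")
    # Each token name is followed by one space so the regexes below stay simple.
    return "".join(kind + " " for kind in kinds)

# `word : WORD | COL ;`
WORD_ANY = r"(?:(?:WORD|COL) )"
# body : WORD WORD word* ( NL word+ )* ;
ORIGINAL = rf"WORD WORD {WORD_ANY}*(?:NL {WORD_ANY}+)*"
# body : WORD | WORD WORD | WORD WORD word* ( NL word+ )* ;
ADJUSTED = rf"WORD |WORD WORD |{ORIGINAL}"

def matches(rule, text):
    """True if the whole text is accepted by the given body-rule model."""
    return re.fullmatch(rule, token_types(text)) is not None

for sample in ["Refactoring", "Fix typo", "Note: something", "Cleanup\nsee the tracker"]:
    print(repr(sample), matches(ORIGINAL, sample), matches(ADJUSTED, sample))
```

In this model the adjusted rule accepts a one-word body, still rejects text that starts like a key-value pair, and still rejects a body whose first line is a single word followed by further lines, so that last case may need its own alternative.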
