Parsing Semantic Version using Antlr

Question

I translated the SemVer 2 BNF grammar to the following Antlr grammar.

grammar SemVer;

@header {
package com.me.semver;
}

semVer : normal ('-' preRelease)? ('+' build)? ;
normal : major '.' minor '.' patch ;
major : NUM ;
minor : NUM ;
patch : NUM ;
preRelease : PRE_RELEASE ('.' preRelease)* ;
build : BUILD ('.' build)*;
NUM : '0'
    | POSITIVE_DIGIT
    | POSITIVE_DIGIT DIGITS
    ;
BUILD : ALPHANUM
      | DIGITS
      ;
PRE_RELEASE : ALPHANUM
            | NUM
            ;
fragment
ALPHANUM : NON_DIGIT
         | NON_DIGIT CHARS
         | CHARS NON_DIGIT
         | CHARS NON_DIGIT CHARS
         ;
fragment
CHARS : CHAR+ ;
fragment
CHAR : DIGIT
     | NON_DIGIT
     ;
fragment
NON_DIGIT : LETTER
          | '-'
          ;
fragment
DIGITS : DIGIT+ ;
fragment
DIGIT : '0'
      | POSITIVE_DIGIT
      ;
fragment
POSITIVE_DIGIT : [1-9] ;
fragment
LETTER : [a-zA-Z] ;

But parsing 1.0.0-beta+exp.sha.5114f85 gives the following error:

line 1:4 mismatched input '0-beta' expecting NUM

The output from the listener is as follows:

Normal: 1.0.0-beta
Major: 1
Minor: 0
Patch: 0-beta
Build: exp.sha.5114f85
Build: sha.5114f85
Build: 5114f85

Clearly, the patch version is not what it should be. The correct output would have Patch = 0, Pre release = beta, and Build = exp.sha.5114f85.

How can I fix the grammar?

Bart Kiers · Accepted Answer · 2020-11-09T08:27:38.507

You have too many overlapping lexer rules. For example, the input 0 could be matched by any of the following 3 rules:

NUM : '0'
    | POSITIVE_DIGIT
    | POSITIVE_DIGIT DIGITS
    ;
BUILD : ALPHANUM
      | DIGITS
      ;
PRE_RELEASE : ALPHANUM
            | NUM
            ;

and since NUM is placed first, the input 0 will always become a NUM token. The lexer runs independently of the parser, so it doesn't matter what token the parser is trying to match: 0 will always be a NUM token.

This is just how ANTLR's lexer works:

1. it tries to match as many characters as possible for each token, and
2. when two or more lexer rules match the same number of characters, the one defined first "wins".

Given your grammar and the input "1.0.0-beta+exp.sha.5114f85", these tokens are created:

NUM                       `1`
null                      `.`
NUM                       `0`
null                      `.`
BUILD                     `0-beta`
null                      `+`
BUILD                     `exp`
null                      `.`
BUILD                     `sha`
null                      `.`
BUILD                     `5114f85`

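Notice that 0-beta is tokenized as a single BUILD token (rule #1): BUILD matches six characters, whereas NUM could only match the leading 0. PRE_RELEASE matches the same six characters, but BUILD is defined first (rule #2).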
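To see both rules at work outside ANTLR, here is a minimal Java sketch that imitates them with regular expressions. It is not how ANTLR is implemented. The class name LexerRuleDemo, the regex approximations of NUM, BUILD and PRE_RELEASE, and the single LITERAL rule standing in for the implicit '.', '-' and '+' tokens are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Imitates the two lexer rules: longest match first, then the earliest-defined rule. */
final class LexerRuleDemo {

    // Regex approximations of the question's lexer rules, kept in definition order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // ALPHANUM: identifier characters containing at least one non-digit.
        String alphanum = "[0-9A-Za-z-]*[A-Za-z-][0-9A-Za-z-]*";
        RULES.put("LITERAL", Pattern.compile("[.+-]")); // assumption: stands in for '.', '-', '+'
        RULES.put("NUM", Pattern.compile("0|[1-9][0-9]*"));
        RULES.put("BUILD", Pattern.compile(alphanum + "|[0-9]+"));
        RULES.put("PRE_RELEASE", Pattern.compile(alphanum + "|0|[1-9][0-9]*"));
    }

    /** Returns tokens formatted as RULE:text. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: on a tie, the rule defined earlier keeps the token.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("1.0.0-beta+exp.sha.5114f85")) {
            System.out.println(token);
        }
    }
}
```

For the input from the question, the fifth token comes out as BUILD:0-beta, matching the token list above: the parser never gets a chance to see a separate 0 and beta.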

What you should do is define lexer rules that do not overlap. In your case, that would mean defining these rules/tokens:

HYPHEN
 : '-'
 ;

PLUS
 : '+'
 ;

DOT
 : '.'
 ;

ZERO_DIGIT
 : '0'
 ;

POSITIVE_DIGIT
 : [1-9]
 ;

LETTER
 : [a-zA-Z]
 ;

Rules like DIGIT and DIGITS would then become parser rules instead:

digits
 : digit+
 ;

digit
 : ZERO_DIGIT
 | POSITIVE_DIGIT
 ;

I have several follow-up questions. 1. A `non-digit` is a `letter` or `-`; you seem to have replaced `non-digit` with `LETTER`, which is incorrect. 2. Numeric identifiers in a pre-release version [must not include leading zeroes](https://semver.org/#spec-item-9); however, no such restriction is placed on the numeric identifiers for build metadata. This subtle distinction is absent from your grammar (I think that's why the BNF used `digits` and `numeric identifier` separately). 3. What is the difference between your `valid_semver` rule and my `semVer` rule (I'm not an Antlr expert)? — Abhijit Sarkar, Nov 08 '20 at 22:29
(Since I exceeded comment length above)...4. Why create aliases like `build : dot_separated_build_identifiers`? A more common sense approach seems to rename `dot_separated_build_identifiers` and just use it. — Abhijit Sarkar, Nov 08 '20 at 22:31
I removed the grammar and emphasized the actual problem in your grammar. — Bart Kiers, Nov 09 '20 at 07:54
