How to have shortest match first when implementing Markdown text styling operators in an ANTLR4 grammar?

Question

I found the following (simplified) grammar on the internet as I was looking for a solution to a problem where I had to parse syntax similar to Markdown.

grammar Markdown;

parse    :   stat+;

stat    :   bold
        |   text
        |   WS
        ;

text    :   TEXT|SPACE;

bold    :   ('**'stat*'**');

TEXT    :   [a-zA-Z0-9]+;

SPACE   :   ' ';

WS      :   [\t\r\n]+;

What I want is for ANTLR4 to do a kind of shortest match first on a sentence that looks like **bold1** not bold **bold2**. That is, bold1 should be bold, "not bold" should not be, and bold2 should be bold again.

However, because ANTLR4 matches greedily (longest match first), it parses the example as two nested bolds, which is wrong.

I have already thought about multiple very complicated solutions to this problem which I actually don't want to use. Is there a simple solution?

UPDATE: Apparently I simplified the example too much. Now here is an expanded grammar (that does not recognize markdown anymore) but which illustrates the problem I have. It is important to note that stat can also be a variable which is just a placeholder for any keyword my language can possibly contain.

grammar StyleParser;

parse    :   styled_stat+;

styled_stat : italic
            | bold
            | underline
            | stat
            ;

stat    :   variable
        |   text
        ;

variable: VARIABLE;
text    :   TEXT|SPACE;

italic  :   ITALIC (stat | italic_bold | italic_underline)* ITALIC;
italic_bold: BOLD (stat | italic_bold_underline)* BOLD;
italic_bold_underline: UNDERLINE stat* UNDERLINE;

italic_underline: UNDERLINE (stat | italic_underline_bold)* UNDERLINE;
italic_underline_bold: BOLD stat* BOLD;

bold    :   BOLD (stat | bold_italic | bold_underline)* BOLD;
bold_italic: ITALIC (stat | bold_italic_underline)* ITALIC;
bold_italic_underline: UNDERLINE stat* UNDERLINE;

bold_underline: UNDERLINE (stat | bold_underline_italic)* UNDERLINE;
bold_underline_italic: ITALIC stat* ITALIC;

underline   :   UNDERLINE (stat | underline_bold | underline_italic)* UNDERLINE;   
underline_italic: ITALIC (stat | underline_italic_bold)* ITALIC;
underline_italic_bold: BOLD stat* BOLD;

underline_bold: BOLD (stat | underline_bold_italic)* BOLD;
underline_bold_italic: ITALIC stat* ITALIC; 

SPACE   :   ' ';

VARIABLE : 'VAR';
TEXT    :   [a-zA-Z0-9]+;

ITALIC: '//';
BOLD: '==';
UNDERLINE: '__';

With this grammar I cannot nest the same style, but I can nest different styles. For example, it parses ==bold1 __underline //italic//__== not __underline__ //italic// bold ==VAR== correctly. The problem is that the number of rules grows exponentially with the number of styles you introduce, and I want to avoid this.

Why the distinction between `SPACE` and `WS`? Anyway, your problem is that your grammar does not reflect the fact that bold parts can't be nested inside each other. So the `bold` rule should not be mutually recursive with `stat`. — sepp2k, Mar 30 '21 at 10:57
I'm sorry for the confusion. I over-simplified the example. Please take another look at my update. — Tobias Marschall, Mar 30 '21 at 12:22

score 1 · Answer 1 · answered Mar 30 '21 at 19:08

Parsing Markdown is non-trivial. One approach is to:

1. lex the bits that are intrinsically non-ambiguous;
2. on lexer emit, evaluate the semantic context of each token and adjust it as appropriate;
3. parse the enhanced lexer stream, again to the extent that token sequences are intrinsically non-ambiguous;
4. tree-walk to annotate the tree or otherwise build data structures that fully describe the tree elements in Markdown syntax terms.

So, for the general case of a Markdown-styled WORD, defined as some string of text exclusive of qualifying attributes, the parser definition is:

word
    : attrLeft* 
      w=( WORD | ENTITY | UNICODE
        | URL  | URLTAG | SPAN | HTML 
        )
      attrRight*
    ;

attrLeft  : LBOLD | LITALIC | LSTRIKE | LDQUOTE | LSQUOTE ;
attrRight : RBOLD | RITALIC | RSTRIKE | RDQUOTE | RSQUOTE ;

In the lexer, define every attribute as a left attribute by default, and reserve token types for the right attributes and WORD:

tokens {
    WORD,
    RBOLD,
    RITALIC,
    RSTRIKE,
    RDQUOTE,
    RSQUOTE
}

// attributes
LBOLD   : Bold   ;
LITALIC : Italic ;
LSTRIKE : Strike ;
LDQUOTE : Quote  ;
LSQUOTE : Mark   ;

... 

// last line in the lexer
CHAR : EscChar | . ;

In the lexer superclass, override

public void emit(Token t) {...}

and decide whether:

- any particular left attribute should really be reassigned as a right attribute;
- a CHAR should be accumulated into the current WORD or should be added to a new WORD instance.
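The two decisions above can be sketched in plain Java without the ANTLR runtime. This is only an illustration under stated assumptions: the class EmitSketch, its toy lexer, and the LUNDER/RUNDER token names are hypothetical, the markers ==, // and __ are borrowed from the question, and the rule "a marker closes a style if that style is already open, otherwise it opens it" is an assumed heuristic. The answer does not say how the semantic context is actually evaluated, and a real implementation would do this inside the overridden emit(Token t).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java stand-in for the emit-time adjustment; no ANTLR classes are used. */
final class EmitSketch {
    // Marker spellings are taken from the question; the UNDER token names are hypothetical.
    private static final String[][] MARKERS = {
        {"==", "BOLD"}, {"//", "ITALIC"}, {"__", "UNDER"}
    };

    private final List<String> out = new ArrayList<>();
    private final Set<String> open = new HashSet<>();        // styles currently open
    private final StringBuilder word = new StringBuilder();  // WORD being accumulated

    /** Plays the role of emit(Token): called once per raw token, in input order. */
    void emit(String type, String text) {
        if (type.equals("CHAR")) {
            if (text.trim().isEmpty()) {
                flushWord();            // whitespace ends the current WORD
            } else {
                word.append(text);      // otherwise grow the current WORD
            }
            return;
        }
        flushWord();                    // an attribute also ends the current WORD
        String style = type.substring(1);   // "LBOLD" -> "BOLD"
        if (open.remove(style)) {
            out.add("R" + style);       // already open: retype left as right
        } else {
            open.add(style);
            out.add("L" + style);       // not open: keep the default left type
        }
    }

    /** Emits the accumulated characters as one WORD token, if there are any. */
    private void flushWord() {
        if (word.length() > 0) {
            out.add("WORD(" + word + ")");
            word.setLength(0);
        }
    }

    /** Returns the style name of the marker starting at index i, or null. */
    private static String markerAt(String input, int i) {
        for (String[] m : MARKERS) {
            if (input.startsWith(m[0], i)) {
                return m[1];
            }
        }
        return null;
    }

    /** Toy lexer: every marker is emitted as a left attribute, everything else as CHAR. */
    static List<String> tokenize(String input) {
        EmitSketch sketch = new EmitSketch();
        int i = 0;
        while (i < input.length()) {
            String style = markerAt(input, i);
            if (style != null) {
                sketch.emit("L" + style, input.substring(i, i + 2));
                i += 2;
            } else {
                sketch.emit("CHAR", input.substring(i, i + 1));
                i++;
            }
        }
        sketch.flushWord();
        return sketch.out;
    }
}
```

For the question's sentence written with == markers, EmitSketch.tokenize("==bold1== not bold ==bold2==") returns [LBOLD, WORD(bold1), RBOLD, WORD(not), WORD(bold), LBOLD, WORD(bold2), RBOLD], so the second marker closes the first bold instead of opening a nested one.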

Now a tree-walker can evaluate the sequences of words and handle the potentially multiple, overlapping, or nested attributes.
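As a rough illustration of that last step, the sketch below walks a flat token sequence and records which styles are active for each word. It is an assumption-laden simplification: StyleWalkSketch and the {type, text} token arrays are hypothetical, the LUNDER/RUNDER names do not appear in the answer, and a real walker would visit the word nodes of the ANTLR parse tree rather than a plain array. What it does show is that one set of active styles can replace a separate rule for every combination of styles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical walker over an already-adjusted token sequence. */
final class StyleWalkSketch {
    /**
     * Each token is {type, text}; attribute tokens carry an empty text.
     * Returns one entry per WORD, formatted as text followed by its active styles.
     */
    static List<String> annotate(String[][] tokens) {
        TreeSet<String> active = new TreeSet<>();   // styles open at this point, sorted
        List<String> result = new ArrayList<>();
        for (String[] t : tokens) {
            String type = t[0];
            if (type.equals("WORD")) {
                result.add(t[1] + active);          // e.g. "bold1[BOLD]"
            } else if (type.startsWith("L")) {
                active.add(type.substring(1));      // left attribute opens a style
            } else if (type.startsWith("R")) {
                active.remove(type.substring(1));   // right attribute closes it
            }
        }
        return result;
    }
}
```

Fed the tokens for ==bold1 __underline //italic//__== not (LBOLD, WORD bold1, LUNDER, WORD underline, LITALIC, WORD italic, RITALIC, RUNDER, RBOLD, WORD not), annotate returns [bold1[BOLD], underline[BOLD, UNDER], italic[BOLD, ITALIC, UNDER], not[]].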
