I found the following (simplified) grammar online while looking for a way to parse a syntax similar to Markdown:
grammar Markdown;
parse : stat+;
stat : bold
| text
| WS
;
text : TEXT|SPACE;
bold : ('**'stat*'**');
TEXT : [a-zA-Z0-9]+;
SPACE : ' ';
WS : [\t\r\n]+;
What I want is for ANTLR4 to do a kind of shortest match first on a sentence like **bold1** not bold **bold2**. That is, bold1 would be bold, the text not bold in the middle would not be, and bold2 would be bold again.
However, because ANTLR4 matches longest first, it parses the example as two nested bolds, which is wrong.
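To make the reading I am after concrete, here is a minimal sketch in Python (not ANTLR, and not a solution to the grammar problem). It assumes bold can never nest inside bold, so every `**` simply flips the bold state. The function name `toggle_bold` is made up for this illustration.

```python
def toggle_bold(sentence, delimiter="**"):
    """Return (text, is_bold) pairs, flipping the bold flag at every delimiter.

    Assumption for illustration only: the same style never nests in itself,
    so the first delimiter after an opening one always closes it.
    """
    segments = []
    # After splitting on the delimiter, pieces at odd positions sit between
    # an opening and a closing delimiter, so they are the bold ones.
    for index, piece in enumerate(sentence.split(delimiter)):
        if piece:  # skip the empty strings produced at the sentence edges
            segments.append((piece, index % 2 == 1))
    return segments


if __name__ == "__main__":
    for text, is_bold in toggle_bold("**bold1** not bold **bold2**"):
        print(repr(text), "bold" if is_bold else "plain")
```

This is the segmentation I would like the parse tree to reflect: two separate bold runs with plain text between them, rather than one bold nested in another.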
I have already thought of several very complicated solutions to this problem, which I would rather not use. Is there a simple solution?
UPDATE:
Apparently I simplified the example too much.
Here is an expanded grammar (it no longer recognizes Markdown) that illustrates the problem I have.
It is important to note that stat can also be a variable, which is just a placeholder for any keyword my language can possibly contain.
grammar StyleParser;
parse : styled_stat+;
styled_stat : italic
| bold
| underline
| stat
;
stat : variable
| text
;
variable: VARIABLE;
text : TEXT|SPACE;
italic : ITALIC (stat | italic_bold | italic_underline)* ITALIC;
italic_bold: BOLD (stat | italic_bold_underline)* BOLD;
italic_bold_underline: UNDERLINE stat* UNDERLINE;
italic_underline: UNDERLINE (stat | italic_underline_bold)* UNDERLINE;
italic_underline_bold: BOLD stat* BOLD;
bold : BOLD (stat | bold_italic | bold_underline)* BOLD;
bold_italic: ITALIC (stat | bold_italic_underline)* ITALIC;
bold_italic_underline: UNDERLINE stat* UNDERLINE;
bold_underline: UNDERLINE (stat | bold_underline_italic)* UNDERLINE;
bold_underline_italic: ITALIC stat* ITALIC;
underline : UNDERLINE (stat | underline_bold | underline_italic)* UNDERLINE;
underline_italic: ITALIC (stat | underline_italic_bold)* ITALIC;
underline_italic_bold: BOLD stat* BOLD;
underline_bold: BOLD (stat | underline_bold_italic)* BOLD;
underline_bold_italic: ITALIC stat* ITALIC;
SPACE : ' ';
VARIABLE : 'VAR';
TEXT : [a-zA-Z0-9]+;
ITALIC: '//';
BOLD: '==';
UNDERLINE: '__';
With this grammar I cannot nest the same style, but I can nest different styles. For example, it correctly parses ==bold1 __underline //italic//__== not __underline__ //italic// bold ==VAR==.
The problem is that the number of rules grows exponentially with the number of styles you introduce, and I want to avoid this.
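To put a number on that growth, here is a small Python sketch that enumerates the rule names this pattern needs: one rule per ordered nesting of distinct styles. The helper `style_rule_names` is hypothetical and only mirrors the naming scheme of the grammar above.

```python
from itertools import permutations


def style_rule_names(styles):
    """List one rule name per ordered nesting of distinct styles.

    Mirrors the grammar above: `italic`, `italic_bold`,
    `italic_bold_underline`, and so on. The same style never repeats
    within a single nesting chain.
    """
    names = []
    for depth in range(1, len(styles) + 1):
        for chain in permutations(styles, depth):
            names.append("_".join(chain))
    return names


if __name__ == "__main__":
    three = style_rule_names(["italic", "bold", "underline"])
    print(len(three))  # 15, matching the style rules in the grammar above
    four = style_rule_names(["italic", "bold", "underline", "strike"])
    print(len(four))   # 64
```

With three styles this gives the 15 style rules in the grammar above. Following the same pattern, a fourth style (the `strike` name is just an example) would already need 64.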