In an earlier question I asked about extracting tags from arbitrary text. The solution provided there worked well, but there is one edge case I'd like to handle. To recap, I'm parsing arbitrary user-entered text and would like every occurrence of < or > to conform to valid tag syntax. Where an angle bracket isn't part of a valid tag, it should be escaped as &lt; or &gt;. The syntax I'm looking for is <foo#123>, where foo is text from a fixed list of entries and 123 is a number matching [0-9]+.
The parser:
parser grammar TagsParser;
options {
    tokenVocab = TagsLexer;
}
parse: (tag | text)* EOF;
tag: LANGLE fixedlist GRIDLET ID RANGLE;
text: NOANGLE;
fixedlist: FOO | BAR | BAZ;
The lexer:
lexer grammar TagsLexer;
LANGLE: '<' -> pushMode(tag);
NOANGLE: ~[<>]+;
mode tag;
RANGLE: '>' -> popMode;
GRIDLET: '#';
FOO: 'foo';
BAR: 'bar';
BAZ: 'baz';
ID: [0-9]+;
OTHERTEXT: . ;
This works well and successfully parses text such as:
<foo#123>
Hi <bar#987>!
<baz#1><foo#2>anythinghere<baz#3>
if 1 &lt; 2
It also correctly rejects the following when I use the BailErrorStrategy:
<foo123>
<bar#a>
<foo#123H>
<unsupported#123>
if 1 < 2
The last one correctly fails because < enters the tag mode and what follows doesn't match a supported tag format. However, I would also like to reject stray occurrences of > in the text, so the following should fail as well:
if 2 > 1
That text should be specified as if 2 &gt; 1 instead of having the raw angle bracket.
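As a way to pin down the expected behaviour, here is a minimal Python sketch, independent of ANTLR, that encodes the rule described above: every < or > must belong to a <name#digits> tag, and anything else has to be escaped. The names TAG_NAMES and is_valid are made up for illustration, and the regex is only a reference for the results I expect, not a replacement for the grammar:

```python
import re

# Assumption: the fixed list of tag names mirrors the FOO | BAR | BAZ rule.
TAG_NAMES = ("foo", "bar", "baz")

# The input is a sequence of items, each one either
#   - a complete tag of the form <name#digits>, or
#   - a single character that is not an angle bracket.
# Matching plain text one character at a time avoids nested quantifiers.
_VALID_INPUT = re.compile(
    r"(?:<(?:%s)#[0-9]+>|[^<>])*" % "|".join(map(re.escape, TAG_NAMES))
)


def is_valid(text: str) -> bool:
    """Return True when every '<' and '>' in text is part of a valid tag."""
    return _VALID_INPUT.fullmatch(text) is not None
```

With this sketch, is_valid("if 2 &gt; 1") holds while is_valid("if 2 > 1") does not, which is the outcome I would like the grammar to produce as well.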
How can I modify the grammar so that occurrences of > which aren't part of a valid tag fail to parse?