ANTLR4 grammar not correctly matching escaped quotes in strings

Question

I'm trying to create a grammar for a language that uses double quotes for strings and allows escaping of quotes with a backslash. I'm using ANTLR4 for parsing the input.

I've defined the following rule for matching strings:

STRING:
    '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;
fragment
ESC_SEQ
    :   '\\'
        (   // The standard escaped character set such as tab, newline, etc.
            [btnfr"'\\]
            |
        |   // A Java style Unicode escape sequence
            UNICODE_ESC
        |   // Invalid escape
            .
        |   // Invalid escape at end of file
            EOF
        )
    ;

fragment
UNICODE_ESC
    :   'u' (HEX_DIGIT (HEX_DIGIT (HEX_DIGIT HEX_DIGIT?)?)?)?
;

However, this rule doesn't seem to match strings that contain an escaped quote at the end of the string. For example, "test \"string\" that works" is parsed correctly, but "test string that does \"not work\"" is not. Other escapes such as \n work fine.

(I am expecting to see "test string that "works"" as output)

I've tried modifying the rule to escape the backslash in the quote character, like this:

STRING:
    '"' ( ESC_SEQ | ~('\\'|'"') )* '"' | ('\\' '"'))
fragment
ESC_SEQ
    :   '\\'
        (   // The standard escaped character set such as tab, newline, etc.
            [btnfr"'\\]
            |
        |   // A Java style Unicode escape sequence
            UNICODE_ESC
        |   // Invalid escape
            .
        |   // Invalid escape at end of file
            EOF
        )
    ;

fragment
UNICODE_ESC
    :   'u' (HEX_DIGIT (HEX_DIGIT (HEX_DIGIT HEX_DIGIT?)?)?)?
;
    ;

But this still doesn't work.

What am I doing wrong? How can I modify my grammar to correctly match strings with escaped quotes?

score 0 · Answer 1 · answered Apr 03 '23 at 02:30

The ESC_SEQ rule does not 'unescape' the sequence. It matches \", so \" is what you get in the token text.

See https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md for how you can rewrite or skip parts of the various tokens to fix it.
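As a rough illustration of what "unescaping" could look like outside the grammar, here is a minimal sketch that post-processes the text of a STRING token in plain Java. The class name StringUnescaper and the method unescape are hypothetical, not part of ANTLR, and the handling of incomplete \u escapes and unknown escapes is an assumption rather than anything the grammar prescribes.

```java
final class StringUnescaper {

    /** Strips the surrounding quotes from a STRING token's text and decodes its escapes. */
    static String unescape(String tokenText) {
        // Drop the opening and closing double quote matched by the STRING rule.
        String body = tokenText.substring(1, tokenText.length() - 1);
        StringBuilder out = new StringBuilder(body.length());

        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '\\' || i + 1 >= body.length()) {
                out.append(c);              // ordinary character (or a lone trailing backslash)
                continue;
            }
            char next = body.charAt(++i);   // the character after the backslash
            switch (next) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case 'u': {
                    // Look for up to four hex digits, mirroring UNICODE_ESC.
                    int start = i + 1;
                    int end = start;
                    while (end < body.length() && end - start < 4
                            && Character.digit(body.charAt(end), 16) >= 0) {
                        end++;
                    }
                    if (end - start == 4) {
                        out.append((char) Integer.parseInt(body.substring(start, end), 16));
                        i = end - 1;        // skip the digits just consumed
                    } else {
                        // Assumption: an incomplete escape is kept verbatim.
                        out.append("\\u");
                    }
                    break;
                }
                default:
                    // Covers \", \' and \\; unknown escapes pass through as the bare character.
                    out.append(next);
            }
        }
        return out.toString();
    }
}
```

With this sketch, calling StringUnescaper.unescape(token.getText()) on the token text "test string that does \"not work\"" would yield test string that does "not work", which matches the kind of output the question asks for.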

answered Apr 03 '23 at 02:30

Cine


HINT: This cannot be done in ANTLR alone; you need to fall back to a code handler for the various use cases. For example, turning \u1234 into a Unicode character requires you to tell ANTLR how it should be converted: 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT? { setText(Character.toString((char) Integer.parseInt(getText().substring(2), 16))); } – Cine Apr 03 '23 at 02:40
What do you mean by ESC_SEQ does not "unescape" the sequence? Sorry, I couldn't understand it. – Oguzhan Kose Apr 03 '23 at 23:05
@OguzhanKose Quite literally: you have a rule called ESC_SEQ which matches a '\' followed by other things, such as '\"', but it only matches it; it does not interpret it. Thus '\"' in the INPUT is '\"' in the OUTPUT. – Cine Apr 14 '23 at 06:27

score 0 · Answer 2 · answered Apr 03 '23 at 06:08

I cannot reproduce that. All of the following 4 strings match just fine:

""
"simple"
"test \"string\" that works"
"test string that does \"not work\""

Tested with the grammar:

lexer grammar StringLexer;

STRING
    :   '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

SPACE
    : [ \t\r\n] -> skip
    ;

fragment
ESC_SEQ
    :   '\\'
        (   // The standard escaped character set such as tab, newline, etc.
            [btnfr"'\\]
            |
        |   // A Java style Unicode escape sequence
            UNICODE_ESC
        |   // Invalid escape
            .
        |   // Invalid escape at end of file
            EOF
        )
    ;

fragment
UNICODE_ESC
    :   'u' (HEX_DIGIT (HEX_DIGIT (HEX_DIGIT HEX_DIGIT?)?)?)?
    ;

fragment
HEX_DIGIT
    :   [0-9a-fA-F]
    ;

and Java code:

String source = "\"\" \"simple\" \"test \\\"string\\\" that works\" \"test string that does \\\"not work\\\"\"";
StringLexer lexer = new StringLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();

for (Token t : stream.getTokens()) {
    System.out.printf("%-20s '%s'%n",
            StringLexer.VOCABULARY.getSymbolicName(t.getType()),
            t.getText().replace("\n", "\\n"));
}

which prints:

STRING               '""'
STRING               '"simple"'
STRING               '"test \"string\" that works"'
STRING               '"test string that does \"not work\""'
EOF                  '<EOF>'
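One thing that may be worth ruling out when reproducing this (a guess, since the question does not show how the input reaches the lexer) is Java's own string-literal escaping. In Java source, \" produces a bare quote with no backslash, so the lexer would only ever see the backslash if the literal is written as \\\" as in the test code above. The sketch below uses a hypothetical helper class, LiteralCheck, to make the difference visible.

```java
final class LiteralCheck {

    /** Counts the backslash characters actually present in a string at runtime. */
    static long countBackslashes(String s) {
        return s.chars().filter(ch -> ch == '\\').count();
    }

    public static void main(String[] args) {
        // \" in Java source is just a quote: a lexer would see "not work" with no backslash.
        String noBackslash = "\"not work\"";
        // \\\" in Java source is backslash + quote: a lexer would see \"not work\".
        String withBackslash = "\\\"not work\\\"";

        System.out.println(noBackslash + " has " + countBackslashes(noBackslash) + " backslashes");
        System.out.println(withBackslash + " has " + countBackslashes(withBackslash) + " backslashes");
        // prints:
        // "not work" has 0 backslashes
        // \"not work\" has 2 backslashes
    }
}
```

If the input comes from a file or another source instead of a Java literal, printing the raw characters in the same way before handing them to the lexer would show whether the backslashes survive.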

When I try to parse it, it takes the escaped double quote as the end of the string and starts another string from the original unescaped double quote. I can't understand why it works when the escaped quote is not at the end of the string. Do you have any idea about it? — Oguzhan Kose, Apr 03 '23 at 22:59
Update your question with a small class others can easily run that shows the problem. — Bart Kiers, Apr 04 '23 at 04:50
