Antlr3 - Non Greedy Double Quoted String with Escaped Double Quote

Question

The following Antlr3 Grammar file doesn't cater for escaped double quotes as part of the STRING lexer rule. Any ideas why?

Expressions working:

\"hello\"
ref(\"hello\",\"hello\")

Expressions NOT working:

\"h\"e\"l\"l\"o\"
ref(\"hello\", \"hel\"lo\")

Antlr3 grammar file runnable in AntlrWorks:

grammar Grammar;

options
{
    output=AST;
    ASTLabelType=CommonTree;
    language=CSharp3;
}

public oaExpression
   : exponentiationExpression EOF!
   ;

exponentiationExpression
    :       equalityExpression ( '^' equalityExpression )*
    ;

equalityExpression
    :       relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
    ;

relationalExpression
    :       additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
    ;

additiveExpression
    :       multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
    ;

multiplicativeExpression
    :       primaryExpression ( ( '*' | '/' ) primaryExpression )*
    ;

primaryExpression
    :       '(' exponentiationExpression ')' | value | identifier (arguments )?
    ;

value
    :       STRING
    ;

identifier
    :       ID
    ;

expressionList
    :       exponentiationExpression ( ',' exponentiationExpression )*
    ;

arguments
    :       '(' ( expressionList )? ')'
    ;                      

/*
 * Lexer rules
 */

ID
    :       LETTER (LETTER | DIGIT)*
    ;

STRING
    :       '"' ( options { greedy=false; } : ~'"' )* '"'
    ;

WS
    :       (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=Hidden;}
    ;

/*
 * Fragment Lexer rules
 */

fragment
LETTER
    :       'a'..'z'
    |       'A'..'Z'
    |       '_'
    ;

fragment
EXPONENT
    :       ('e'|'E') ('+'|'-')? ( DIGIT )+
    ;

fragment
HEX_DIGIT
    :       ( DIGIT |'a'..'f'|'A'..'F')
    ;

fragment
DIGIT
    :       '0'..'9'
    ;

Why? I mean the rule only matches the entire input when there is no closing quote, in which case the input is invalid anyway, right? Could you clarify? — Bart Kiers, Mar 05 '14 at 18:50
Hi @BartKiers, I've edited the question to provide the full grammar. I've tried your suggestions but they don't appear to work. — , Mar 06 '14 at 14:01

Bart Kiers · Answer 1 · 2014-03-12T13:27:13.187

Try this:

STRING
 : '"'                          // an opening quote
   (                            // start group
     '\\' ~('\r' | '\n')        // an escaped char other than a line break char
     |                          // OR
     ~('\\' | '"'| '\r' | '\n') // any char other than '"', '\' and line breaks
   )*                           // end group and repeat zero or more times
   '"'                          // the closing quote
 ;
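
For intuition, here is a rough Python sketch (not ANTLR output) that approximates the original non-greedy rule and the rule above as regular expressions. It assumes the lexer receives the raw characters, backslashes included; the names NON_GREEDY and ESCAPE_AWARE are made up for this illustration.

```python
import re

# Approximate regex equivalents of the two lexer rules (illustrative only,
# this is not what ANTLR generates).
NON_GREEDY = re.compile(r'"[^"]*?"')                       # '"' ( ~'"' )* '"'
ESCAPE_AWARE = re.compile(r'"(?:\\[^\r\n]|[^\\"\r\n])*"')  # the rule suggested above

# Raw input as the lexer would see it, backslashes included (assumption).
text = r'"h\"e\"l\"l\"o"'

# The original rule ends the string at the first quote, escaped or not.
print(NON_GREEDY.match(text).group())

# The escape-aware rule consumes backslash + char as one unit and
# therefore reaches the real closing quote.
print(ESCAPE_AWARE.match(text).group())
```

The first match stops right after the first backslash-quote pair, so the rest of the input is left for the other lexer rules, none of which accept a backslash. The second match covers the whole input.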

When I test the 4 different test cases from your comment:

"\"hello\""
"ref(\"hello\",\"hello\")"
"\"h\"e\"l\"l\"o\""
"ref(\"hello\", \"hel\"lo\")"

with the lexer rule I suggested:

grammar T;

parse
 : string+ EOF
 ;

string
 : STRING
 ;

STRING
 : '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
 ;

SPACE
 : (' ' | '\t' | '\r' | '\n')+ {skip();}    
 ;

ANTLRWorks' debugger produces the following parse tree:

[Image: parse tree shown in the ANTLRWorks debugger]

In other words: it works just fine (on my machine :)).

EDIT II

And I've also used your grammar (making some small changes to make it Java compatible), replacing the incorrect STRING rule with the one I suggested:

oaExpression
   :        STRING+ EOF!
   //: exponentiationExpression EOF!
   ;

exponentiationExpression
    :       equalityExpression ( '^' equalityExpression )*
    ;

equalityExpression
    :       relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
    ;

relationalExpression
    :       additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
    ;

additiveExpression
    :       multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
    ;

multiplicativeExpression
    :       primaryExpression ( ( '*' | '/' ) primaryExpression )*
    ;

primaryExpression
    :       '(' exponentiationExpression ')' | value | identifier (arguments )?
    ;

value
    :       STRING
    ;

identifier
    :       ID
    ;

expressionList
    :       exponentiationExpression ( ',' exponentiationExpression )*
    ;

arguments
    :       '(' ( expressionList )? ')'
    ;                      

/*
 * Lexer rules
 */

ID
    :       LETTER (LETTER | DIGIT)*
    ;

//STRING
//    :       '"' ( options { greedy=false; } : ~'"' )* '"'
//    ;
STRING
    :       '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
    ;

WS
    :       (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;} /*{$channel=Hidden;}*/
    ;

/*
 * Fragment Lexer rules
 */

fragment
LETTER
    :       'a'..'z'
    |       'A'..'Z'
    |       '_'
    ;

fragment
EXPONENT
    :       ('e'|'E') ('+'|'-')? ( DIGIT )+
    ;

fragment
HEX_DIGIT
    :       ( DIGIT |'a'..'f'|'A'..'F')
    ;

fragment
DIGIT
    :       '0'..'9'
    ;

which parses the input from my previous example into an identical parse tree.
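
To check how the corrected STRING rule behaves next to the other lexer rules without running ANTLR, a small Python tokenizer can stand in for the lexer. This is only a sketch: TOKEN_SPEC, MASTER and tokenize are hypothetical names, the OP entry lumps together the literal tokens used in the parser rules, and whitespace is simply dropped instead of being sent to a hidden channel.

```python
import re

# Hypothetical token table mirroring the lexer rules of the grammar above.
TOKEN_SPEC = [
    ("STRING", r'"(?:\\[^\r\n]|[^\\"\r\n])*"'),    # escape-aware STRING rule
    ("ID",     r'[A-Za-z_][A-Za-z_0-9]*'),         # LETTER (LETTER | DIGIT)*
    ("OP",     r'==|!=|<>|>=|<=|[-+*/^=<>(),]'),   # literals from the parser rules
    ("WS",     r'[ \r\t\f\n]'),                    # same characters as the WS rule
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return (type, lexeme) pairs; whitespace is dropped."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

# Raw input with a real backslash in front of the inner quote (assumption).
for token in tokenize(r'ref("hello", "hel\"lo")'):
    print(token)
```

With this input the sketch yields six tokens (ID, OP, STRING, OP, STRING, OP), and the second string, including its escaped quote, arrives as a single STRING.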

Hi @BartKiers, the above Lexer rule works in the following scenarios: **"\"hello\""**, **"ref(\"hello\",\"hello\")"** but doesn't work for escaped double quotes scenarios: **"\"h\"e\"l\"l\"o\""**, **"ref(\"hello\", \"hel\"lo\")"**. Thanks for your help. — , Mar 06 '14 at 16:48
@bjrave, I have no idea how you're testing it, but it works just fine. Good luck though. — Bart Kiers, Mar 06 '14 at 19:15
I agree your cut down grammar works fine in ANTLRWorks. Using ANTLRWorks and the compiled parser via C# code, the STRING Lexer rule doesn't appear to work once integrated with the rest of my grammar file. I've added a link to my grammar file should you wish to take a look. — , Mar 07 '14 at 09:44
@bjrave, no, I won't have a look: your grammar is full of embedded code, which would mean I'd need to filter all of that out before I would be able to give it a spin. Perhaps if you post your grammar without code in your original question, I or someone else might have a look. An external link is often not looked at. — Bart Kiers, Mar 07 '14 at 16:28
I see your point. I've stripped the embedded code from the grammar file. — , Mar 10 '14 at 17:51
@bjrave, in the grammar you stripped, you are still using the (incorrect) `STRING` rule: `'"' ( options { greedy=false; } : ~'"' )* '"'`. See my second **EDIT** — Bart Kiers, Mar 12 '14 at 13:27

score 0 · Answer 2 · answered Mar 06 '14 at 07:52

This is how I do this with strings that can contain escape sequences (not just \" but any):

DOUBLE_QUOTED_TEXT
@init { int escape_count = 0; }:
    DOUBLE_QUOTE
    (
        DOUBLE_QUOTE DOUBLE_QUOTE { escape_count++; }
        | ESCAPE_OPERATOR .  { escape_count++; }
        | ~(DOUBLE_QUOTE | ESCAPE_OPERATOR)
    )*
    DOUBLE_QUOTE
    { EMIT(); LTOKEN->user1 = escape_count; }
;

The rule additionally counts the escapes and stores the count in the token. This allows the receiver to quickly see whether it needs to do anything with the string (if user1 > 0). If you don't need that, remove the @init part and the actions.
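
The same idea can be sketched outside ANTLR as a hand-written scanner in Python. The function name scan_double_quoted is hypothetical, and the sketch assumes DOUBLE_QUOTE is '"' and ESCAPE_OPERATOR is '\', since those fragments are not shown in the rule above.

```python
def scan_double_quoted(text, start=0):
    """Scan one double-quoted string beginning at text[start].

    Returns (end_index, escape_count), where end_index is one past the
    closing quote. The branches mirror the alternatives of the rule above.
    """
    if start >= len(text) or text[start] != '"':
        raise ValueError("expected an opening double quote")
    i, escapes = start + 1, 0
    while i < len(text):
        ch = text[i]
        if ch == '"':
            if i + 1 < len(text) and text[i + 1] == '"':
                escapes += 1           # doubled quote "" counts as one escape
                i += 2
                continue
            return i + 1, escapes      # a lone quote closes the string
        if ch == '\\':
            if i + 1 >= len(text):
                break                  # dangling backslash, no closing quote
            escapes += 1               # backslash plus any single character
            i += 2
            continue
        i += 1                         # ordinary character
    raise ValueError("unterminated string")

print(scan_double_quoted(r'"hel\"lo"'))  # (9, 1)
```

A caller can then skip unescaping whenever the returned count is 0, which is the purpose of storing the count in user1.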

I've tried your suggestion but it doesn't appear to make any difference. I've updated my question to include the full grammar. — , Mar 06 '14 at 14:18
