
The following ANTLR 3 grammar does not handle escaped double quotes in its STRING lexer rule. Any ideas why?

Expressions working:

  • \"hello\"
  • ref(\"hello\",\"hello\")

Expressions NOT working:

  • \"h\"e\"l\"l\"o\"
  • ref(\"hello\", \"hel\"lo\")

ANTLR 3 grammar file, runnable in ANTLRWorks:

grammar Grammar;

options
{
    output=AST;
    ASTLabelType=CommonTree;
    language=CSharp3;
}

public oaExpression
   : exponentiationExpression EOF!
   ;

exponentiationExpression
    :       equalityExpression ( '^' equalityExpression )*
    ;

equalityExpression
    :       relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
    ;

relationalExpression
    :       additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
    ;

additiveExpression
    :       multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
    ;

multiplicativeExpression
    :       primaryExpression ( ( '*' | '/' ) primaryExpression )*
    ;

primaryExpression
    :       '(' exponentiationExpression ')' | value | identifier (arguments )?
    ;

value
    :       STRING
    ;

identifier
    :       ID
    ;

expressionList
    :       exponentiationExpression ( ',' exponentiationExpression )*
    ;

arguments
    :       '(' ( expressionList )? ')'
    ;                      

/*
 * Lexer rules
 */

ID
    :       LETTER (LETTER | DIGIT)*
    ;

STRING
    :       '"' ( options { greedy=false; } : ~'"' )* '"'
    ;

WS
    :       (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=Hidden;}
    ;

/*
 * Fragment Lexer rules
 */

fragment
LETTER
    :       'a'..'z'
    |       'A'..'Z'
    |       '_'
    ;

fragment
EXPONENT
    :       ('e'|'E') ('+'|'-')? ( DIGIT )+
    ;

fragment
HEX_DIGIT
    :       ( DIGIT |'a'..'f'|'A'..'F')
    ;

fragment
DIGIT
    :       '0'..'9'
    ;
  • Why? I mean the rule only matches the entire input when there is no closing quote, in which case the input is invalid anyway, right? Could you clarify? – Bart Kiers Mar 05 '14 at 18:50
  • Hi @BartKiers, I've edited the question to provide the full grammar. I've tried your suggestions but they don't appear to work. –  Mar 06 '14 at 14:01

2 Answers


Try this:

STRING
 : '"'                          // an opening quote
   (                            // start group
     '\\' ~('\r' | '\n')        // an escaped char other than a line break char
     |                          // OR
     ~('\\' | '"'| '\r' | '\n') // any char other than '"', '\' and line breaks
   )*                           // end group and repeat zero or more times
   '"'                          // the closing quote
 ;

When I test the 4 different test cases from your comment:

"\"hello\""
"ref(\"hello\",\"hello\")"
"\"h\"e\"l\"l\"o\""
"ref(\"hello\", \"hel\"lo\")"

with the lexer rule I suggested:

grammar T;

parse
 : string+ EOF
 ;

string
 : STRING
 ;

STRING
 : '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
 ;

SPACE
 : (' ' | '\t' | '\r' | '\n')+ {skip();}    
 ;

ANTLRWorks' debugger produces the following parse tree:

[Image: parse tree from the ANTLRWorks debugger]

In other words: it works just fine (on my machine :)).
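If you want to sanity-check the rule outside ANTLR, here is a rough Python sketch. It assumes a regular expression is a fair stand-in for the lexer rule; the names `STRING`, `CASES` and `is_single_string` are illustrative only and not anything ANTLR generates. It runs the same four inputs through the pattern:

```python
import re

# Regex stand-in (an assumption, not ANTLR output) for the suggested rule:
#   '"' ( '\\' ~('\r'|'\n') | ~('\\'|'"'|'\r'|'\n') )* '"'
STRING = re.compile(r'"(?:\\[^\r\n]|[^\\"\r\n])*"')

# The four test inputs, written as raw strings so the backslashes
# reach the pattern unchanged.
CASES = [
    r'"\"hello\""',
    r'"ref(\"hello\",\"hello\")"',
    r'"\"h\"e\"l\"l\"o\""',
    r'"ref(\"hello\", \"hel\"lo\")"',
]


def is_single_string(text):
    """True if the whole input is exactly one STRING token."""
    return STRING.fullmatch(text) is not None


if __name__ == "__main__":
    for case in CASES:
        print(case, "->", is_single_string(case))
```

Each of the four inputs is accepted as a single string, whereas an input with an unescaped inner quote such as `"hel"lo"` is not.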

EDIT II

I've also used your grammar (with some small changes to make it Java compatible), replacing the incorrect STRING rule with the one I suggested:

oaExpression
   :        STRING+ EOF!
   //: exponentiationExpression EOF!
   ;

exponentiationExpression
    :       equalityExpression ( '^' equalityExpression )*
    ;

equalityExpression
    :       relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
    ;

relationalExpression
    :       additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
    ;

additiveExpression
    :       multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
    ;

multiplicativeExpression
    :       primaryExpression ( ( '*' | '/' ) primaryExpression )*
    ;

primaryExpression
    :       '(' exponentiationExpression ')' | value | identifier (arguments )?
    ;

value
    :       STRING
    ;

identifier
    :       ID
    ;

expressionList
    :       exponentiationExpression ( ',' exponentiationExpression )*
    ;

arguments
    :       '(' ( expressionList )? ')'
    ;                      

/*
 * Lexer rules
 */

ID
    :       LETTER (LETTER | DIGIT)*
    ;

//STRING
//    :       '"' ( options { greedy=false; } : ~'"' )* '"'
//    ;
STRING
    :       '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
    ;

WS
    :       (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;} /*{$channel=Hidden;}*/
    ;

/*
 * Fragment Lexer rules
 */

fragment
LETTER
    :       'a'..'z'
    |       'A'..'Z'
    |       '_'
    ;

fragment
EXPONENT
    :       ('e'|'E') ('+'|'-')? ( DIGIT )+
    ;

fragment
HEX_DIGIT
    :       ( DIGIT |'a'..'f'|'A'..'F')
    ;

fragment
DIGIT
    :       '0'..'9'
    ;

which parses the input from my previous example into an identical parse tree.
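For contrast, a rough Python sketch of what the commented-out rule does. It assumes the lazy regex `"[^"]*?"` behaves like `'"' ( options { greedy=false; } : ~'"' )* '"'`; `OLD_STRING` and `first_token` are illustrative names. Because `~'"'` also accepts a backslash, the token ends at the first `"` it meets, escaped or not:

```python
import re

# Regex stand-in (an assumption) for the original rule:
#   '"' ( options { greedy=false; } : ~'"' )* '"'
OLD_STRING = re.compile(r'"[^"]*?"')


def first_token(pattern, text):
    """Return the text of the first token matched at position 0, or None."""
    match = pattern.match(text)
    return match.group(0) if match else None


sample = r'"hel\"lo"'
token = first_token(OLD_STRING, sample)
leftover = sample[len(token):]

if __name__ == "__main__":
    print(token)     # "hel\"
    print(leftover)  # lo"
```

The backslash is consumed as an ordinary character, the quote after it closes the token, and the remaining `lo"` is left for the lexer to choke on.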

Bart Kiers
  • Hi @BartKiers, the above Lexer rule works in the following scenarios: **"\"hello\""**, **"ref(\"hello\",\"hello\")"** but doesn't work for escaped double quotes scenarios: **"\"h\"e\"l\"l\"o\""**, **"ref(\"hello\", \"hel\"lo\")"**. Thanks for your help. –  Mar 06 '14 at 16:48
  • @bjrave, I have no idea how you're testing it, but it works just fine. Good luck though. – Bart Kiers Mar 06 '14 at 19:15
  • I agree your cut down grammar works fine in ANTLRWorks. Using ANTLRWorks and the compiled parser via C# code, the STRING Lexer rule doesn't appear to work once integrated with the rest of my grammar file. I've added a link to my grammar file should you wish to take a look. –  Mar 07 '14 at 09:44
  • @bjrave, no, I won't have a look: your grammar is full of embedded code, which means I'd need to filter all of that out before I could give it a spin. Perhaps if you post your grammar without code in your original question, I or someone else might have a look. An external link is often not looked at. – Bart Kiers Mar 07 '14 at 16:28
  • I see your point. I've stripped the embedded code from the grammar file. –  Mar 10 '14 at 17:51
  • @bjrave, in the grammar you stripped, you are still using the (incorrect) `STRING` rule: `'"' ( options { greedy=false; } : ~'"' )* '"'`. See my second **EDIT** – Bart Kiers Mar 12 '14 at 13:27

This is how I do this with strings that can contain escape sequences (not just \" but any):

DOUBLE_QUOTED_TEXT
@init { int escape_count = 0; }:
    DOUBLE_QUOTE
    (
        DOUBLE_QUOTE DOUBLE_QUOTE { escape_count++; }
        | ESCAPE_OPERATOR .  { escape_count++; }
        | ~(DOUBLE_QUOTE | ESCAPE_OPERATOR)
    )*
    DOUBLE_QUOTE
    { EMIT(); LTOKEN->user1 = escape_count; }
;

The rule additionally counts the escapes and stores the count in the token. This lets the receiver see quickly whether it needs to do anything with the string (if user1 > 0). If you don't need that, remove the @init part and the actions.
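On the receiving side, the count lets you skip the unescape pass for most strings. A minimal Python sketch of that idea, not the actual target code: `unescape` and `string_value` are hypothetical helpers, and the sketch assumes that a backslash escape simply stands for the character after it (so `\n` is not turned into a line break) and that `""` stands for one quote:

```python
def unescape(body):
    """Resolve backslash escapes and doubled quotes in the text between
    the outer quotes. Assumption: an escape yields the escaped character
    itself; no special handling of sequences like \\n or \\t."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            out.append(body[i + 1])  # keep the character after the backslash
            i += 2
        elif ch == '"' and body[i + 1:i + 2] == '"':
            out.append('"')          # "" collapses to a single quote
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def string_value(token_text, escape_count):
    """Strip the outer quotes; only unescape when the lexer counted escapes."""
    body = token_text[1:-1]
    return body if escape_count == 0 else unescape(body)


if __name__ == "__main__":
    print(string_value('"plain"', 0))
    print(string_value(r'"hel\"lo"', 1))
    print(string_value('"a""b"', 1))
```

Here `escape_count` plays the role of the value stored in user1 by the rule above.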

Mike Lischke
  • I've tried your suggestion but it doesn't appear to make any difference. I've updated my question to include the full grammar. –  Mar 06 '14 at 14:18