21

I've been looking through the ANTLR v3 documentation (and my trusty copy of "The Definitive ANTLR reference"), and I can't seem to find a clean way to implement escape sequences in string literals (I'm currently using the Java target). I had hoped to be able to do something like:

fragment 
ESCAPE_SEQUENCE
    : '\\' '\'' { setText("'"); }
    ;

STRING  
    : '\'' (ESCAPE_SEQUENCE | ~('\'' | '\\'))* '\''
      { 
        // strip the quotes from the resulting token
        setText(getText().substring(1, getText().length() - 1));
      } 
    ;

For example, I would want the input token "'Foo\'s House'" to become the String "Foo's House".

Unfortunately, the setText(...) call in the ESCAPE_SEQUENCE fragment sets the text for the entire STRING token, which is obviously not what I want.

Is there a way to implement this grammar without adding a method to go back through the resulting string and manually replace escape sequences (e.g., with something like setText(escapeString(getText())) in the STRING rule)?

Sam Martin
  • 1,030
  • 1
  • 8
  • 16

4 Answers4

16

Here is how I accomplished this in the JSON parser I wrote.

STRING      
@init{StringBuilder lBuf = new StringBuilder();}
    :   
           '"' 
           ( escaped=ESC {lBuf.append(getText());} | 
             normal=~('"'|'\\'|'\n'|'\r')     {lBuf.appendCodePoint(normal);} )* 
           '"'     
           {setText(lBuf.toString());}
    ;

fragment
ESC
    :   '\\'
        (   'n'    {setText("\n");}
        |   'r'    {setText("\r");}
        |   't'    {setText("\t");}
        |   'b'    {setText("\b");}
        |   'f'    {setText("\f");}
        |   '"'    {setText("\"");}
        |   '\''   {setText("\'");}
        |   '/'    {setText("/");}
        |   '\\'   {setText("\\");}
        |   ('u')+ i=HEX_DIGIT j=HEX_DIGIT k=HEX_DIGIT l=HEX_DIGIT
                   {setText(ParserUtil.hexToChar(i.getText(),j.getText(),
                                                 k.getText(),l.getText()));}

        )
    ;
Peter Štibraný
  • 32,463
  • 16
  • 90
  • 116
Bruno Ranschaert
  • 7,428
  • 5
  • 36
  • 46
  • 4
    I used this approach, but note that I had to append "getText()" instead of "escaped.getText()" at each step. The fragment writes the unescaped text to the entire STRING token, which getText() returns. For me, escaped.getText() returns the original fragment with backslashes intact. – Sam Martin Mar 20 '09 at 14:39
6

For ANTLR4, Java target and standard escaped string grammar, I used a dedicated singleton class : CharSupport to translate string. It is available in antlr API :

STRING          :   '"' 
                (   ESC  
                |   ~('"'|'\\'|'\n'|'\r') 
                )* 
                    '"' { 
                        setText( 
                            org.antlr.v4.misc.CharSupport.getStringFromGrammarStringLiteral(
                                getText()
                            )
                        ); 
                    }
                ;

As I saw in V4 documentation and by experiments, @init is no longer supported in lexer part!

jeanmi
  • 161
  • 1
  • 3
4

Another (possibly more efficient) alternative is to use rule arguments:

STRING
@init { final StringBuilder buf = new StringBuilder(); }
:
    '"'
    (
    ESCAPE[buf]
    | i = ~( '\\' | '"' ) { buf.appendCodePoint(i); }
    )*
    '"'
    { setText(buf.toString()); };

fragment ESCAPE[StringBuilder buf] :
    '\\'
    ( 't' { buf.append('\t'); }
    | 'n' { buf.append('\n'); }
    | 'r' { buf.append('\r'); }
    | '"' { buf.append('\"'); }
    | '\\' { buf.append('\\'); }
    | 'u' a = HEX_DIGIT b = HEX_DIGIT c = HEX_DIGIT d = HEX_DIGIT { buf.append(ParserUtil.hexChar(a, b, c, d)); }
    );
Trevor Robinson
  • 15,694
  • 5
  • 73
  • 72
3

I needed to do just that, but my target was C and not Java. Here's how I did it based on answer #1 (and comment), in case anyone needs something alike:

QUOTE   :      '\'';
STR
@init{ pANTLR3_STRING unesc = GETTEXT()->factory->newRaw(GETTEXT()->factory); }
        :       QUOTE ( reg = ~('\\' | '\'') { unesc->addc(unesc, reg); }
                        | esc = ESCAPED { unesc->appendS(unesc, GETTEXT()); } )+ QUOTE { SETTEXT(unesc); };

fragment
ESCAPED :       '\\'
                ( '\\' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\\")); }
                | '\'' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\'")); }
                )
        ;

HTH.

  • This works fine except that the final result is transformed in upper-case. Do you know how to fix the grammar to return the same case? –  Sep 06 '19 at 16:45