Antlr Lexer rules

Question

I've got a rule to match a string that looks like so:

STRING
    : '"' ( ~( '"' | '\\' ) | '\\' . )* '"'
    ;

I don't want the quotes to be part of the token's text. In ANTLR 2 I would just put '!' after each quote to tell ANTLR not to add it to the text.

Notice the '!' below.

STRING
    : '"'! ( ~( '"' | '\\' ) | '\\' . )* '"'!
    ;

However, in ANTLR 3 I can no longer do this; I get the following warning:

warning(149): Crv__.g:0:0: rewrite syntax or operator with no output option; setting output=AST

I don't know whether I can use a rewrite rule here, because I don't know how to write the match-everything wildcard '.' in one.

My only other thought is to grab the matched text and return it without the quotes, but I'm not sure how to do that as the token hasn't been created yet.

I'm using the C Antlr runtime. How can I accomplish this?

chollida · Accepted Answer · 2011-08-17T12:20:48.707

2

For posterity I'll mention how I ended up solving this.

I used an @after block to strip the quotes:

STRING
@after
{
    SETTEXT(GETTEXT()->subString(GETTEXT(), 1, GETTEXT()->len - 1));
}
: '"' ( ~( '"' | '\\' ) | '\\' . )* '"'
;
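
To make the index arithmetic in that action easier to check, here is a minimal sketch in plain C of the same operation. It does not use the ANTLR runtime; `strip_quotes` is a hypothetical helper, and it assumes (as the usage above implies) that the start index is inclusive and the end index is exclusive.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the ANTLR C runtime.
 * Copies src into dst without its first and last character, i.e. the
 * characters at indexes 1 .. len-2, which is what a (1, len-1) substring
 * yields when the end index is exclusive.
 * Returns 0 on success, -1 if src is not quoted or dst is too small. */
int strip_quotes(const char *src, char *dst, size_t dst_size)
{
    size_t len = strlen(src);

    if (len < 2 || src[0] != '"' || src[len - 1] != '"')
        return -1;

    size_t inner = len - 2;          /* length without the two quotes */
    if (inner + 1 > dst_size)        /* +1 for the terminating NUL */
        return -1;

    memcpy(dst, src + 1, inner);
    dst[inner] = '\0';
    return 0;
}
```

Called on the five characters `"abc"` (quotes included), this leaves `abc` in `dst`.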

edited Aug 17 '11 at 12:20

answered Aug 16 '11 at 20:47

chollida


You'll want to remove the exclamation marks in that case. Also, you're now only removing the quotes, but are leaving the backslashes that are possibly in there escaping other chars: I would expect them to be removed when the quotes are stripped from the token. – Bart Kiers Aug 17 '11 at 08:05
@Bart thanks! You're correct. In fact, the initial problem was that ANTLR 3 doesn't allow the exclamation mark :) I had retyped my answer from memory. I've updated my answer. – chollida Aug 17 '11 at 12:22
Isn't there a better solution than adding such a time-consuming operation? – Nicolas Thery Jan 10 '12 at 17:01
@Nicolas Possibly. Like I said, I posted the solution I ended up using. Since there was no other solution, this one ended up being the accepted one. Do you have another solution? – chollida Jan 11 '12 at 22:05
@chollida I posted my solution; I don't know which one is best. I guess I will roll back to your solution, as I don't like how c=CHAR is implemented. – Nicolas Thery Jan 12 '12 at 11:00
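
Bart's point about the leftover backslashes can be handled in the same pass as the quote stripping. The following is only a sketch in plain C, not ANTLR runtime code: `unquote` is a hypothetical helper, and it assumes that a backslash simply protects the next character (so `\"` becomes `"` and `\\` becomes `\`), which matches the `'\\' .` alternative in the rule but does not interpret sequences such as `\n`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the ANTLR C runtime.
 * Strips the surrounding quotes and drops each escaping backslash,
 * keeping the character that follows it.
 * Returns 0 on success, -1 if src is not quoted or dst is too small. */
int unquote(const char *src, char *dst, size_t dst_size)
{
    size_t len = strlen(src);
    size_t out = 0;

    if (dst_size == 0 || len < 2 || src[0] != '"' || src[len - 1] != '"')
        return -1;

    /* Walk the characters between the two quotes. */
    for (size_t i = 1; i + 1 < len; i++) {
        char c = src[i];

        /* A backslash followed by another inner character is an escape:
         * skip the backslash and keep the escaped character. */
        if (c == '\\' && i + 2 < len)
            c = src[++i];

        if (out + 1 >= dst_size)     /* leave room for the NUL */
            return -1;
        dst[out++] = c;
    }

    dst[out] = '\0';
    return 0;
}
```

In an actual grammar the result would still have to be handed back to the lexer (for example via the SETTEXT macro used above); how best to build the runtime string for that is left open here.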

score 0 · Answer 2 · answered Jan 12 '12 at 10:59

This is the solution I ended up using:

STRING          :       '"'         { \$s = ""; }
                (   '"' '"'         { \$s .= '"';}
                |   c=CHAR          { \$s .= \$c->gettext();}
                |   ' '             { \$s .= ' ';}
                )*
                '"'                 { \$this->setText(\$s); }
    ;



fragment CHAR       :   (ACCENT|SPECIAL|ALPHA|DIGIT);
fragment ACCENT     :   '\u00C0'..'\u00D6' | '\u00D9'..'\u00DD' | '\u00E0'..'\u00F6' |'\u00F9'..'\u00FD';
fragment SPECIAL    :   '.' | '!' | '-'| '?';
fragment ALPHA      :   'a'..'z' | 'A'..'Z';
fragment DIGIT      :   '0'..'9' ;

One minor difference is that I use a whitelist of characters, for security reasons.

The major difference is that I build the result string incrementally, discarding the " characters as I go.

I'm using the PHP target, which is why the dollar signs are escaped as \$. Do you know which approach is faster?

The biggest difference I see is that my solution uses the wildcard '.' to match any symbol, whereas you have to specify each allowed symbol in a list. For instance, your string can't currently contain many common punctuation characters, such as a colon ':', though you can fix this. You do have a good solution; I wish I had thought of it earlier. – chollida Jan 12 '12 at 18:41
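
For comparison with the C snippets in the accepted answer, the incremental, whitelist-based approach can be sketched in plain C as well. This is an illustration only, not the PHP target's generated code: `build_string` and `is_allowed` are hypothetical names, and the whitelist is simplified to ASCII (letters, digits, space and `. ! - ?`), leaving out the accented ranges of the ACCENT fragment.

```c
#include <stddef.h>
#include <string.h>

/* ASCII-only stand-in for the CHAR fragment plus the ' ' alternative. */
static int is_allowed(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == ' ' ||
           c == '.' || c == '!' || c == '-' || c == '?';
}

/* Hypothetical helper: builds the token text one character at a time.
 * A doubled quote inside the string yields a single quote; any character
 * outside the whitelist makes the whole string invalid.
 * Returns 0 on success, -1 on invalid input or if dst is too small. */
int build_string(const char *src, char *dst, size_t dst_size)
{
    size_t len = strlen(src);
    size_t out = 0;

    if (dst_size == 0 || len < 2 || src[0] != '"' || src[len - 1] != '"')
        return -1;

    for (size_t i = 1; i + 1 < len; i++) {
        unsigned char c = (unsigned char)src[i];

        if (c == '"') {
            /* Only the doubled form "" is legal between the delimiters. */
            if (i + 2 < len && src[i + 1] == '"')
                i++;
            else
                return -1;
        } else if (!is_allowed(c)) {
            return -1;
        }

        if (out + 1 >= dst_size)     /* leave room for the NUL */
            return -1;
        dst[out++] = (char)c;
    }

    dst[out] = '\0';
    return 0;
}
```

With this whitelist a string containing ':' is rejected, which is the limitation pointed out in the comment above; adding the character to `is_allowed` lifts it.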
