Antlr parser for custom requirement

Question

I have a rather peculiar requirement for parsing inputs with ANTLR. I would like to be able to parse expressions like these:

Correct Inputs

user name
user_name user-name
| EATALL any thing could come here/ok | EATALL ...

Invalid Inputs

user/name
user&name^face

Any text that comes after | EATALL and before the next | EATALL (if any) must be obtained as a single token. For other, simple inputs where | EATALL doesn't appear, only a valid combination of _, -, and [a-zA-Z0-9] is tokenized as one token. In pseudocode:

user name -> [user] [name]
user_name -> [user_name]
|EATALL user/name my user -> [user/name my user]

This already seems like an ambiguous case of tokenization to me. I am looking for suggestions on dealing with problems like this in ANTLR. Thanks in advance.

On first glance it looks as if your problem can be handled by a regex and doesn't require a context-free grammar. So I don't see the need for Antlr. — Richard Wеrеzaк, Apr 03 '13 at 04:50
This is just a very small part of the input parsing problem. I am specifically asking help regarding antlr. — consumer, Apr 03 '13 at 04:54

score 0 · Accepted Answer · answered Apr 03 '13 at 20:00

So, what have you tried? Is your question specific to Antlr 3 or can you use Antlr 4?

For Antlr 3, you can use semantic predicates to condition token rule selection. Antlr 4 dropped Antlr 3's syntactic predicates and gated semantic predicates, but it still supports plain semantic predicates and native code actions, which achieve essentially the same result. For example (Antlr 4 syntax, untested):

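lexer grammar eatall ;

// Option 1: a semantic predicate gates the simple rule.
ValidSimple : { isCurrentLineJustTEXTandWS() }? TEXT ;
// -- or --
// Option 2: match the whole simple line, then split it into tokens in an action.
// ValidSimple : TEXT ( WS TEXT)* EOL?  { emitEachTEXTasNewValidSimpleToken(); } ;

// Everything between two marks becomes one token; the action strips the marks.
ValidEatAll : IgnoreL .*? IgnoreR    { trimIgnoreLRTextfromTokenText(); } ;
Invalid     : WS+ | .*? EOL?         -> channel(HIDDEN) ;

IgnoreL : .*? MARK ;
IgnoreR : MARK .*? EOL? ;

fragment MARK : '| EATALL' ;
// One or more word characters, so that user_name is a single token.
fragment TEXT : [a-zA-Z0-9_-]+ ;
fragment EOL  : '\r'? '\n' ;
fragment WS   : [ \t] ;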
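
The helper names in the grammar above are placeholders, not part of Antlr. As a rough illustration of what they might do, here is a minimal Java sketch written as plain static methods on strings, so it can be tried without generating a lexer. The class name `EatAllHelpers` and all method names are hypothetical, and wiring them into a lexer's members section is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the helper names used in the grammar.
// They operate on plain strings, not on an ANTLR character stream.
final class EatAllHelpers {
    static final String MARK = "| EATALL";

    // True when the line holds only [a-zA-Z0-9_-] words separated by spaces or tabs.
    static boolean isLineJustTextAndWs(String line) {
        return line.matches("[ \\t]*[a-zA-Z0-9_-]+([ \\t]+[a-zA-Z0-9_-]+)*[ \\t]*");
    }

    // Splits a simple line into one token per word.
    static List<String> splitSimpleLine(String line) {
        List<String> tokens = new ArrayList<>();
        for (String word : line.trim().split("[ \\t]+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // Returns the trimmed text between the first MARK and the next MARK
    // (or the end of the line). Returns null when the line has no MARK.
    static String textAfterMark(String line) {
        int start = line.indexOf(MARK);
        if (start < 0) {
            return null;
        }
        start += MARK.length();
        int end = line.indexOf(MARK, start);
        if (end < 0) {
            end = line.length();
        }
        return line.substring(start, end).trim();
    }

    public static void main(String[] args) {
        System.out.println(isLineJustTextAndWs("user_name user-name"));
        System.out.println(isLineJustTextAndWs("user/name"));
        System.out.println(splitSimpleLine("user name"));
        System.out.println(textAfterMark("| EATALL user/name my user | EATALL rest"));
    }
}
```

In a real lexer the predicate would have to look ahead in the input stream rather than receive the line as a string; the sketch only pins down the intended behaviour.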

answered Apr 03 '13 at 20:00

GRosenberg

I am specifically talking in the context of Antlr v3. I am literally stuck. I don't know how to force the lexer to create one big token when EATALL appears and distinct tokens in the case of simple input. – consumer Apr 04 '13 at 05:14
As I said, use a semantic predicate to enable an 'EatAll' rule. Put that rule above the simple input rule. Use a native code action on the EatAll rule to trim the unwanted text from the token text. – GRosenberg Apr 04 '13 at 07:41
Use a [non-greedy wildcard](http://www.antlr.org/wiki/display/ANTLR4/Wildcard+Operator+and+Nongreedy+Subrules#WildcardOperatorandNongreedySubrules-NongreedyLexerSubrules) in the EatAll rule to consume the text between the MARKs. – GRosenberg Apr 04 '13 at 07:49
That link goes back to the Antlr4 docs. For Antlr4, .*? is nongreedy. For Antlr3.5 grammars, the .* construct is nongreedy by default. – GRosenberg Apr 04 '13 at 08:04
ok ... i will check your suggestions now. Thanks for your time. :) – consumer Apr 04 '13 at 08:37
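
To illustrate the nongreedy point from the comments: the sketch below uses `java.util.regex` purely as a stand-in to show how a nongreedy wildcard stops at the nearest closing mark while a greedy one runs to the last. It is not how Antlr's lexer works internally, and the class name `NongreedyDemo` and method `firstGroup` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// java.util.regex is used only to demonstrate greedy versus nongreedy matching;
// it is an analogy for the lexer's wildcard, not its implementation.
final class NongreedyDemo {
    // Returns the first capture group of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "| EATALL first part | EATALL second part | EATALL";
        String mark = "\\| EATALL";
        // Nongreedy: captures up to the nearest following mark.
        System.out.println("[" + firstGroup(mark + "(.*?)" + mark, input) + "]");
        // Greedy: captures up to the last mark on the line.
        System.out.println("[" + firstGroup(mark + "(.*)" + mark, input) + "]");
    }
}
```

With the nongreedy form each marked section can become its own token, which is what the EatAll rule needs.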
