I have a very peculiar requirement for parsing inputs with ANTLR. I would like to be able to parse expressions like:

Correct Inputs

  • user name
  • user_name user-name
  • | EATALL any thing could come here/ok | EATALL ...

Invalid Inputs

  • user/name
  • user&name^face

Any text that comes after | EATALL and before the next | EATALL (if there is one) must be returned as a single token. For simple inputs where | EATALL doesn't appear, only a valid combination of _, -, and [a-zA-Z0-9] is tokenized as one token. In pseudocode:

  • user name -> [user] [name]
  • user_name -> [user_name]
  • | EATALL user/name my user -> [user/name my user]
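
To pin down the behaviour I expect, here is a minimal reference sketch in plain Python (not ANTLR). The function names are made up for illustration, and two things are assumptions on my part: the marker may be written with or without a space after the bar, and any text before the first marker is treated as simple input.

```python
import re

# Marker as used in the examples; the optional whitespace is an assumption.
MARK = re.compile(r"\|\s*EATALL")
# A simple line: words made of letters, digits, '_' or '-', separated by spaces/tabs.
SIMPLE_LINE = re.compile(r"[A-Za-z0-9_-]+(?:[ \t]+[A-Za-z0-9_-]+)*")


def tokenize_simple(text):
    """Split simple input into one token per word, rejecting anything else."""
    text = text.strip()
    if not text:
        return []
    if not SIMPLE_LINE.fullmatch(text):
        raise ValueError(f"invalid input: {text!r}")
    return text.split()


def tokenize(line):
    """Tokenize one line: simple words, or one big token per EATALL segment."""
    parts = MARK.split(line)
    tokens = tokenize_simple(parts[0])  # text before the first marker (assumption)
    for segment in parts[1:]:
        segment = segment.strip()
        if segment:
            tokens.append(segment)      # everything after a marker is one token
    return tokens
```

With this sketch, "user name" gives two tokens, "| EATALL user/name my user" gives one, and "user/name" is rejected with a ValueError.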

This already seems like an ambiguous case of tokenization to me. I am looking for suggestions on how to deal with problems like this in ANTLR. Thanks in advance.

consumer
  • On first glance it looks as if your problem can be handled by a regex and doesn't require a context-free grammar. So I don't see the need for Antlr. – Richard Wеrеzaк Apr 03 '13 at 04:50
  • This is just a very small part of the input parsing problem. I am specifically asking for help with ANTLR. – consumer Apr 03 '13 at 04:54

1 Answer

So, what have you tried? Is your question specific to Antlr 3 or can you use Antlr 4?

For Antlr 3, you can use semantic predicates to condition token rule selection. Antlr 4 does not have syntactic predicates, but you can use semantic predicates and native code actions to achieve essentially the same result. For example (untested):

lexer grammar eatall ;

// The helper methods called below are placeholders; define them in @members.
ValidSimple : { isCurrentLineJustTEXTandWS() }? TEXT ;
// -- or, instead of the rule above --
// ValidSimple : TEXT ( WS+ TEXT )* EOL?  { emitEachTEXTasNewValidSimpleToken(); } ;

ValidEatAll : IgnoreL .*? IgnoreR    { trimIgnoreLRTextfromTokenText(); } ;
Invalid     : WS+ | .*? EOL?         -> channel(HIDDEN) ;

IgnoreL : .*? MARK ;
IgnoreR : MARK .*? EOL? ;

fragment MARK : '| EATALL' ;
fragment TEXT : [a-zA-Z0-9_\-]+ ;   // one or more word characters; '-' is escaped inside the set
fragment EOL  : '\r'? '\n' ;
fragment WS   : [ \t] ;
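
The grammar above calls a predicate helper and a trimming helper that it does not define. As a rough sketch of the logic they might implement, here it is in Python for brevity; in a real lexer these would be methods in the target language inside @members. The function names and exact behaviour are assumptions, not part of Antlr.

```python
import re

MARK = "| EATALL"  # same literal as the MARK fragment
# Optional padding around words of [a-zA-Z0-9_-] separated by spaces or tabs.
SIMPLE_LINE = re.compile(r"[ \t]*[A-Za-z0-9_-]+(?:[ \t]+[A-Za-z0-9_-]+)*[ \t]*")


def is_line_just_text_and_ws(line):
    """Predicate: True when the line holds only simple words and blanks."""
    return SIMPLE_LINE.fullmatch(line.rstrip("\r\n")) is not None


def trim_ignore_text(token_text):
    """Keep only the text between the first MARK and the next MARK (or line end)."""
    start = token_text.find(MARK)
    if start < 0:
        return token_text              # no marker: nothing to trim
    start += len(MARK)
    end = token_text.find(MARK, start)
    if end < 0:
        end = len(token_text)          # no closing marker: run to the end
    return token_text[start:end].strip()
```

The predicate decides whether the simple rule may fire at all, and the trim step runs after the big token has been matched, so the marker text never reaches the parser.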
GRosenberg
  • I am specifically asking about Antlr v3. I am literally stuck. I don't know how to force the lexer to create one big token when EATALL appears and distinct tokens for simple input. – consumer Apr 04 '13 at 05:14
  • As I said, use a semantic predicate to enable an 'EatAll' rule. Put that rule above the simple input rule. Use a native code action on the EatAll rule to trim the unwanted text from the token text. – GRosenberg Apr 04 '13 at 07:41
  • Use a [non-greedy wildcard](http://www.antlr.org/wiki/display/ANTLR4/Wildcard+Operator+and+Nongreedy+Subrules#WildcardOperatorandNongreedySubrules-NongreedyLexerSubrules) in the EatAll rule to consume the text between the MARKs. – GRosenberg Apr 04 '13 at 07:49
  • That link goes back to the Antlr4 docs. For Antlr4, .*? is nongreedy. For Antlr3.5 grammars, the .* construct is nongreedy by default. – GRosenberg Apr 04 '13 at 08:04
  • OK ... I will check your suggestions now. Thanks for your time. :) – consumer Apr 04 '13 at 08:37
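
The difference a non-greedy wildcard makes between two MARKs can be seen with ordinary regular expressions. This is only an analogy in Python's re module, not Antlr itself, and the sample line is made up:

```python
import re

# Hypothetical input with three markers on one line.
line = "| EATALL first/part | EATALL second part | EATALL"
mark = r"\| EATALL"

# Non-greedy: stop at the nearest following marker.
non_greedy = re.search(mark + r"(.*?)" + mark, line).group(1)

# Greedy: run on to the last marker on the line.
greedy = re.search(mark + r"(.*)" + mark, line).group(1)

print(repr(non_greedy))  # ' first/part '
print(repr(greedy))      # ' first/part | EATALL second part '
```

The non-greedy form is the one that matches the requirement here, since each marked segment should end at the next marker rather than swallow it.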