
For some input, the parser presents a "Possible kinds of longer matches : { <EXPRESSION>, <TEXT> }", but for some odd reason it chooses the wrong one.

This is the source:


SKIP :
{
  " "  
| "\r"
| "\t"
| "\n"
}

TOKEN :
{
  < DOT : "." >
| < LBRACE : "{" >
| < RBRACE : "}" >
| < LBRACKET: "[" >
| < RBRACKET: "]" >
| < #LETTER : [ "a"-"z" ] >
| < #DIGIT : [ "0"-"9" ] >
| < #IDENTIFIER: < LETTER > (< LETTER >)* >
| < EXPRESSION : (< IDENTIFIER> < DOT > < IDENTIFIER> < DOT > < IDENTIFIER> ((< DOT > < IDENTIFIER> )* | < LBRACKET > (< DIGIT>)* < RBRACKET >)*)*>
| < TEXT : (( < DOT >)* ( < LETTER > )+ (< DOT >)*)* >
}

void q0() :
{Token token = null;}
{
    (
        < LBRACE > expression() < RBRACE >
    |   ( token = < TEXT >
            {
              getTextTokens().add( token.image );
            }
        )
    )* < EOF >
}


void expression() :
{Token token = null;}
{
  < EXPRESSION >
}

If we try to parse "a.bc.d" using this grammar it would say " FOUND A <EXPRESSION> MATCH (a.bc.d) "

My question is why did it choose to parse the input as an <EXPRESSION> instead of <TEXT>?

Also, how can I force the parser to choose the right path? I have tried countless LOOKAHEAD scenarios with no success.

The right path is for instance <TEXT> when using "a.bc.d" as input, and <EXPRESSION> for "{a.bc.d}".

Thanks in advance.

cocalars
  • Gunther has answered the first question: "why did it choose to parse the input as an <EXPRESSION> instead of <TEXT>?". The second question, "how can I force the parser to choose the right path?", is hard for us to answer, as you haven't defined "right". If you simply exchange the order of the productions for "EXPRESSION" and "TEXT", you may find that the token manager sometimes chooses TEXT when the "right" choice is EXPRESSION. – Theodore Norvell May 19 '13 at 14:27
  • In case exchanging TEXT and EXPRESSION is not sufficient, here are some suggestions: (0) Consider using lexical states; if TEXT and EXPRESSION apply in different lexical states, there is no ambiguity. (1) Consider doing more at the parser level and less at the token manager (lexical) level. (2) Consider using MORE to break up complex regular expressions. – Theodore Norvell May 19 '13 at 14:28
  • By the way, I have updated the text above with the definition for "right path". – cocalars May 19 '13 at 23:39

2 Answers


From the JavaCC FAQ:

If more than one regular expression describes the longest possible prefix, then the regular expression that comes first in the .jj file is used.

So a preference can be established by ordering ambiguous definitions accordingly.
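
The rule quoted above can be imitated in plain Java to see why "a.bc.d" comes out as EXPRESSION. This is only a rough sketch, not JavaCC itself: the class and method names are made up for illustration, the two patterns are hand-translated java.util.regex approximations of the EXPRESSION and TEXT definitions from the question, and empty matches are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of JavaCC.
class LongestMatchDemo {
    // Approximation of EXPRESSION; ((.ID)* | [digits])* is written as the
    // equivalent (.ID | [digits])* to keep the regex simple.
    static final Pattern EXPRESSION = Pattern.compile(
        "(?:[a-z]+\\.[a-z]+\\.[a-z]+(?:\\.[a-z]+|\\[[0-9]*\\])*)*");
    // Approximation of TEXT.
    static final Pattern TEXT = Pattern.compile("(?:\\.*[a-z]+\\.*)*");

    // Length of the longest non-empty prefix of input that the pattern
    // matches completely, or -1 if there is none.
    static int longestPrefix(Pattern p, String input) {
        for (int len = input.length(); len > 0; len--) {
            if (p.matcher(input).region(0, len).matches()) {
                return len;
            }
        }
        return -1;
    }

    // Picks the rule with the longest prefix; the strict '>' means that on a
    // tie the rule listed first wins, as the FAQ describes.
    static String chooseKind(Map<String, Pattern> rules, String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            int len = longestPrefix(rule.getValue(), input);
            if (len > bestLen) {
                bestLen = len;
                best = rule.getKey();
            }
        }
        return best;
    }

    // Builds the rule list in declaration order.
    static Map<String, Pattern> rules(boolean expressionFirst) {
        Map<String, Pattern> m = new LinkedHashMap<>();
        if (expressionFirst) {
            m.put("EXPRESSION", EXPRESSION);
            m.put("TEXT", TEXT);
        } else {
            m.put("TEXT", TEXT);
            m.put("EXPRESSION", EXPRESSION);
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(chooseKind(rules(true), "a.bc.d"));   // EXPRESSION
        System.out.println(chooseKind(rules(false), "a.bc.d"));  // TEXT
        // Remaining input after "{": TEXT still wins the tie when listed first.
        System.out.println(chooseKind(rules(false), "a.bc.d}")); // TEXT
    }
}
```

The last call hints at why reordering alone may not be enough: both definitions match the same six characters, so whichever is listed first wins regardless of whether a brace came before.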

Gunther

If expressions only appear within { braces }, only expressions (and white space) appear in braces, and braces are only used to delimit expressions, then you can do something like the following. See question 3.11 in the FAQ if you are not familiar with lexical states.

// The following abbreviations hold in any state.
TOKEN : {
  < #LETTER : [ "a"-"z" ] >
| < #DIGIT : [ "0"-"9" ] >
| < #IDENTIFIER: < LETTER > (< LETTER >)* >
}

// Skip white space in either state
<DEFAULT,INBRACES> SKIP : { " "  | "\r" | "\t" | "\n" }

// The following are recognized in the default state.
// A left brace forces a switch to the INBRACES state.
<DEFAULT> TOKEN : {
  < DOT : "." >
| < LBRACE : "{" > : INBRACES
| < LBRACKET: "[" >
| < RBRACKET: "]" >
| < TEXT : (( < DOT >)* ( < LETTER > )+ (< DOT >)*)* >
}

// A right brace forces a switch to the DEFAULT state.
<DEFAULT, INBRACES> TOKEN : {
  < RBRACE : "}"  > : DEFAULT
}

// Expressions are only recognized in the INBRACES state.
<INBRACES> TOKEN : {
  < EXPRESSION : (< IDENTIFIER> < DOT > < IDENTIFIER> < DOT > < IDENTIFIER> ((< DOT > < IDENTIFIER> )* | < LBRACKET > (< DIGIT>)* < RBRACKET >)*)*>
}

It looks a bit dodgy that DOT is defined in one state and used in another. However, I think that it works fine.
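
To get a feel for what the two states do without running JavaCC, the switching can be imitated in plain Java. This is only a sketch under some assumptions: the class, enum and method names are invented for illustration, the patterns are hand-translated java.util.regex approximations of TEXT and EXPRESSION (non-empty variants), and the DOT, LBRACKET and RBRACKET tokens of the default state are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical simulation of the DEFAULT / INBRACES lexical states.
class LexicalStateDemo {
    enum State { DEFAULT, INBRACES }

    // Approximations of the token definitions (assumed translations).
    static final Pattern TEXT = Pattern.compile("(?:\\.*[a-z]+\\.*)+");
    static final Pattern EXPRESSION = Pattern.compile(
        "(?:[a-z]+\\.[a-z]+\\.[a-z]+(?:\\.[a-z]+|\\[[0-9]*\\])*)+");

    // End index of the longest match of p starting at pos, or -1 if none.
    static int longestEnd(Pattern p, String input, int pos) {
        for (int end = input.length(); end > pos; end--) {
            if (p.matcher(input).region(pos, end).matches()) {
                return end;
            }
        }
        return -1;
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == ' ' || c == '\r' || c == '\t' || c == '\n') {
                pos++;                       // SKIP applies in both states
            } else if (c == '{' && state == State.DEFAULT) {
                tokens.add("LBRACE");
                state = State.INBRACES;      // "{" switches to INBRACES
                pos++;
            } else if (c == '}') {
                tokens.add("RBRACE");
                state = State.DEFAULT;       // "}" switches back to DEFAULT
                pos++;
            } else {
                // Only one of the two competing kinds is active per state,
                // so there is no tie to break.
                boolean inBraces = (state == State.INBRACES);
                int end = longestEnd(inBraces ? EXPRESSION : TEXT, input, pos);
                if (end < 0) {
                    throw new IllegalArgumentException("Lexical error at " + pos);
                }
                String kind = inBraces ? "EXPRESSION" : "TEXT";
                tokens.add(kind + "(" + input.substring(pos, end) + ")");
                pos = end;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [TEXT(ab.c), LBRACE, EXPRESSION(a.bc.d), RBRACE, TEXT(a.bc.d)]
        System.out.println(tokenize("ab.c {a.bc.d} a.bc.d"));
    }
}
```

The same characters "a.bc.d" come out as EXPRESSION inside the braces and as TEXT outside them, because the state decides which definition is even considered.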

Theodore Norvell
  • Theodore, thank you again for your help. Lexical states do the job. This is a great example on how to use them, since these examples are so hard to find. – cocalars May 22 '13 at 18:02