
I'm trying to change a grammar in the JSqlParser project, which uses a JavaCC grammar file (.jj) to specify standard SQL syntax. I had difficulty getting one section to work, and I narrowed it down to the following much-simplified grammar.

Basically, I have a definition of Column: [ table "." ] field

But table itself can also contain the "." character, which causes confusion.

I think intuitively the following grammar should accept all the following sentences:

select mytable.myfield

select myfield

select mydb.mytable.myfield

But in practice it only accepts the 2nd and 3rd of these. Whenever it sees the ".", it goes on to demand the dotted form of table (i.e. the first derivation rule for Table).

How can I make this grammar work?

Thanks a lot, Yang

    options{
        IGNORE_CASE=true ;
        STATIC=false;
        DEBUG_PARSER=true;
        DEBUG_LOOKAHEAD=true;
        DEBUG_TOKEN_MANAGER=false;
    //  FORCE_LA_CHECK=true;
        UNICODE_INPUT=true;
    }

    PARSER_BEGIN(TT)

    import java.util.*;

    public class TT {

    }
    PARSER_END(TT)


    ///////////////////////////////////////////// main stuff concerned
    void Statement() :
    { }
    {
    <K_SELECT> Column()
    }

    void Column():
    {
    }
    {
    [LOOKAHEAD(3) Table()  "." ]
    //[ 
    //LOOKAHEAD(2) (
    //      LOOKAHEAD(5) <S_IDENTIFIER> "."  <S_IDENTIFIER>  
    //      |
    //      LOOKAHEAD(3) <S_IDENTIFIER>
    //)
    //
    //
    //
    //]

    Field()
    }

    void Field():
    {}{
       <S_IDENTIFIER>
    }

    void Table():
    {}{
            LOOKAHEAD(5) <S_IDENTIFIER> "."  <S_IDENTIFIER>
            |
            LOOKAHEAD(3) <S_IDENTIFIER>
    }

    ////////////////////////////////////////////////////////



    SKIP:
    {
        " "
    |   "\t"
    |   "\r"
    |   "\n"
    }

    TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
    {
    <K_CREATE: "CREATE">
    |
    <K_SELECT: "SELECT">
    }


    TOKEN : /* Numeric Constants */
    {
       < S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
                            |
                            <S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
                            |
                            <S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
                            )>
      |     < S_LONG: ( <DIGIT> )+ >
      |     < #DIGIT: ["0" - "9"] >
    }


    TOKEN:
    {
        < S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
    |   < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
    |   < #SPECIAL_CHARS: "$" | "_" | "#" | "@">
    |   < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
    |   < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >

    /*
    To deal with database names (columns, tables) using not only latin base characters, one
    can expand the following rule to accept additional letters. Here is the addition of german umlauts.

    There seems to be no way to recognize letters by an external function to allow
    a configurable addition. One must rebuild JSqlParser with this new "Letterset".
    */
    |   < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
    }
teddy teddy

4 Answers


You could rewrite your grammar like this:

    Statement --> "select" Column
    Column --> Prefix <ID>
    Prefix --> (<ID> ".")*

Now the only choice is whether to iterate or not. Assuming a "." can't follow a Column, this is easily done with a lookahead of 2:

    Statement --> "select" Column
    Column --> Prefix <ID>
    Prefix --> (LOOKAHEAD( <ID> ".") <ID> ".")*
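
To see what the two-token lookahead is doing, here is a minimal hand-written sketch in plain Java of roughly the decision a recursive-descent parser has to make for this grammar. It is not the code JavaCC generates; the class name `PrefixLookahead`, the method `parseSelect`, and the toy tokenizer (letters, digits and underscores only) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a hand-written equivalent of
//   Statement --> "select" Column
//   Column    --> Prefix <ID>
//   Prefix    --> (LOOKAHEAD(<ID> ".") <ID> ".")*
final class PrefixLookahead {

    /** Toy tokenizer (an assumption): identifiers and "." only, whitespace skipped. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '.') {
                tokens.add(".");
                i++;
            } else if (Character.isLetter(c) || c == '_') {
                int start = i;
                while (i < input.length()
                        && (Character.isLetterOrDigit(input.charAt(i)) || input.charAt(i) == '_')) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return tokens;
    }

    static boolean isDot(List<String> t, int pos) {
        return pos < t.size() && t.get(pos).equals(".");
    }

    static boolean isId(List<String> t, int pos) {
        return pos < t.size() && !t.get(pos).equals(".");
    }

    /** Returns the qualifier parts followed by the field name. */
    static List<String> parseSelect(String input) {
        List<String> t = tokenize(input);
        if (t.isEmpty() || !t.get(0).equalsIgnoreCase("select")) {
            throw new IllegalArgumentException("Expected \"select\"");
        }
        int pos = 1;
        List<String> parts = new ArrayList<>();
        // Prefix: iterate only while the next two tokens are <ID> ".".
        while (isId(t, pos) && isDot(t, pos + 1)) {
            parts.add(t.get(pos));
            pos += 2;
        }
        // Column: the final <ID> is the field.
        if (!isId(t, pos)) {
            throw new IllegalArgumentException("Expected identifier at token " + pos);
        }
        parts.add(t.get(pos++));
        if (pos != t.size()) {
            throw new IllegalArgumentException("Unexpected token: " + t.get(pos));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parseSelect("select myfield"));              // [myfield]
        System.out.println(parseSelect("select mytable.myfield"));      // [mytable, myfield]
        System.out.println(parseSelect("select mydb.mytable.myfield")); // [mydb, mytable, myfield]
    }
}
```

The loop never commits to an identifier as part of the prefix unless a "." follows it, so the last identifier is always left over for the field.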
Theodore Norvell
  • Thanks Theodore, but it doesn't seem to work. See my answer below (I had to open a new answer since comments don't allow pasting code). – teddy teddy May 08 '15 at 18:32
  • Theodore: after changing the LOOKAHEAD() values to numeric, it works. Thanks a lot. – teddy teddy May 08 '15 at 19:08
  • But I have to separate out the Prefix() rule into Prefix --> Table() "." | {} because in real use cases my original goal was to isolate the Table() definition in a single place. When I did this, the parser stopped working. – teddy teddy May 08 '15 at 19:22
  • I suspect the reason for the latest failure is that JavaCC's lookahead only looks as far as the length of the production itself. In this case the production cannot be determined unless you look OUTSIDE all the rules of the production itself (i.e. in this case you need to look for EOF). – teddy teddy May 08 '15 at 19:24
  • The reason it only worked with numeric lookahead is that you had not used the same lookahead as in my answer. – Theodore Norvell May 08 '15 at 20:02
  • Unfortunately I don't see a good way to isolate Table in its own nonterminal in a way that will make it useful elsewhere. The problem is that in most contexts, `a.b` is a table, but in select, it is not, unless it is followed by another ".". – Theodore Norvell May 08 '15 at 20:04
  • In parsing the two sentences "SELECT table.field" and "CREATE TABLE table", when "table" is encountered, the entire contents of the stack could be used to distinguish the derivation rules for table. Unfortunately, an LL parser uses only the top symbol on the stack plus the input stream, so in this case there is no way to decide: in the "create" case we need to slurp in the entire string "a.b.c", while in the "select" case we need to leave out the last ".field" part. So this seems to be a limitation of LL parsers. LR should be able to do it. – teddy teddy May 08 '15 at 21:30
  • Essentially, you are right: LR parsers are better at dealing with left context than LL parsers. In JavaCC, we are not limited to LL(1) because there are a lot of ways to use lookahead. Also you can use parameters to account for context. – Theodore Norvell May 09 '15 at 14:34

Indeed, the following grammar in flex+bison (an LR parser) works fine, recognizing all of the following sentences correctly:

create mydb.mytable
create mytable
select mydb.mytable.myfield
select mytable.myfield
select myfield

So it is indeed due to a limitation of LL parsers.

    %%

    statement:
            create_sentence
            |
            select_sentence
            ;

    create_sentence:  CREATE table
            ;

    select_sentence: SELECT  table '.'  ID
                    |
                    SELECT ID
                    ;

    table : table '.' ID
            |
            ID
            ;

    %%
teddy teddy

If you need Table to be its own nonterminal, you can do this by using a boolean parameter that says whether the table is expected to be followed by a dot.

    void Statement():{}{
        "select" Column() | "create" "table" Table(false) }

    void Column():{}{
        [LOOKAHEAD(<ID> ".") Table(true) "."] <ID> }

    void Table(boolean expectDot):{}{
        <ID> MoreTable(expectDot) }

    void MoreTable(boolean expectDot):{}{
        LOOKAHEAD("." <ID> ".", {expectDot}) "." <ID> MoreTable(expectDot)
    |
        LOOKAHEAD(".", {!expectDot}) "." <ID> MoreTable(expectDot)
    |
        {}
    }

Doing it this way precludes using Table in any syntactic lookahead specifications either directly or indirectly. E.g. you shouldn't have LOOKAHEAD( Table()) anywhere in your grammar, because semantic lookahead is not used during syntactic lookahead. See the FAQ for more information on that.
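
For anyone who finds the generated code hard to follow, here is a rough hand-written Java sketch of the same idea. It is not what JavaCC emits; the class `TableParam`, its helper methods, and the whitespace-based toy lexer are assumptions made for illustration. The `expectDot` flag plays the same role as the parameter in the grammar above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of:
//   Statement --> "select" Column | "create" "table" Table(false)
//   Column    --> [LOOKAHEAD(<ID> ".") Table(true) "."] <ID>
//   Table(expectDot) --> <ID> MoreTable(expectDot)
final class TableParam {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    TableParam(String input) {
        // Toy lexer (an assumption): isolate dots, then split on whitespace.
        for (String tok : input.trim().replace(".", " . ").split("\\s+")) {
            if (!tok.isEmpty()) {
                tokens.add(tok);
            }
        }
    }

    private boolean isDot(int offset) {
        int i = pos + offset;
        return i < tokens.size() && tokens.get(i).equals(".");
    }

    private boolean isId(int offset) {
        int i = pos + offset;
        return i < tokens.size() && tokens.get(i).matches("[A-Za-z_][A-Za-z0-9_]*");
    }

    private String consumeId() {
        if (!isId(0)) {
            throw new IllegalStateException("Identifier expected at token " + pos);
        }
        return tokens.get(pos++);
    }

    private void consumeKeyword(String keyword) {
        if (pos >= tokens.size() || !tokens.get(pos).equalsIgnoreCase(keyword)) {
            throw new IllegalStateException("Expected \"" + keyword + "\" at token " + pos);
        }
        pos++;
    }

    /** Table(expectDot); MoreTable is written as a loop instead of tail recursion. */
    String table(boolean expectDot) {
        StringBuilder name = new StringBuilder(consumeId());
        // expectDot: take "." <ID> only if another "." follows.
        // !expectDot: take "." <ID> whenever a "." is next.
        while (expectDot ? (isDot(0) && isId(1) && isDot(2)) : isDot(0)) {
            pos++; // consume "."
            name.append('.').append(consumeId());
        }
        return name.toString();
    }

    /** Column: optional Table(true) "." followed by the field identifier. */
    String column() {
        String table = null;
        if (isId(0) && isDot(1)) {
            table = table(true);
            pos++; // consume the "." that separates table and field
        }
        String field = consumeId();
        return "column[table=" + table + ", field=" + field + "]";
    }

    /** Statement: parses one sentence and returns a description of it. */
    static String parse(String input) {
        TableParam p = new TableParam(input);
        String result;
        if (!p.tokens.isEmpty() && p.tokens.get(0).equalsIgnoreCase("select")) {
            p.consumeKeyword("select");
            result = p.column();
        } else {
            p.consumeKeyword("create");
            p.consumeKeyword("table");
            result = "create[table=" + p.table(false) + "]";
        }
        if (p.pos != p.tokens.size()) {
            throw new IllegalStateException("Unexpected token: " + p.tokens.get(p.pos));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("select mydb.mytable.myfield"));
        System.out.println(parse("select myfield"));
        System.out.println(parse("create table mydb.mytable"));
    }
}
```

With `expectDot` set, the table stops one identifier short so that the trailing `"." <ID>` is left for the field; without it, the table consumes every dotted part.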

Theodore Norvell
  • Thanks, that's interesting; it's the first time I've seen this syntax. I'm curious how this translates to the LL parsing sequence. In bison you can look at the generated state table, but with JavaCC the only available diagnostic is the Java code, which is quite cryptic. – teddy teddy May 09 '15 at 16:35
  • It's very interesting that in the case of an LR parser, as soon as the parser sees SELECT vs CREATE, the parsing is immediately directed to 2 different states. In an LL parser, the derivation is chosen, but after matching up the head of the input (create or select), the stuff remaining on the stack is "table" (plus something) in both cases, so there is no way for the parser to distinguish. In other words, the LR parser has that boolean built in, while LL has to specify it with a variable. – teddy teddy May 10 '15 at 07:35
  • JavaCC is not LL(1) or really LL anything. It generates recursive descent parsers. By default it resolves choices on the basis of the next token of input, so it is very similar to LL(1) by default, but the flexibility of lookahead specifications makes it easy to use grammars that are not LL(1). – Theodore Norvell May 11 '15 at 15:21
  • Yes, I mean it's a limitation of all LL(k) parsers. The improved version you gave in effect has 2 completely unrelated "table" nonterminals. LL chops off already matched terminals and nonterminals, while LR keeps them on the stack (or encoded through state), so it has more info for the parsing decision. – teddy teddy May 12 '15 at 16:06
  • Yes. This is why every LL(k) grammar is LR(k), but the converse is not true. However you can't extrapolate from this theoretical result to say that Bison is more expressive than JavaCC for at least three reasons. First, Bison is based on LALR, not LR(1), and LALR is not a superset of LL(1). Second, Bison accepts grammars that aren't LALR, although with a warning. Third, JavaCC is only loosely based on LL(1) and can be used with lots of grammars that are not LL(1). – Theodore Norvell May 12 '15 at 19:53
  • If I try to use an arbitrary boolean expression in the semantic lookahead that includes the variable, JavaCC fails to include that variable in the generated method call chains. For example I see `private boolean jj_3R_1() { if (!jj_rescan) trace_call("Table(LOOKING AHEAD...)"); Token xsp; xsp = jj_scanpos; jj_lookingAhead = true; jj_semLA = (more = more); jj_lookingAhead = false; if (!jj_semLA || jj_3_2()) {` – teddy teddy May 12 '15 at 20:56
  • I know. Weird isn't it? This is documented in the FAQ. This is why I said you wouldn't be able to use Table in syntactic lookahead. – Theodore Norvell May 12 '15 at 23:50

Your examples are parsed perfectly well using JSqlParser V0.9.x (https://github.com/JSQLParser/JSqlParser):

    CCJSqlParserUtil.parse("SELECT mycolumn");
    CCJSqlParserUtil.parse("SELECT mytable.mycolumn");
    CCJSqlParserUtil.parse("SELECT mydatabase.mytable.mycolumn");
wumpz