I'm writing Java software to parse SQL queries. To do so, I'm using ANTLR with the presto.g4 grammar. The code I'm currently using is pretty standard:

PrestoLexer lexer = new PrestoLexer(
        new CaseChangingCharStream(CharStreams.fromString(query), true));

lexer.removeErrorListeners();
lexer.addErrorListener(errorListener);

CommonTokenStream tokens = new CommonTokenStream(lexer);
PrestoParser parser = new PrestoParser(tokens);

Is it possible to pass a parameter to the lexer so that lexing behaves differently depending on that parameter?

Update: I followed @Mike's suggestion below. My lexer now inherits from the built-in lexer and adds a predicate function. My remaining issue is purely a grammar one.

This is my string definition:


STRING
    : '\'' ( '\\' .
           | '\\\\'  .  {HelperUtils.isNeedSpecialEscaping(this)}?       // match \ followed by any char
           | ~[\\']       // match anything other than \ and '
           | '\'\''       // match ''
           )*
      '\''
    ;

I sometimes have a query with weird escaping for which the predicate returns true. For example:


select 
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features 
from table1

When I try to parse it, I get '\'',''),' as a single string. How can I handle this case?

Nir99

2 Answers

I don't know what you need the parameter for, but you mentioned SQL, so let me present a solution I have used for years: semantic predicates.

In MySQL (the dialect I work with), the syntax differs depending on the MySQL version number. So in my grammar I use semantic predicates to enable or disable the language parts that belong to a specific version. The approach is simple:

test:
    {serverVersion < 80014}? ADMIN_SYMBOL
    | ONLY_SYMBOL
;

The ADMIN keyword is only acceptable for version < 8.0.14 (just an example, not true in reality), while the ONLY keyword is a possible alternative in any version.
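
To make the effect of the predicate concrete, here is a minimal plain-Java sketch of the decision the test rule above encodes. It does not use the ANTLR runtime: the class GatedRuleSketch and the method matchesTest are hypothetical names invented for illustration, and the token names are treated as plain strings.

```java
// Plain-Java stand-in for the `test` rule; hypothetical names, no ANTLR runtime.
final class GatedRuleSketch {
    /** Returns true if `token` is accepted by the rule for the given server version. */
    static boolean matchesTest(String token, int serverVersion) {
        // Alternative 1 is only viable while its predicate holds.
        if (serverVersion < 80014 && token.equals("ADMIN_SYMBOL")) {
            return true;
        }
        // Alternative 2 carries no predicate, so it is viable in any version.
        return token.equals("ONLY_SYMBOL");
    }
}
```

When the predicate is false, the first alternative simply drops out of consideration, so the same input is rejected rather than parsed differently.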

The variable serverVersion is a member of a base class from which I derive my parser. That can be specified by:

options {
    superClass = MySQLBaseRecognizer;
    tokenVocab = MySQLLexer;
}

The lexer is also derived from that class, so the version number is available in both the lexer and the parser (in addition to other important settings like the SQL mode). With this approach you can also implement more complex predicate functions that need additional processing.
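
A note for Java targets: a generated lexer has to extend ANTLR's Lexer and a generated parser has to extend Parser, so the two cannot literally share one superclass there. One possible way to get the same effect is to give both base classes the same settings object. The sketch below shows only that idea in plain Java. DialectSettings, SketchLexerBase, SketchParserBase and versionBelow are hypothetical names, and in a real project the two base classes would extend Lexer and Parser and be named in each grammar's superClass option.

```java
// Hypothetical holder for dialect settings; all names are illustrative.
final class DialectSettings {
    final int serverVersion;   // e.g. 80013 for 8.0.13
    final String sqlMode;      // stand-in for the other settings mentioned above

    DialectSettings(int serverVersion, String sqlMode) {
        this.serverVersion = serverVersion;
        this.sqlMode = sqlMode;
    }
}

// Stand-in for the class a lexer grammar would name as its superClass.
class SketchLexerBase {
    protected final DialectSettings settings;

    SketchLexerBase(DialectSettings settings) {
        this.settings = settings;
    }

    // A lexer predicate such as {versionBelow(80014)}? could call this.
    boolean versionBelow(int limit) {
        return settings.serverVersion < limit;
    }
}

// Stand-in for the class a parser grammar would name as its superClass.
class SketchParserBase {
    protected final DialectSettings settings;

    SketchParserBase(DialectSettings settings) {
        this.settings = settings;
    }

    // A parser predicate can ask the same question of the same settings.
    boolean versionBelow(int limit) {
        return settings.serverVersion < limit;
    }
}
```

Because both objects read the same DialectSettings instance, a predicate in the lexer and one in the parser always agree on the version they see.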

You can find the full code and grammars in the MySQL Workbench GitHub repository.

Mike Lischke
  • Thanks @Mike! This was in fact very useful. I have another issue with the parsing now. I'll elaborate in the original post. – Nir99 Dec 30 '20 at 11:31
  • @Nir99 Better to open a new question on SO for the grammar part, which will allow others to chime in with solutions. – Mike Lischke Dec 30 '20 at 14:17
  • Already opened one: https://stackoverflow.com/questions/65506538/handling-different-escaping-sequences – Nir99 Dec 30 '20 at 14:22

Is it possible to pass a parameter to the lexer so that lexing behaves differently depending on that parameter?

No, the lexer works independently from the parser. You cannot direct the lexer while parsing.

Bart Kiers