
I need to parse user input that defines queries to a system. At the heart of such queries are triplets, which can also be combined to form complex queries (the idea is to restrict a result set to entries that satisfy these queries). Here are three sample inputs:

field1 = simpleValueNoQuotes
field2 ~ "valueWithQuotes"
(field1 = simpleValueNoQuotes OR field2 ~ "valueWithQuotes") AND field3 = foobar

Users must quote a value if it contains whitespace or any reserved character, such as double quotes or parentheses.

So far, my grammar has handled this well enough, but now a new requirement has come up. Users should be allowed to omit the spaces, entering queries like field1=simpleValueNoQuotes. My grammar can't handle this, and I can't figure out why (this is my first project with ANTLR).

Here is my grammar in a slightly simplified form:

grammar simple;

querytree   :   query EOF;

query   :   subquery (operator subquery)* ;

subquery    :   leaf | composite;

operator    :   'and' | 'or';

leaf    :   fieldname comparison value;

value   :   DOUBLEQUOTE_DELIMITED_VALUE | SIMPLE_VALUE;

composite   :   leftParenthesis query rightParenthesis; 

fieldname   :   'field1' | 'field2'; //this has many keywords in reality

comparison  :   '=' | '~';

leftParenthesis     :   '(';
rightParenthesis    :   ')';

fragment
ESCAPE  :   '\\' ( '"' | '\\') ;

DOUBLEQUOTE_DELIMITED_VALUE 
:   '"' ( ~( '"' | '\\' ) | ESCAPE )* '"'
;   

SIMPLE_VALUE
:   ('\u0021'|'\u0023'..'\u0027'|'\u002A'..'\u007E'|'\u00A1'..'\uFFFF')*;   /*all unicode characters except control characters, doublequotes, parentheses and whitespace defined below*/

WHITESPACE
:   ('\u0009'|'\u000A'|'\u000C'|'\u000D'|'\u0020'|'\u00A0')+    {$channel = HIDDEN;}   /*\t, \n, \f, \r, space, nonbreaking space*/
;   

Any ideas as to why this is able to parse field1 = simpleValueNoQuotes but unable to parse field1=simpleValueNoQuotes?

peedee
  • SIMPLE_VALUE matches an empty token which is most of the time a problem in itself, so either make it a ()+ or optional in the value rule. Can you give us two examples, one that matches and one which does not? – Mike Lischke Apr 03 '13 at 11:10
  • Thanks for the hint about the empty token; it makes sense. Unfortunately, it doesn't address the bigger issue. I gave an example of matching and non-matching input at the end, after my grammar; did you see that? – peedee Apr 03 '13 at 11:26

1 Answer


You forgot to exclude = from SIMPLE_VALUE: the range '\u002A'..'\u007E' covers both = (U+003D) and ~ (U+007E), which means field1=simpleValueNoQuotes is lexed as a single SIMPLE_VALUE token.

Sam Harwell
  • You are absolutely right, that does solve the problem, but I'm not sure I understand why. Shouldn't the "comparison" rule which matches =/~ take precedence over SIMPLE_VALUE? Why does field1 not get matched to SIMPLE_VALUE? – peedee Apr 03 '13 at 13:30
  • The longest match always takes precedence over a shorter one, so the lexer takes `field1=simpleValueNoQuotes` over `field1`, and the tokens defined by the `comparison` rule never even get a chance. – Sam Harwell Apr 03 '13 at 13:34
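
The longest-match behavior discussed in this thread can be illustrated with a small Python model of the lexer. This is a simplified sketch, not ANTLR's actual implementation. The names `in_simple_value`, `tokenize`, and the `excluded` parameter are made up for illustration, quoted values are left out, only the literals from the simplified grammar are listed, and the tie-breaking rule (a literal wins over SIMPLE_VALUE when both match the same length) is an assumption of this model. The code point ranges are copied from the SIMPLE_VALUE rule in the question.

```python
# Code point ranges copied from SIMPLE_VALUE in the question's grammar.
SIMPLE_VALUE_RANGES = [(0x21, 0x21), (0x23, 0x27), (0x2A, 0x7E), (0xA1, 0xFFFF)]
# Literals that appear in the parser rules of the simplified grammar.
LITERALS = ["field1", "field2", "and", "or", "=", "~", "(", ")"]
# Characters of the WHITESPACE rule: \t, \n, \f, \r, space, non-breaking space.
WHITESPACE = "\t\n\f\r \u00a0"


def in_simple_value(ch, excluded=""):
    """Return True if ch may appear in SIMPLE_VALUE (hypothetical helper)."""
    if ch in excluded:
        return False
    return any(lo <= ord(ch) <= hi for lo, hi in SIMPLE_VALUE_RANGES)


def tokenize(text, excluded=""):
    """Longest-match tokenizer; a literal wins a tie (assumption of this model)."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] in WHITESPACE:
            pos += 1  # whitespace never reaches the parser
            continue
        kind, length = None, 0
        # Candidate 1: the longest literal that starts at this position.
        for literal in LITERALS:
            if text.startswith(literal, pos) and len(literal) > length:
                kind, length = "LITERAL", len(literal)
        # Candidate 2: the longest run of SIMPLE_VALUE characters.
        end = pos
        while end < len(text) and in_simple_value(text[end], excluded):
            end += 1
        if end - pos > length:  # strictly longer, so literals win ties
            kind, length = "SIMPLE_VALUE", end - pos
        if length == 0:
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append((kind, text[pos:pos + length]))
        pos += length
    return tokens


print(tokenize("field1 = simpleValueNoQuotes"))
print(tokenize("field1=simpleValueNoQuotes"))
print(tokenize("field1=simpleValueNoQuotes", excluded="=~"))
```

In this model, the spaced input splits into three tokens because whitespace ends each SIMPLE_VALUE run. The unspaced input becomes one SIMPLE_VALUE token, since `=` falls inside the `'\u002A'..'\u007E'` range. Passing `excluded="=~"` mimics removing the comparison characters from SIMPLE_VALUE and yields three tokens again. One consequence of that fix is that unquoted values can no longer contain `=` or `~`, so such values would have to be quoted.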