Despite my limited knowledge about compiling/parsing I dared to build a small recursive-descent parser for OData $filter expressions. The parser only needs to check the expression for correctness and output a corresponding condition in SQL. As input and output have almost the same tokens and structure, this was fairly straightforward, and my implementation does 90% of what I want.
But now I got stuck with parentheses, which appear in separate rules for logical and arithmetic expressions. The full OData grammar in ABNF is here, a condensed version of the rules involved is this:
boolCommonExpr = ( boolMethodCallExpr
/ notExpr
/ commonExpr [ eqExpr / neExpr / ltExpr / ... ]
/ boolParenExpr
) [ andExpr / orExpr ]
commonExpr = ( primitiveLiteral
/ firstMemberExpr ; = identifier
/ methodCallExpr
/ parenExpr
) [ addExpr / subExpr / mulExpr / divExpr / modExpr ]
boolParenExpr = "(" boolCommonExpr ")"
parenExpr = "(" commonExpr ")"
How does this grammar match a simple expression like (1 eq 2)
? From what I can see all (
are consumed by the rule parenExpr
inside commonExpr
, i.e. they must also close after commonExpr
to not cause an error and boolParenExpr
never gets hit. I suppose my experience / intuition on reading such a grammar is just insufficient to get it. A comment in the ABNF says: "Note that boolCommonExpr is also a commonExpr". Maybe that's part of the mystery?
Obviously an opening (
alone won't tell me where it's going to close: After the current commonExpr
expression or further away after boolCommonExpr
. My lexer has a list of all tokens ahead (URL is very short input). I was thinking to use that to find out what type of (
I have. Good idea?
I'd rather have restrictions in input or a little hack than switching to a generally more powerful parser model. For a simple expression translation like this I also want to avoid compiler tools.
Edit 1: Extension after answer by rici - Is grammar rewrite correct?
Actually I started out with the example for recursive-descent parsers given on Wikipedia. Then I though to better adapt to the official grammar given by the OData standard to be more "conformant". But with the advice from rici (and the comment from "Internal Server Error") to rewrite the grammar I would tend to go back to the more comprehensible structure provided on Wikipedia.
Adapted to the boolean expression for the OData $filter this could maybe look like this:
boolSequence= boolExpr {("and"|"or") boolExpr} .
boolExpr = ["not"] expression ("eq"|"ne"|"lt"|"gt"|"lt"|"le") expression .
expression = term {("add"|"sum") term} .
term = factor {("mul"|"div"|"mod") factor} .
factor = IDENT | methodCall | LITERAL | "(" boolSequence")" .
methodCall = METHODNAME "(" [ expression {"," expression} ] ")" .
Does the above make sense in general for boolean expressions, is it mostly equivalent to the original structure above and digestible for a recursive descent parser?
@rici: Thanks for your detailed remarks on type checking. The new grammar should resolve your concerns about precedence in arithmetic expressions.
For all three terminals (UPPERCASE in the grammar above) my lexer supplies a type (string, number, datetime or boolean). Non-terminals return the type they produce. With this I managed quite nicely do type checking on the fly in my current implementation, including decent error messages. Hopefully this will also work for the new grammar.
Edit 2: Return to original OData grammar
The differentiation between a "logical" and "arithmetic" (
is not a trivial one. To solve the problem even N.Wirth uses a dodgy workaround to keep the grammar of Pascal simple. As a consequence, in Pascal an extra pair of ()
is mandatory around and
and or
expressions. Neither intuitive nor OData conformant :-(. The best read about the "() difficulty" I found is in Let's Build a Compiler (Part VI). Other languages seem to go to great length in the grammar to solve the problem. As I don't have experience with grammar construction I stopped doing my own.
I ended up implementing the original OData grammar. Before I run the parser I go over all tokens backwards to figure out which (
belong to a logical/arithmetic expression. Not a problem for the potential length of a URL.