I'm trying to create a boolean expression parser that evaluates whether the words in the expression appear in the content of a document.
After many hours of research (I had no idea about parsers and their theory before this), I haven't been able to come up with a valid grammar that doesn't suffer from left recursion, or anything that works correctly at all, and I'm getting quite frustrated because none of this theory was explained to us in class.
Basically, I need to parse expressions like ({w1 w2 w3} & !w4) | (w5 & "mark likes food"), where {} encloses a set of words that must all be in the document, and "" a string literal that must be in the document.
I came up with the tokens [AND, OR, NOT, LPAREN, RPAREN, LSET, RSET, LSEQ, RSEQ, WRD], so that the expression (w1 & {w2 w3}) would become [LPAREN WRD AND LSET WRD WRD RSET RPAREN].
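For reference, this is roughly what the Token enum nested inside my BoolExprTokenizer looks like (END is what getNext() returns once the input is exhausted):

public enum Token {
    AND, OR, NOT,      // & | !
    LPAREN, RPAREN,    // ( )
    LSET, RSET,        // { }
    LSEQ, RSEQ,        // the quotes delimiting a string literal
    WRD,               // a single word
    END                // end of input
}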
But I'm having trouble coming up with a grammar that can actually be parsed. On my very first try I came up with:
S -> E
E -> T AND T | T OR T | T
T -> LSET W RSET | LSEQ W RSEQ | LPAREN E RPAREN | NOT E | WRD
W -> WRD* [I don't really know how to write this formally, but it's meant to accept a sequence of WRD tokens until the closing RSET or RSEQ, depending on which delimiter it started with.]
This obviously doesn't work at all: the expression isn't parsed all the way through (parsing stops after the first return), the parentheses aren't handled correctly, among other problems. It's been a couple of days and I can't seem to come up with anything useful; please help.
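From what I've read about recursive descent, I suspect the grammar needs to be split into precedence levels that use repetition instead of a single binary rule, something along these lines (I'm not at all sure this is right):

E -> T (OR T)*
T -> F (AND F)*
F -> NOT F | LPAREN E RPAREN | LSET W RSET | LSEQ W RSEQ | WRD
W -> WRD+

but I don't know how to translate that correctly into the code.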
For the code, I took inspiration from this, but I don't think it really fits my problem.
Code (I've tested the tokenizer and it works correctly):
public class BoolExprParser {

    private final String expression;
    private final BoolExprTokenizer tokenizer;
    private BoolExprTokenizer.Token currentToken;

    // Assuming the tokenizer is built from the expression string;
    // adjust to however BoolExprTokenizer is actually constructed.
    public BoolExprParser(String expression) {
        this.expression = expression;
        this.tokenizer = new BoolExprTokenizer(expression);
    }

    // Fetch the next token from the tokenizer.
    private void advance() {
        currentToken = tokenizer.getNext();
    }

    private boolean currentEquals(BoolExprTokenizer.Token t) {
        return currentToken == t;
    }

    // Entry point: parse and evaluate the whole expression against the document.
    public boolean parse(Document doc) {
        advance();
        boolean val = expr(doc);
        if (!currentEquals(BoolExprTokenizer.Token.END)) {
            // error: trailing tokens after the expression
        }
        return val;
    }

    // Handles at most one AND/OR between two sub-expressions and then returns.
    private boolean expr(Document doc) {
        boolean leftExpr = subExpr(doc);
        switch (currentToken) {
            case AND:
                advance();
                boolean rightExpr = subExpr(doc);
                return leftExpr && rightExpr;
            case OR:
                advance();
                rightExpr = subExpr(doc);
                return leftExpr || rightExpr;
            case END:
                return leftExpr;
            default:
                // error: unexpected token
        }
        return false;
    }

    // Handles NOT, a single word, a {...} set, or a parenthesised expression.
    private boolean subExpr(Document doc) {
        switch (currentToken) {
            case NOT:
                advance();
                boolean result = expr(doc);
                return !result;
            case WRD:
                advance();
                return doc.isWord(tokenizer.getWord());
            case LSET:
                advance();
                boolean wordsInsideSet = wordsSet(doc);
                if (!currentEquals(BoolExprTokenizer.Token.RSET)) {
                    // error: missing closing }
                } else {
                    advance();
                }
                return wordsInsideSet;
            case LPAREN:
                advance();
                boolean exprInside = expr(doc);
                if (!currentEquals(BoolExprTokenizer.Token.RPAREN)) {
                    // error: missing closing )
                } else {
                    advance();
                }
                return exprInside;
            default:
                // error: unexpected token
        }
        return false;
    }

    // Consumes WRD tokens until the closing RSET, checking each one against the document.
    private boolean wordsSet(Document doc) {
        boolean validToken = currentEquals(BoolExprTokenizer.Token.WRD);
        boolean isInDoc = true;
        while (validToken) {
            if (isInDoc) isInDoc = doc.isWord(tokenizer.getWord());
            advance();
            validToken = currentEquals(BoolExprTokenizer.Token.WRD);
        }
        if (!currentEquals(BoolExprTokenizer.Token.RSET)) {
            // error: missing closing }
        }
        return isInDoc;
    }
}
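One idea I had but haven't properly tested: make the binary-operator level loop instead of returning after the first operator, and add a new term() level so AND binds tighter than OR (this reuses the advance()/currentEquals()/subExpr() helpers from above, so it's only a sketch of how expr() might change):

    // Untested sketch: expr() split into two looping levels,
    // following the E -> T (OR T)*, T -> F (AND F)* idea from above.
    private boolean expr(Document doc) {            // OR level (lowest precedence)
        boolean value = term(doc);
        while (currentEquals(BoolExprTokenizer.Token.OR)) {
            advance();
            boolean right = term(doc);              // always parse the right-hand side
            value = value || right;
        }
        return value;
    }

    private boolean term(Document doc) {            // AND level
        boolean value = subExpr(doc);
        while (currentEquals(BoolExprTokenizer.Token.AND)) {
            advance();
            boolean right = subExpr(doc);
            value = value && right;
        }
        return value;
    }

I'm also guessing that the NOT case in subExpr() should call subExpr(doc) (or whatever the lowest level ends up being) rather than expr(doc), so that !w4 & w5 is read as (!w4) & w5 instead of !(w4 & w5), but I'd appreciate confirmation that this is the right direction.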