
I'm trying to write a parser for boolean expressions that evaluates whether the content of a document satisfies the expression.

After many hours of research (I knew nothing about parsers or the theory behind them before this), I still can't come up with a valid grammar that avoids left recursion, or even one that simply works correctly. I'm getting quite frustrated, as none of this theory was explained to us in class.

Basically, I need to parse expressions such as: ({w1 w2 w3} & !w4) | (w5 & "mark likes food"), where {} represents a set of words that must all be in the document, and "" a string literal that must be in the document.

I came up with the tokens [AND, OR, NOT, LPAREN, RPAREN, LSET, RSET, LSEQ, RSEQ, WRD], so that the expression (w1 & {w2 w2}) would become [LPAREN WRD AND LSET WRD WRD RSET RPAREN].

But I'm having trouble writing a grammar that can actually be parsed. On my very first try I came up with:

S -> E

E -> T AND T | T OR T | T

T -> LSET W RSET | LSEQ W RSEQ | LPAREN E RPAREN | NOT E | WRD

W -> WRD* [I don't really know how to write this formally, but it should accept a sequence of WRD tokens up to the closing RSET or RSEQ, depending on which delimiter opened it.]

This obviously doesn't work: the expression isn't evaluated in full (parsing stops after the first return), and the parentheses are not handled correctly, among other problems. I've been stuck for a couple of days without coming up with anything useful. Please help.

For the code I took inspiration from this, but I don't think it really fits my problem.

Code (I've tested the tokenizer and it works correctly):

public class BoolExprParser {

    private final String expression;
    private final BoolExprTokenizer tokenizer;
    private BoolExprTokenizer.Token currentToken;

    private void advance() {
        currentToken = tokenizer.getNext();
    }

    private boolean currentEquals(BoolExprTokenizer.Token t) {
        return currentToken == t;
    }

    private boolean parse(Document doc) {
        advance();
        boolean val = expr(doc);
        if (!currentEquals(BoolExprTokenizer.Token.END)) {
            // error
        }

        return val;
    }

    private boolean expr(Document doc) {
        boolean leftExpr = subExpr(doc);
        switch (currentToken) {
            case AND:
                advance();
                boolean rightExpr = subExpr(doc);
                return leftExpr && rightExpr;
            case OR:
                advance();
                rightExpr = subExpr(doc);
                return leftExpr || rightExpr;
            case END:
                return leftExpr;
            default:
                //error
        }

        return false;
    }

    private boolean subExpr(Document doc) {
        switch (currentToken) {
            case NOT:
                advance();
                boolean result = expr(doc);
                return !result;
            case WRD:
                advance();
                return doc.isWord(tokenizer.getWord());
            case LSet:
                advance();
                boolean wordsInsideSet = wordsSet(doc);
                if (!currentEquals(BoolExprTokenizer.Token.RSet)) {
                    // error
                } else {
                    advance();
                }
                return wordsInsideSet;
            case LP:
                advance();
                boolean exprInside = expr(doc);
                if (!currentEquals(BoolExprTokenizer.Token.RP)) {
                    // error
                } else {
                    advance();
                }
                return exprInside;
            default:
                // error
        }

        return false;
    }

    private boolean wordsSet(Document doc) {
        boolean validToken = currentEquals(BoolExprTokenizer.Token.WRD);
        boolean isInDoc = true;

        while (validToken) {
            if (isInDoc) isInDoc = doc.isWord(tokenizer.getWord());
            advance();
            validToken = currentEquals(BoolExprTokenizer.Token.WRD);
        }

        if (!currentEquals(BoolExprTokenizer.Token.RSet)) {
            // error
        }

        return isInDoc;
    }
}
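
One common way around the two problems described above (only a single operator being consumed, and left recursion) is to give each precedence level its own rule and to write repetition as a loop instead of a recursive production. The sketch below is only an illustration under stated assumptions: it assumes `!` binds tightest, then `&`, then `|`, which the question does not specify. The class `BoolQuerySketch`, its `matches` method, and the character-level scanning are hypothetical stand-ins for the existing tokenizer and `Document` class, and document lookup is simplified to whitespace-split words plus a plain substring test for phrases.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch, not the original BoolExprParser. */
public class BoolQuerySketch {
    private final String src;        // query text
    private final String text;       // raw document text, used for "phrase" lookups
    private final Set<String> words; // whitespace-split document words
    private int pos = 0;             // current position in src

    private BoolQuerySketch(String src, String text) {
        this.src = src;
        this.text = text;
        this.words = new HashSet<>(Arrays.asList(text.split("\\s+")));
    }

    /** Returns true if the document text satisfies the query. */
    public static boolean matches(String query, String documentText) {
        BoolQuerySketch p = new BoolQuerySketch(query, documentText);
        boolean result = p.expr();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return result;
    }

    // expr -> term ('|' term)*
    private boolean expr() {
        boolean value = term();
        while (accept('|')) {
            boolean right = term(); // parse the right side even when value is already true
            value = value || right;
        }
        return value;
    }

    // term -> factor ('&' factor)*
    private boolean term() {
        boolean value = factor();
        while (accept('&')) {
            boolean right = factor();
            value = value && right;
        }
        return value;
    }

    // factor -> '!' factor | '(' expr ')' | '{' word* '}' | '"' phrase '"' | word
    private boolean factor() {
        if (accept('!')) return !factor();
        if (accept('(')) {
            boolean inner = expr();
            expect(')');
            return inner;
        }
        if (accept('{')) {
            boolean all = true;
            while (!accept('}')) {
                all &= words.contains(word()); // every word in the set must be present
            }
            return all;
        }
        if (accept('"')) {
            int end = src.indexOf('"', pos);
            if (end < 0) throw new IllegalArgumentException("unterminated phrase");
            String phrase = src.substring(pos, end);
            pos = end + 1;
            return text.contains(phrase); // simplification: plain substring match
        }
        return words.contains(word());
    }

    private String word() {
        skipSpaces();
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("word expected at position " + pos);
        return src.substring(start, pos);
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (!accept(c)) throw new IllegalArgumentException("'" + c + "' expected at position " + pos);
    }
}
```

Because each rule loops over its own operator, an input like `a | b & c` is consumed in full and groups as `a | (b & c)` under the assumed precedence. Both operands are always parsed before they are combined, so Java's short-circuiting never leaves part of the input unread.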


Jafeth
  • I suspect that a parser is more technology than your project needs. If I understand your example correctly, `({w1 w2 w3} & !w4) | (w5 & "mark likes food")` translates to: take five words from some list. The document must contain either w1, w2, and w3 in any order and not w4, or w5 and the exact phrase `"mark likes food"`. If my understanding is correct, you just parse the `String`, put the words into a `java.util.List`, and, taking one word or phrase at a time, search the document for the presence or absence of that word or phrase. – Gilbert Le Blanc Nov 06 '22 at 15:35
  • Thanks for the response. I have thought about this, but I don't see any other way. How would I be able to correctly evaluate very complex expressions in the correct order, following the right logic, nested parentheses, etc., if not with a parser? – Jafeth Nov 06 '22 at 16:22
  • You may not have covered this in class, but one way to handle complex expressions is with [Reverse Polish notation](https://en.wikipedia.org/wiki/Reverse_Polish_notation) or postfix notation. – Gilbert Le Blanc Nov 07 '22 at 00:33
  • Managed to implement it with a shunting yard, didn't know it existed before. You saved my ass buddy! – Jafeth Nov 08 '22 at 16:34
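
The shunting-yard approach mentioned in the last comments can be sketched roughly as follows. The actual implementation was not posted, so this is a guess at its general shape and not the asker's code. It assumes the tokens arrive as plain strings, that sets and phrases have already been collapsed into single operand tokens, and that `!` binds tighter than `&`, which binds tighter than `|`. The class `ShuntingYardSketch` and the `inDocument` predicate are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: shunting-yard conversion to postfix, then stack evaluation. */
public class ShuntingYardSketch {

    // Assumed precedence: ! > & > |. Anything else (operands, parentheses) gets -1.
    private static int precedence(String op) {
        switch (op) {
            case "!": return 3;
            case "&": return 2;
            case "|": return 1;
            default:  return -1;
        }
    }

    /** Converts infix tokens to postfix (Reverse Polish) order. */
    public static List<String> toPostfix(List<String> infix) {
        List<String> output = new ArrayList<>();
        Deque<String> ops = new ArrayDeque<>();
        for (String tok : infix) {
            if (tok.equals("(")) {
                ops.push(tok);
            } else if (tok.equals(")")) {
                while (!ops.isEmpty() && !ops.peek().equals("(")) output.add(ops.pop());
                if (ops.isEmpty()) throw new IllegalArgumentException("unbalanced ')'");
                ops.pop(); // discard the matching "("
            } else if (tok.equals("!")) {
                ops.push(tok); // prefix operator: pushing it never pops anything
            } else if (precedence(tok) > 0) {
                // binary, left-associative: pop operators of equal or higher precedence
                while (!ops.isEmpty() && precedence(ops.peek()) >= precedence(tok)) {
                    output.add(ops.pop());
                }
                ops.push(tok);
            } else {
                output.add(tok); // operand
            }
        }
        while (!ops.isEmpty()) {
            String op = ops.pop();
            if (op.equals("(")) throw new IllegalArgumentException("unbalanced '('");
            output.add(op);
        }
        return output;
    }

    /** Evaluates postfix tokens; inDocument decides whether an operand is present. */
    public static boolean evaluate(List<String> postfix, Predicate<String> inDocument) {
        Deque<Boolean> stack = new ArrayDeque<>();
        for (String tok : postfix) {
            switch (tok) {
                case "!": stack.push(!stack.pop()); break;
                case "&": { boolean r = stack.pop(), l = stack.pop(); stack.push(l && r); break; }
                case "|": { boolean r = stack.pop(), l = stack.pop(); stack.push(l || r); break; }
                default:  stack.push(inDocument.test(tok));
            }
        }
        boolean result = stack.pop();
        if (!stack.isEmpty()) throw new IllegalArgumentException("malformed expression");
        return result;
    }
}
```

For example, the tokens `( w1 & ! w4 ) | w5` come out in postfix order as `w1 w4 ! & w5 |`, which the stack loop then evaluates left to right without needing any grammar.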
