
I have an app that includes a three-operator (&, |, !) boolean expression evaluator, with variables and constants. Generally the expressions aren't too long (perhaps 50 terms at the most, but usually a lot fewer). There can be very many expressions: I'm expecting the upper limit to be around a million. Currently I have a hand-written parser with a very simple evaluator that recursively traverses the parse tree. One constraint is that this has to be callable from C++. Nothing is shared between expressions. I'd like to investigate speeding this up.

I see two avenues of research.

  1. Add sharing and store the state indicating whether an expression node has been evaluated or not.
  2. Extract Common Subexpressions.
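To make the two avenues concrete, here is a minimal sketch of how they might be combined: structurally identical subexpressions are merged when nodes are created (hash-consing), and each node carries a stamp recording whether it has already been evaluated for the current variable assignment. The names `ExprPool`, `Node`, `Op` and `newAssignment` are hypothetical and not taken from the existing app.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical node kinds for the three operators plus leaves.
enum class Op : std::uint8_t { Const, Var, Not, And, Or };

struct Node {
    Op op;
    int a;                    // Const: 0 or 1, Var: variable index, else left child id
    int b;                    // right child id for And/Or, otherwise -1
    std::uint32_t stamp = 0;  // epoch in which `value` was last computed
    bool value = false;
};

class ExprPool {
public:
    // Returns the id of an existing structurally identical node, or creates one.
    int make(Op op, int a, int b = -1) {
        // And/Or are commutative, so order the operands to share x&y with y&x.
        if ((op == Op::And || op == Op::Or) && a > b) std::swap(a, b);
        auto key = std::make_tuple(op, a, b);
        auto it = index_.find(key);
        if (it != index_.end()) return it->second;
        nodes_.push_back(Node{op, a, b});
        int id = static_cast<int>(nodes_.size()) - 1;
        index_.emplace(key, id);
        return id;
    }

    // Call once per new variable assignment; invalidates every cached value in O(1).
    void newAssignment() { ++epoch_; }

    // Evaluates node `id`, reusing values already computed in this epoch.
    bool eval(int id, const std::vector<bool>& vars) {
        Node& n = nodes_[id];
        if (n.stamp == epoch_) return n.value;
        bool v = false;
        switch (n.op) {
            case Op::Const: v = (n.a != 0); break;
            case Op::Var:   v = vars[n.a]; break;
            case Op::Not:   v = !eval(n.a, vars); break;
            case Op::And:   v = eval(n.a, vars) && eval(n.b, vars); break;
            case Op::Or:    v = eval(n.a, vars) || eval(n.b, vars); break;
        }
        n.stamp = epoch_;
        n.value = v;
        return v;
    }

private:
    std::vector<Node> nodes_;
    std::map<std::tuple<Op, int, int>, int> index_;
    std::uint32_t epoch_ = 1;  // starts above the default stamp of 0
};
```

With all expressions built through one pool, a subexpression that appears in many of them is evaluated at most once per assignment. The sketch ignores epoch wrap-around and assumes `newAssignment()` is called whenever the variable values change.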

Also I would expect that a code generation approach will be faster than an interpretive approach working on parse trees or similar structures. It would probably be fairly straightforward to generate some C++ code, but considering the length of the functions, I don't know if a compiler like GCC will be able to optimize the CSEs.
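As a rough illustration of the code generation idea, the sketch below emits C++ source with one temporary per node of an already shared expression graph, so the common subexpressions are explicit in the generated text and do not depend on the compiler finding them. `GenNode` and `emitFunction` are hypothetical names, and the node list is assumed to be in topological order with the root last.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical flat node: op is 'v' (variable), 'c' (constant), '!', '&' or '|'.
// For 'v' and 'c', `a` is the variable index or constant value; otherwise
// `a` and `b` are indices of earlier nodes in the list.
struct GenNode {
    char op;
    int a;
    int b;
};

// Emits a C++ function that computes every node once into a temporary.
std::string emitFunction(const std::string& name, const std::vector<GenNode>& nodes) {
    std::ostringstream out;
    out << "bool " << name << "(const bool* v) {\n";
    if (nodes.empty()) {
        out << "  return false;\n}\n";
        return out.str();
    }
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const GenNode& n = nodes[i];
        out << "  const bool t" << i << " = ";
        switch (n.op) {
            case 'v': out << "v[" << n.a << "]"; break;
            case 'c': out << (n.a ? "true" : "false"); break;
            case '!': out << "!t" << n.a; break;
            default:  out << "t" << n.a << " " << n.op << " t" << n.b; break;
        }
        out << ";\n";
    }
    out << "  return t" << (nodes.size() - 1) << ";\n}\n";
    return out.str();
}
```

For the node list `v0, v1, t0 & t1, !t2, t2 | t3` this produces a five-line function body in which `t2` is computed once and used twice. The generated text would still have to be compiled and linked or loaded separately, and whether that beats an interpreter for a million expressions is something to measure rather than assume.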

I've seen that there are a few libraries available for expression evaluation, but in my work environment adding third-party libraries is not simple, and they all seem very complicated compared to my needs.

Lastly I've been looking at Antlr4 a bit recently, so that might be appropriate for me. In the past I've worked on C code generation, but I have no experience of using something like LLVM for optimisation and code generation.

Any suggestions for which way to go?

Paul Floyd

2 Answers


As far as I understand, your question is more about faster expression evaluation than it is about faster expression parsing, so my answer will focus on the former. Parsing, after all, should not be the bottleneck, as your expression language looks simple enough to implement a manually tuned parser for it.

So, to accelerate your evaluations, you can consider JIT execution of your formulas using LLVM. That is, given your formula F you can (relatively) easily generate corresponding LLVM IR and directly evaluate it. This SMT solver does just that. IR code generation is implemented in a single C++ class here. Note that the boolean expressions you mentioned are a subset of the SMT language supported by that solver. Additionally, you can easily adjust how aggressive the LLVM optimizer needs to be.

However, IR generation and optimization have their overhead. If a given formula is not evaluated often enough to amortize that initial cost, I would recommend direct interpretation instead. In that case, look for opportunities to exploit structural similarities and common subexpressions.
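One possible shape for such a direct interpreter, sketched under my own assumptions rather than taken from any particular library: flatten all expressions into a single instruction array in which operands always refer to earlier results, so shared subexpressions occupy one slot and evaluation is a plain loop without recursion. `Code`, `Instr` and `run` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical instruction set for the flattened expression graph.
enum class Code : std::uint8_t { Load, Not, And, Or };

struct Instr {
    Code code;
    std::uint32_t a;  // Load: variable index, otherwise index of an earlier slot
    std::uint32_t b;  // second operand slot for And/Or, unused otherwise
};

// Executes the program once. slots[i] receives the result (0 or 1) of
// instruction i; variable values in `vars` are assumed to be 0 or 1.
void run(const std::vector<Instr>& prog,
         const std::vector<std::uint8_t>& vars,
         std::vector<std::uint8_t>& slots) {
    slots.resize(prog.size());
    for (std::size_t i = 0; i < prog.size(); ++i) {
        const Instr& in = prog[i];
        switch (in.code) {
            case Code::Load: slots[i] = vars[in.a]; break;
            case Code::Not:  slots[i] = !slots[in.a]; break;
            case Code::And:  slots[i] = slots[in.a] & slots[in.b]; break;
            case Code::Or:   slots[i] = slots[in.a] | slots[in.b]; break;
        }
    }
}
```

Each expression then only needs to remember the slot index of its root. The trade-off is that every instruction runs on every pass, even when short-circuiting would have skipped it, so this pays off mainly when sharing between expressions is high.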

Codoka

As much as I'd like to suggest ANTLR4, I fear it won't meet your performance needs. There is a lot going on under the hood with its adaptive LL(*) algorithms, and though there are some common tricks to improve its performance, simply tracing an ANTLR4 interpreter at runtime suggests that it is likely slower than your current expression evaluator, unless that evaluator is very inefficient. ANTLR4 is an industrial-duty engine meant to support grammars far more complicated than yours. I use ANTLR when a LALR(1) DFA shift-reduce engine won't support my grammar, and take the performance hit in return for the extra parsing power of ANTLR4.

TomServo
  • The runtime of parsing and lexing shouldn't be an issue. It's a separate phase from the evaluation, which is done repeatedly, potentially with a runtime of several days. So a parse and lex phase of even several hours would be no problem. The evaluation speed is key. – Paul Floyd May 26 '17 at 05:28
  • Ah I see. Misunderstood your question then. Evaluation in ANTLR can be quite efficient. There is a grammar called Mu that I've used as the basis for a runtime expression interpreter. [link](https://github.com/bkiers/Mu/tree/master/src/main/antlr4/mu). – TomServo May 27 '17 at 01:45