I'm looking for a way to find out whether a specific rule in a BNF grammar can be converted to a regular expression.
(By "regular expression" (RE), I mean the simple mathematical kind. I'm not interested in BNF rules that can only be expressed using backreferences, lookarounds, or other advanced features.)
I'm only interested in cases where it is possible.
I know that this problem is generally undecidable, so I'm basically looking for tricks to do it anyway. Something semi-decidable would be nice.
My current approach is based on the idea that all non-recursive rules (rules that don't refer to themselves and don't contain rules that refer to themselves) can be easily converted to a RE. So "all I have to do" is to rewrite the recursive rules. Simple example:
S = a | b S
= b* a
T = a | T b T | T c T
= a | T (b|c) T
= a ( (b|c) a )*
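Rewrites like the two above are easy to get wrong by hand, so a brute-force sanity check can help. The following is only a sketch under my own assumptions: the dict-of-lists grammar encoding and the names `grammar_language`, `regex_language`, and `same_up_to` are made up for illustration. It enumerates every string up to a length bound that the rule derives and compares that set with what the candidate RE matches.

```python
import re
from itertools import product


def grammar_language(grammar, start, max_len):
    """All strings of length <= max_len derivable from `start`.

    `grammar` maps each nonterminal to a list of alternatives; each
    alternative is a list of symbols. Symbols that are not keys of
    `grammar` are treated as terminals (hypothetical encoding).
    """
    langs = {nt: set() for nt in grammar}
    changed = True
    while changed:  # iterate to a fixed point; all sets are finite
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                partial = {""}
                for sym in alt:
                    choices = langs[sym] if sym in grammar else {sym}
                    partial = {p + s for p in partial for s in choices
                               if len(p) + len(s) <= max_len}
                new = partial - langs[nt]
                if new:
                    langs[nt] |= new
                    changed = True
    return langs[start]


def regex_language(pattern, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len fully matched by `pattern`."""
    compiled = re.compile(pattern)
    words = ("".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n))
    return {w for w in words if compiled.fullmatch(w)}


def same_up_to(grammar, start, pattern, alphabet, max_len):
    """True if rule and RE agree on every string up to max_len."""
    return (grammar_language(grammar, start, max_len)
            == regex_language(pattern, alphabet, max_len))


S_RULE = {"S": [["a"], ["b", "S"]]}
T_RULE = {"T": [["a"], ["T", "b", "T"], ["T", "c", "T"]]}

print(same_up_to(S_RULE, "S", r"b*a", "ab", 6))
print(same_up_to(T_RULE, "T", r"a((b|c)a)*", "abc", 7))
```

A `True` result only means that no counterexample exists up to the chosen length. It is not a proof of equivalence, but a `False` result does show that a rewrite is wrong.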
However, this approach is limited by my ability to recognize patterns in a BNF AST and to simplify said AST. It's a very limited approach, so I'm looking for better ways.
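The mechanical part of the approach, deciding which rules are non-recursive and therefore directly convertible, could look roughly like this. It is a minimal sketch: the grammar encoding (nonterminal mapped to a list of alternatives, each a list of symbols) and the function names are assumptions of mine, not part of any existing tool.

```python
def classify_rules(grammar):
    """Split nonterminals into recursive ones and directly convertible ones.

    A rule is recursive if it can reach itself through references.
    A rule is directly convertible if it is not recursive and reaches
    no recursive rule either.
    """
    # Direct references: which nonterminals appear in each rule's body.
    refs = {nt: {sym for alt in alts for sym in alt if sym in grammar}
            for nt, alts in grammar.items()}

    def reachable(start):
        # Depth-first search over the reference graph.
        seen, stack = set(), list(refs[start])
        while stack:
            nt = stack.pop()
            if nt not in seen:
                seen.add(nt)
                stack.extend(refs[nt])
        return seen

    reach = {nt: reachable(nt) for nt in grammar}
    recursive = {nt for nt in grammar if nt in reach[nt]}
    convertible = {nt for nt in grammar
                   if nt not in recursive and not (reach[nt] & recursive)}
    return recursive, convertible


# Hypothetical example grammar: S is recursive, W contains S,
# and U and V are free of recursion.
EXAMPLE = {
    "S": [["a"], ["b", "S"]],
    "U": [["x", "V"]],
    "V": [["y"], ["z"]],
    "W": [["S", "x"]],
}

recursive, convertible = classify_rules(EXAMPLE)
print(sorted(recursive))
print(sorted(convertible))
```

Here `W` lands in neither set: it is not recursive itself, but it can only be converted once `S` has been rewritten.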
Here's an example of what the solution has to be able to handle:
S = a | c | S (b S)* c | S d S | S e S ( e S )*
The language of the above rule is regular. However, showing this is not easy and takes time.
Proof sketch:
S = a | c | S (b S)* c | S d S | S e S ( e S )*
= a | c | S (b S)* c | S d S | S e S
= a | c | S (b S)* c | S (d|e) S
= a | c | S c | S b S (b S)* c | S (d|e) S
For now, let's ignore the S b S (b S)* c alternative:
S' = a|c | S' c | S' (d|e) S'
= (a|c)c* ( (d|e) (a|c)c* )*
Back to the S b S (b S)* c alternative: It basically says that if the input contains a b, then somewhere after the b, there must be a (a|c)c. This is hard to express in an RE but is easy to do with an NFA.
Construct 2 NFAs x and y such that x = S' and y = S' (b S')* c. Whenever we are at a final state in x, transition via b to the initial state of y. Whenever we are at a final state in y, transition via epsilon to all final states of x. The final NFA will have both the initial and final states of x. The RE of the final NFA is: (a|c) ( c | (d|e)(a|c) | b(a|c) ( (b|d|e)(a|c) )* c )*
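To make the construction concrete, here is a sketch that writes out the glued NFA as a transition table and compares it with the final RE on all short strings. The state names (`x0`, `x1`, `y0`, `y1`, `y2`) and the particular small automata chosen for x and y are my own assumptions. Any NFAs for S' and S' (b S')* c would do.

```python
import re
from itertools import product

DELTA = {}  # (state, symbol) -> set of successor states


def add(src, symbols, dst):
    """Add a transition from src to dst for each symbol in `symbols`."""
    for sym in symbols:
        DELTA.setdefault((src, sym), set()).add(dst)


# NFA x for S' = (a|c)c* ( (d|e) (a|c)c* )*; x1 is final.
add("x0", "ac", "x1")
add("x1", "c", "x1")
add("x1", "de", "x0")

# NFA y for S' (b S')* c; y2 is final.
add("y0", "ac", "y1")
add("y1", "c", "y1")
add("y1", "bde", "y0")
add("y1", "c", "y2")

# Glue: final state of x goes to the initial state of y via b,
# final state of y goes back to the final state of x via epsilon.
add("x1", "b", "y0")
EPSILON = {"y2": {"x1"}}

START, FINAL = "x0", {"x1"}
FINAL_RE = re.compile(r"(a|c)(c|(d|e)(a|c)|b(a|c)((b|d|e)(a|c))*c)*")


def closure(states):
    """Epsilon closure of a set of states."""
    seen, stack = set(states), list(states)
    while stack:
        for nxt in EPSILON.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def nfa_accepts(word):
    """Simulate the glued NFA on `word` by tracking the set of active states."""
    current = closure({START})
    for ch in word:
        step = set()
        for state in current:
            step |= DELTA.get((state, ch), set())
        current = closure(step)
    return bool(current & FINAL)


def find_mismatches(max_len, alphabet="abcde"):
    """Strings up to max_len on which the NFA and the RE disagree."""
    words = ("".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n))
    return [w for w in words
            if nfa_accepts(w) != bool(FINAL_RE.fullmatch(w))]


print(find_mismatches(6))
```

An empty list means the NFA and the RE agree on every string over {a, b, c, d, e} up to the given length. This is a bounded check only, not a proof.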