
I'm looking for a way to find out whether a specific rule in a BNF grammar can be converted to a regular expression.

(By "regular expression" (RE), I mean the simple mathematical kind. I'm not interested in BNF rules that can only be expressed using backreferences, lookarounds, or other advanced features.)

I'm only interested in cases where it is possible.

I know that this problem is generally undecidable, so I'm basically looking for tricks to do it anyway. Something semi-decidable would be nice.


My current approach is based on the idea that all non-recursive rules (rules that don't refer to themselves and don't contain rules that refer to themselves) can easily be converted to an RE. So "all I have to do" is to rewrite the recursive rules. Simple example:

S = a | b S
  = b* a

T = a | T b T | T c T
  = a | T (b|c) T
  = a ( (b|c) a )*
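
The two rewrites above can be mechanised for a few fixed shapes. What follows is only a minimal sketch, not a general algorithm: it assumes a rule is given as a list of alternatives, each a list of symbols, and that every symbol other than the rule's own name is a terminal. The function names `alt_group` and `rewrite_self_recursive` are made up for this illustration. It handles `X = p X`, `X = X s` and `X = X m X` next to non-recursive base alternatives, and gives up on anything else.

```python
import re


def alt_group(sequences):
    """Regex for a choice between terminal sequences, e.g. [['b'], ['c']] -> (?:b|c)."""
    return "(?:" + "|".join("".join(re.escape(t) for t in seq) for seq in sequences) + ")"


def rewrite_self_recursive(name, alternatives):
    """Return a regex string for a directly recursive rule, or None if no pattern fits.

    Assumption of this sketch: all symbols except `name` are terminals.
    """
    base, prefixes, suffixes, infixes = [], [], [], []
    for alt in alternatives:
        if alt == [name]:
            continue                      # X = X adds nothing to the language
        count = alt.count(name)
        if count == 0:
            base.append(alt)              # X = terminals only
        elif count == 1 and alt[-1] == name:
            prefixes.append(alt[:-1])     # X = p X
        elif count == 1 and alt[0] == name:
            suffixes.append(alt[1:])      # X = X s
        elif count == 2 and alt[0] == name and alt[-1] == name:
            infixes.append(alt[1:-1])     # X = X m X
        else:
            return None                   # e.g. recursion in the middle: not handled
    if not base:
        return None                       # no base case, so nothing is derivable

    # Every string starts with any number of prefixes followed by one base string.
    head = (alt_group(prefixes) + "*" if prefixes else "") + alt_group(base)
    # After that: any mix of suffixes and "infix followed by another head".
    tails = [alt_group(suffixes)] if suffixes else []
    if infixes:
        tails.append(alt_group(infixes) + head)
    return head + ("(?:" + "|".join(tails) + ")*" if tails else "")


s_regex = rewrite_self_recursive("S", [["a"], ["b", "S"]])
t_regex = rewrite_self_recursive("T", [["a"], ["T", "b", "T"], ["T", "c", "T"]])
print(s_regex)  # (?:b)*(?:a)
print(t_regex)  # (?:a)(?:(?:b|c)(?:a))*
```

The result has the shape `p* base ( s | m p* base )*`, which reduces to `b* a` and `a ( (b|c) a )*` for the two rules above. A rule such as `S = a | ( S )` makes the function return `None`, which is the expected answer there because that language is not regular.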

However, this approach is limited by my ability to recognize patterns in a BNF AST and to simplify said AST. It's a very limited approach so I'm looking for better ways.
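
The first step of this approach, deciding which rules need rewriting at all, can be sketched as a reachability check. This assumes the grammar is a dict from rule name to a list of alternatives (lists of symbols); the example grammar and the names `classify`, `recursive` and `tainted` are hypothetical and not taken from any real grammar.

```python
def classify(grammar):
    """Split rules into 'recursive' and 'tainted' (recursive or depending on one).

    `grammar` maps a rule name to a list of alternatives; each alternative is a
    list of symbols. Symbols that are keys of `grammar` count as rule references.
    """
    # Direct references: rule -> rules mentioned on its right-hand sides.
    refs = {
        nt: {sym for alt in alts for sym in alt if sym in grammar}
        for nt, alts in grammar.items()
    }

    def reachable(start):
        """All rules reachable from `start` through one or more references."""
        seen, stack = set(), list(refs[start])
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(refs[cur])
        return seen

    reach = {nt: reachable(nt) for nt in grammar}
    recursive = {nt for nt in grammar if nt in reach[nt]}
    tainted = {nt for nt in grammar if nt in recursive or reach[nt] & recursive}
    return recursive, tainted


# Hypothetical grammar, for illustration only.
GRAMMAR = {
    "Digit": [["0"], ["1"]],
    "Pair": [["Digit", "Digit"]],
    "List": [["Pair"], ["Pair", ",", "List"]],
    "Block": [["{", "List", "}"]],
}

recursive, tainted = classify(GRAMMAR)
print(sorted(recursive))  # ['List']
print(sorted(tainted))    # ['Block', 'List']
```

Rules outside `tainted` can be inlined into an RE directly. Rules in `tainted` but not in `recursive` become convertible as soon as the recursive rules they depend on have been rewritten.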


Here's an example of what the solution has to be able to handle:

S = a | c | S (b S)* c | S d S | S e S ( e S )*

The language of the above rule is regular. However, showing this is not easy and takes time.

Proof sketch:

S = a | c | S (b S)* c | S d S | S e S ( e S )*
  = a | c | S (b S)* c | S d S | S e S
  = a | c | S (b S)* c | S (d|e) S
  = a | c | S c | S b S (b S)* c | S (d|e) S

For now, let's ignore the S b S (b S)* c alternative:

S' = a|c | S' c | S' (d|e) S' 
   = (a|c)c* ( (d|e) (a|c)c* )*

Back to the S b S (b S)* c alternative: It basically says that if the input contains a b, then somewhere after the b, there must be a (a|c)c. This is hard to express in RE but is easy to do with an NFA.

Construct 2 NFAs x and y such that x = S' and y = S' (b S')* c. Whenever we are at a final state in x, transition via b to the initial state of y. Whenever we are at a final state in y, transition via epsilon to all final states of x. The final NFA will have both the initial and final states of x. The RE of the final NFA is: (a|c) ( c | (d|e)(a|c) | b(a|c) ( (b|d|e)(a|c) )* c )*
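
Since a hand derivation like this is easy to get wrong, a brute-force comparison on short inputs is a cheap sanity check. The sketch below enumerates the grammar's language up to a length bound, simulates the combined NFA, and compares both with the final RE. It is not a proof, only a test on words up to `MAX_LEN`. The state names `x0`, `x1`, `y0`, `y1`, `y2` and the helper rules `Bs = (b S)*` and `Es = (e S)*` are assumptions introduced for this illustration.

```python
import re
from itertools import product


def bounded_language(grammar, start, max_len):
    """All strings of length <= max_len derivable from `start` (fixpoint iteration)."""
    lang = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                partial = {""}
                for sym in alt:
                    choices = lang[sym] if sym in grammar else {sym}
                    partial = {p + c for p in partial for c in choices
                               if len(p) + len(c) <= max_len}
                if not partial <= lang[nt]:
                    lang[nt] |= partial
                    changed = True
    return lang[start]


# S = a | c | S (b S)* c | S d S | S e S ( e S )*, with the stars expanded.
GRAMMAR = {
    "S": [["a"], ["c"], ["S", "Bs", "c"], ["S", "d", "S"], ["S", "e", "S", "Es"]],
    "Bs": [[], ["b", "S", "Bs"]],
    "Es": [[], ["e", "S", "Es"]],
}

# Combined NFA: x recognises S', y recognises S' (b S')* c.
DELTA = {
    ("x0", "a"): {"x1"}, ("x0", "c"): {"x1"},
    ("x1", "c"): {"x1"},
    ("x1", "d"): {"x0"}, ("x1", "e"): {"x0"},
    ("x1", "b"): {"y0"},                      # final state of x --b--> initial state of y
    ("y0", "a"): {"y1"}, ("y0", "c"): {"y1"},
    ("y1", "c"): {"y1", "y2"},                # c either continues S' or ends y
    ("y1", "b"): {"y0"}, ("y1", "d"): {"y0"}, ("y1", "e"): {"y0"},
}
EPSILON = {"y2": {"x1"}}                      # final state of y --eps--> final state of x


def closure(states):
    """Epsilon closure of a set of states."""
    result, stack = set(states), list(states)
    while stack:
        for nxt in EPSILON.get(stack.pop(), ()):
            if nxt not in result:
                result.add(nxt)
                stack.append(nxt)
    return result


def accepts(word):
    """Subset simulation of the NFA; x0 is initial, x1 is the only final state."""
    current = closure({"x0"})
    for ch in word:
        current = closure({t for s in current for t in DELTA.get((s, ch), ())})
    return "x1" in current


FINAL_RE = re.compile(r"(a|c)(c|(d|e)(a|c)|b(a|c)((b|d|e)(a|c))*c)*")
MAX_LEN = 6
words = ["".join(t) for n in range(MAX_LEN + 1) for t in product("abcde", repeat=n)]

by_grammar = bounded_language(GRAMMAR, "S", MAX_LEN)
by_nfa = {w for w in words if accepts(w)}
by_regex = {w for w in words if FINAL_RE.fullmatch(w)}
print(by_grammar == by_nfa == by_regex)
```

Agreement up to a small bound does not establish equivalence, but a disagreement immediately produces a concrete counterexample word, which is usually enough to find the faulty rewriting step.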

Michael Schmidt
  • I don't think you'll find anything which goes much further than recognising (possibly indirect) linear rules. Although it seems like a limited technique, it does cover a lot of real-world cases. That's effectively how Antlr handles left-recursion, for example, and the anecdotal evidence is that it works out pretty well. – rici Aug 31 '20 at 01:09
  • Unfortunately, a lot of the rules I deal with are non-linear but still regular. I can rewrite some of the non-linear rules to be linear but it's very limited. Until now, I converted the remaining rules by hand but the pen-and-paper approach is quite error-prone and time-consuming. – Michael Schmidt Aug 31 '20 at 12:51
  • A concrete example might be useful. Although it might only be to assuage my curiosity, because I don't know any good rewriting techniques. – rici Aug 31 '20 at 12:53
  • The most recent one: `S = a | c | S (b S)* c | S d S | S e S ( e S )*` – Michael Schmidt Aug 31 '20 at 14:21
  • That's not a regular language. I guess your original question was not really about regular languages, but rather about creating some kind of EBNF from a BNF source. That's no easier to solve, though it's certainly more applicable. One issue with the transformation is that it loses information from the original grammar. Simple example: `A = a | A a` and `A = a | a A` both correspond to `A = a+`. But the parse trees in the original grammar are left- and right-leaning, respectively. That indication is erased from the EBNF. But sometimes it matters. – rici Aug 31 '20 at 16:13
  • The language is regular. Proof sketch (part 1): `S = a | c | S (b S)* c | S d S | S e S ( e S )* = a | c | S (b S)* c | S d S | S e S = a | c | S (b S)* c | S (d|e) S = a | c | S c | S b S (b S)* c | S (d|e) S`; For now, let's ignore `S b S (b S)* c`: `S' = a|c | S' c | S' (d|e) S' = (a|c)c* ( (d|e) (a|c)c* )*`; Back to the `S b S (b S)* c` alternative: This basically says that if the input contains a `b`, then somewhere after the `b`, there must be a `(a|c)c`. This is hard to express in RE but is easy to do with an NFA. – Michael Schmidt Aug 31 '20 at 22:15
  • (part 2) Construct 2 NFAs x and y such that `x = S'` and `y = S' (b S')* c`. Whenever we are at a final state in x, transition via `b` to the initial state of y. Whenever we are at a final state in y, transition via epsilon to all final states of x. The final NFA will have both the initial and final states of x. The RE of the final NFA is: `(a|c) ( c | (d|e)(a|c) | b(a|c) ( (b|d|e)(a|c) )* c )*`. – Michael Schmidt Aug 31 '20 at 22:16
  • This is why I want an algorithm. It's not easy and the above example only took me like 30 minutes. – Michael Schmidt Aug 31 '20 at 22:17
  • Ok, it is regular. I understand why you want an algorithm :-) I suspect you'll find more theoreticians on [cs.se] or even [cstheory.se]; I'd suggest incorporating that example into the question in order to give context. If anything useful occurs to me, I'll post it. Good luck. – rici Sep 01 '20 at 00:26
  • Thank you for the suggestion. I will post it there as well. – Michael Schmidt Sep 01 '20 at 10:49
