How do I convert a PCRE to EBNF?

Question

I'm writing a spec thingamajig and I don't know EBNF. I have the following PCRE:

^(?:\$(?:\$|{\d+})|[^$])*$

Where, in the input:

$$ is an escaped $.
${num} is an argument number.
Everything else (that is not a $) is a literal.
A $ not followed by $ or {num} is an error.

And I need to convert it to EBNF. How do I convert this PCRE to EBNF?

(I noticed there are many questions about going from EBNF to PCRE, but I haven't seen any about going the other way around)

Do you intend to let the `.` in that regex match a `$`? (So that the string `$a$42` is acceptable?) (Because if so, that is *lexically* equivalent to `.*`) — rici, Jul 02 '16 at 00:49
@rici I could do that, or I could error. For this specific use-case, `.` should be `.`, not `[^$]`. — SoniEx2, Jul 02 '16 at 01:32
If you really mean `.`, then the syntax (but not the semantic structure) is precisely described by `.*` , which in EBNF would be something like `{ any character }`. But you probably actually want to describe the semantic parsing, in which case you will need to jump through some hoops to create an unambiguous grammar. Or you could just reject loose dollar signs as errors :) — rici, Jul 02 '16 at 02:25
@rici EBNF can be used to describe ASTs, which's what I'm after. `$$` is escaped `$`, `${number}` is argument number, everything else is literal. — SoniEx2, Jul 02 '16 at 02:30
Yes, sort of. Technically, you would need an unambiguous grammar, but it is easier to use an EBNF comment to indicate that `any character` would include only those instances of `$` which are not followed by `$` or `{`. Iirc there is a formal way of inserting contextual predicates in EBNF but I'll check the standard before answering. Which means I'll need to be at my computer, so it's not going to be immediate. — rici, Jul 02 '16 at 03:35
@rici I see. Thanks. Perhaps I should indeed consider it an error. — SoniEx2, Jul 02 '16 at 13:48

score 1 · Answer 1 · edited Oct 07 '21 at 11:21

Two things make answering this apparently simple question complicated:

1. The term "EBNF" has a large variety of manifestations. There is an ISO standard, ISO/IEC 14977:1996, for "Extended BNF", but as far as I know it is rarely used in practice. (Note: There is a free download link on that page; purchase is not necessary.) Many internet protocols use "Augmented BNF" as defined by RFC 5234, which is probably a better fit for your particular problem. And there are lots of parser generators which extend BNF in various ways, generally by adding regular-expression-like repetition and optionality operators, without being standardized in any way. (In fact, it was this chaos of possible definitions that motivated the ISO to produce a standard, but as is often the case with ISO standards, the lack of free text access -- until a decade after its release -- and of freely-available tools hindered adoption.)
2. Regular expressions do not necessarily produce unambiguous grammars, and the regular expression you provide is ambiguous, since `$` is allowed to be used as an ordinary character. The implication (and, I'm certain, the intention) is that a `$` may not be treated as a regular character if it is followed by another `$` or a number surrounded by braces, but the regular expression itself does not (and does not need to) make that distinction. Less obvious is what the intention might be for a string like:
```
${42
```
That looks like an error to me, but it would be accepted by the regex.
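The point about `${42` can be checked directly. The sketch below assumes the earlier form of the regex discussed in the comments, in which the catch-all alternative was `.` rather than `[^$]`, and compares it with the form shown in the question. The names `LOOSE`, `STRICT` and `accepts` are made up for illustration, the braces are escaped for clarity, and `[0-9]` stands in for `\d`. Python's `re` is used here in place of PCRE; the two behave the same way for these patterns as far as I know.

```python
import re

# Assumption: the earlier form of the regex, with "." as the catch-all
# alternative (anchors dropped because fullmatch() is used instead).
LOOSE = re.compile(r"(?:\$(?:\$|\{[0-9]+\})|.)*")
# The form shown in the question, with "[^$]" as the catch-all alternative.
STRICT = re.compile(r"(?:\$(?:\$|\{[0-9]+\})|[^$])*")


def accepts(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole of `text` is matched by `pattern`."""
    return pattern.fullmatch(text) is not None


# Print how each variant treats a few sample inputs.
for sample in ["$$", "${42}", "${42", "$a$42"]:
    print(f"{sample!r:10} loose={accepts(LOOSE, sample)} "
          f"strict={accepts(STRICT, sample)}")
```

With `.` as the catch-all, the unterminated `${42` and the loose `$` in `$a$42` both get through; with `[^$]` neither does.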

Anyway, here is an ISO EBNF for something similar to your language. Note that it does not accept the above string.

(* EBNF does not have wildcard characters and there is no way to 
   enumerate all possible characters, so I use the special-sequence
   mechanism ("? ... ?") to describe the set
 *)
any character
  = ? Any character representable by the source character encoding ? ;
decimal digit
  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';
literal sequence
  = {any character} - 
      ({any character}, ('$$' | '${'), {any character}) ;
escaped dollar
  = '$$' ;
parameter
  = '${', decimal digit, {decimal digit}, '}';
thingamajig
  = {literal sequence | escaped dollar | parameter} ;
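
For readers who want to see how the rules above map onto code, here is a minimal Python sketch of a left-to-right scanner. It follows the intended reading (a `$` followed by `$` or `{` is never part of a literal, so `${42` is rejected) rather than being a mechanical translation of the grammar. The function name `parse_thingamajig` and the token kinds are hypothetical, chosen to mirror the rule names.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, matched text)

# One alternative per grammar rule; the group names are illustrative only.
_TOKEN = re.compile(
    r"(?P<escaped_dollar>\$\$)"
    r"|(?P<parameter>\$\{[0-9]+\})"
    # A literal is any run of characters in which no "$" is followed
    # by "$" or "{" (a trailing "$" at end of input is allowed).
    r"|(?P<literal>(?:[^$]|\$(?![${]))+)"
)


def parse_thingamajig(text: str) -> List[Token]:
    """Split `text` into tokens, raising ValueError on a malformed "${"."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            # The only way to get here is "${" without digits and "}".
            raise ValueError(f"malformed parameter at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For example, `parse_thingamajig("a$$b${12}c$d")` yields a literal `a`, an escaped dollar, a literal `b`, the parameter `${12}`, and a literal `c$d`, while `parse_thingamajig("${42")` raises `ValueError`.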

On the whole, since you provide a mechanism for escaping dollar signs, it would probably be simpler to just ban the use of loose dollar signs. That makes both the specification and the parser simpler, and avoids the problem of non-canonical representations. (Non-canonical representations can be a security problem because round-tripping a string into an internal representation and back could then fail fingerprint validation, and because they allow for information leaking. Those may not be significant in this case, but in general best practice for data exchange protocols is to avoid non-canonical representations when possible.)
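
To make the non-canonical representation problem concrete, here is a small Python sketch. It assumes a parameter-free template for simplicity, and the helper names `is_strict` and `unescape_loose` are invented for this illustration. When loose dollar signs are allowed, two different spellings decode to the same literal text; under the strict rule only the escaped spelling is valid.

```python
import re

# Strict form: every "$" must begin "$$" or "${digits}".
STRICT_TEMPLATE = re.compile(r"(?:\$(?:\$|\{[0-9]+\})|[^$])*")


def is_strict(template: str) -> bool:
    """True if `template` contains no loose dollar signs."""
    return STRICT_TEMPLATE.fullmatch(template) is not None


def unescape_loose(template: str) -> str:
    """Literal text of a parameter-free template when loose "$" is allowed."""
    return template.replace("$$", "$")


# Two spellings, one meaning: this is the non-canonical representation.
assert unescape_loose("$5") == unescape_loose("$$5") == "$5"

# Under the strict rule only the escaped spelling is accepted.
assert is_strict("$$5")
assert not is_strict("$5")
```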
