How to deal with large alphabets in BNF?

Question

Given a language defined as:

Any pair of matching symbols is a valid string.

E.g. 00, 55, qq, YY

And a large alphabet of terminal symbols (let's say 4,294,967,296 of them)...

How would you define a BNF grammar to express the language? (Context-sensitive or otherwise.)

I'm specifically interested in learning if there's a way to do this without writing 4,294,967,296 rules: i.e. a grammar that's so large, it's lost all the benefits of being defined with BNF, as it has become a "brute force" set of valid literals.

score 2 · Accepted Answer · answered Aug 03 '20 at 16:29

Most uses of BNF are to describe context-free grammars.

You can certainly use the BNF notation for a non-context-free grammar; all you need to do is to put more than one symbol on the left-hand side. However, that's not generally very useful in practice, because non-context-free grammars do not provide intuitive descriptions of the structure of the language parsed, nor do they lead to an algorithm for parsing the language. And one would expect that any practical grammar formalism would either give human readers a good description, or allow the automatic generation of a parser, or both. (That doesn't make non-context-free grammars useless in the formal analysis of languages; in the mathematical theory, it is not necessary to please either the reader or the parser generator.)

But if we restrict ourselves to context-free grammars, we immediately hit a roadblock, because a context-free grammar cannot express duplication, such as { ωω | ω∈Σ^* }. Duplication is almost by definition not context-free, because context-free means that the expansion of a non-terminal cannot depend on the context in which the non-terminal appears. Hence, a rule which says "this non-terminal must have the same expansion as that non-terminal", which is required to express duplication, cannot be context-free.

Of course, the language { ωω | ω∈Σ }, which is what you are looking to describe, is context-free, but that's only because it is possible to enumerate all the possibilities (which must be a finite number, because we insist that the alphabet Σ be a finite set).
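To make the enumeration concrete, here is a minimal Python sketch that writes out the brute-force BNF for a tiny four-symbol alphabet. The non-terminal name `<pair>` and the function name are made up for illustration; the point is only that the grammar needs one alternative per symbol, so its size grows with |Σ|.

```python
def enumerate_pair_grammar(alphabet):
    """Build a BNF rule for { ss | s in alphabet } with one alternative per symbol."""
    # Each alternative is the same terminal written twice.
    alternatives = [f'"{s}" "{s}"' for s in alphabet]
    # Join the alternatives with '|', aligned under the first one.
    return "<pair> ::= " + "\n         | ".join(alternatives)


# Hypothetical four-symbol alphabet taken from the examples in the question.
grammar = enumerate_pair_grammar("05qY")
print(grammar)
```

With 2³² symbols the same function would emit 2³² alternatives, which is exactly the "brute force" grammar the question wants to avoid.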

So where does that leave you?

Basically, you are free to invent whatever formalism suits your purposes, as long as you clearly define its meaning for the reader. That formalism may or may not lead to the possibility of automated parser generation, but if that is not your goal, that fact is irrelevant. Most EBNF dialects -- and there are a lot of them, practically none of which can actually generate a parser without assistance -- allow some way of embedding descriptions written in natural language for syntaxes which are difficult or impossible to describe with a context-free grammar. If you look through EBNF examples, you are likely to find a large constellation of different ways of saying " is any element of the character set" without actually exhaustively listing the entire character set, which given the existence of Unicode would be a ridiculous undertaking. (Although Unicode only has 17*2¹⁶ codepoints, which is a lot less than 2³². But it's still more than a million.)
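As one illustration of such a home-grown formalism, the rule could be stated as "pair ::= s s, where s is any symbol in Σ", with the side condition checked in code rather than in the grammar. A rough Python sketch, assuming symbols are represented as integers in the range 0 to 2³² − 1 (that encoding, and the names `is_symbol` and `matches_pair`, are assumptions for the example, not part of any existing tool):

```python
ALPHABET_SIZE = 2**32  # assumed size of the alphabet, as in the question


def is_symbol(x):
    """'Any element of the alphabet', stated as a range test instead of a list."""
    return isinstance(x, int) and 0 <= x < ALPHABET_SIZE


def matches_pair(symbols):
    """Accept exactly the two-symbol sequences whose symbols are equal."""
    return (
        len(symbols) == 2
        and is_symbol(symbols[0])
        and symbols[0] == symbols[1]
    )


print(matches_pair([7, 7]))   # a matching pair
print(matches_pair([7, 8]))   # two different symbols
```

The membership test and the equality test are both things a context-free rule cannot say compactly, which is why they live in code here.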

Great answer, thanks rici. I shall write my own EBNF :) It would be great to know if a catalogue exists for all the popular EBNFs.. — Lawrence Wagerfield, Aug 03 '20 at 16:58
@LawrenceWagerfield: I don't think there is a "popular" EBNF. Every parser generator in existence has its own uniquely idiosyncratic (E)BNF, so you need to read the documentation for the parser generator to know how it works. (If you can't find the documentation for a parser generator, my advice is to use a different parser generator. Good software comes with good documentation.) The only EBNF formalism I know of which could be described as widely used is the one used by (some) RFCs, but different RFCs interpret its constructs in different ways, so that's not actually very helpful either. — rici, Aug 03 '20 at 19:46
The situation is somewhat analogous to "regular expressions". There is a mathematical definition for "regular expression" which is great, if you're a mathematician. In the world of practical software, however, one wants something with a few more concessions to usability. Even (f)lex and Posix EREs, which do not make much attempt to go beyond the formal language theory definition, allow "character classes" which are a simplified way of writing membership in large alphabetic sets. (That doesn't extend the power of a regex; it just makes it more usable. Unlike, say, Perl extensions.) — rici, Aug 03 '20 at 19:49
Then everyone casually assumes that the regex syntax they are most familiar with has some sort of official seal of approval. Well, there's the rub. There is no Global Authority of Programming Syntaxes. And if there were, we'd ignore their rulings anyway. So, a good programmer (1) learns how to be flexible with terminology, and (2) understands that a library is not finished until it is rigorously documented. — rici, Aug 03 '20 at 19:51
@LawrenceWagerfield: Also, someone just upvoted [this answer](https://cs.stackexchange.com/questions/122045/representing-but-not-in-formal-grammar/122052#122052) for some reason, jogging my memory. You might enjoy reading it as an overly complicated exploration of the formal meaning of what should be an absurdly simple phrase, used in a variety of slightly-different syntax formalisms for different languages. If nothing else, it demonstrates the absence of a full consensus, and the transaction costs created by that absence :-) — rici, Aug 03 '20 at 20:06
Thank you ever so much rici :) I am really looking for a library/EBNF that allows computation (in ANTLR these are called "actions" I believe), but also guarantees that for any single grammar, both a parser _and_ a generator can be generated. This is most likely asking for too much (and most likely I won't be able to muster it up myself!). — Lawrence Wagerfield, Aug 03 '20 at 21:38
@Lawrence: ANTLR is certainly an alternative, as long as you're prepared to use its style for specifying lexical analysis. But you're not going to find a way to express the condition "the same character repeated" unless you write some code to do it. (And I have a feeling that it won't be as simple as it should be, but I'm far from an ANTLR expert.) I could write it in Flex, but it would be ugly there, too :-) — rici, Aug 03 '20 at 23:32
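
For what it's worth, the "Perl extensions" mentioned in the comments do cover this particular case: a backreference expresses "the same character repeated" directly, at the cost of going beyond the formal definition of a regular expression. A small sketch using Python's `re` module, where one character of a Python string stands in for one alphabet symbol (the helper name `is_pair` is hypothetical):

```python
import re

# (.) captures any single character; \1 requires that same character again.
# re.DOTALL lets '.' match a newline too, so no character is excluded.
PAIR = re.compile(r"(.)\1", re.DOTALL)


def is_pair(text):
    """Return True if text is exactly one character written twice."""
    return PAIR.fullmatch(text) is not None


print(is_pair("qq"))
print(is_pair("qY"))
```

This only shows that the condition is easy once the formalism allows it; it says nothing about how a given parser generator would expose the same check.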
