
I am learning about context-free grammar and I would like to know how (if at all) it is possible to design a language that avoids repetition.

Let's take the select statement from SQL as an example:

possible: 
SELECT * FROM table
SELECT * FROM table WHERE x > 5
SELECT * FROM table WHERE x > 5 ORDER desc
SELECT * FROM table WHERE x > 5 ORDER desc LIMIT 5

impossible (multiple conflicting statements): 
SELECT * FROM table WHERE X > 5 WHERE X > 5

Grammar could look something like this:

S -> SW | SO | SL | "SELECT statement"
W -> "WHERE statement"
O -> "ORDER statement" 
L -> "LIMIT statement"

This grammar would allow for an impossible statement like the one mentioned above. How could I design a context-free grammar that avoids an impossible statement, while still being flexible?

Flexible:

The order of W, O, L does not matter. It also does not matter how many of these sub-statements are present. I would like to avoid a grammar that just lists all possible combinations since this would get quite messy if there are many possibilities.

User12547645

2 Answers


In a context-free grammar, the set of sentences generated by a non-terminal is the same for every use of that non-terminal. That's what context-free means. A particular non-terminal, S, cannot sometimes allow a match and other times disallow it. So every set of possible matches must have its own non-terminal, and in the case of restricting a list of k cases to sentences without repeated cases, a minimum of 2^k different non-terminals would be required, one for every subset of the k cases.
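To make the subset construction concrete, here is a minimal sketch that generates such a grammar for the W, O, L clauses from the question. The function name `subset_grammar` and the `R_...` naming scheme are invented for illustration; each non-terminal records which clauses have already been used, and the start rule would be something like S -> "SELECT statement" R_none.

```python
from itertools import combinations


def subset_grammar(clauses):
    """Return productions allowing each clause at most once, in any order.

    There is one non-terminal per subset of already-used clauses,
    so the result has 2**len(clauses) entries.
    """
    rules = {}
    for size in range(len(clauses) + 1):
        for used in combinations(clauses, size):
            name = "R_" + ("".join(used) or "none")
            # Every state may stop here (no further clauses).
            alternatives = ["epsilon"]
            for clause in clauses:
                if clause not in used:
                    # Successor state: the same subset plus the new clause,
                    # kept in canonical order so names are unique per subset.
                    successor = "".join(
                        c for c in clauses if c in used or c == clause
                    )
                    alternatives.append(f"{clause} R_{successor}")
            rules[name] = alternatives
    return rules


rules = subset_grammar(["W", "O", "L"])
for name, alternatives in rules.items():
    print(f"{name} -> {' | '.join(alternatives)}")
print(len(rules), "non-terminals")
```

With three clauses this yields 8 non-terminals, and each added clause doubles the count, which is why the approach stops being practical quickly.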

Worse, if the repetition you're trying to restrict has an unlimited number of possibilities (for example, you want to allow more than one W clause but not allow two identical Ws), then it cannot be done with a context-free grammar at all. The same is true if you want to insist on such repetition, which is basically what you would need to do to make a context-free grammar insist that variables be declared before use.

However, it is easy to do the check in a semantic action, for example by keeping a bit vector of clauses you have encountered (or a hash-set if it is not easy to enumerate the possible clauses). Then the semantic action for adding a clause to the statement only needs to check whether that particular clause has already been added, and flag an error if it has. That will also allow for better error messages since you can easily describe the problem when you detect it, as opposed to just reporting a "syntax" error and leaving the user to guess what the problem was.
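A rough sketch of that check, using a hash-set. It is not tied to any parser generator: `check_clauses`, `DuplicateClauseError` and `CLAUSE_KEYWORDS` are hypothetical names, and the input is assumed to be a list of whitespace-separated tokens in which the clause keywords never occur as identifiers. In a real parser the body of the loop would live in the semantic action that attaches a clause to the statement.

```python
class DuplicateClauseError(Exception):
    """Raised when the same clause keyword appears twice in one statement."""


# Assumption: these are the only clause keywords that must not repeat.
CLAUSE_KEYWORDS = {"WHERE", "ORDER", "LIMIT"}


def check_clauses(tokens):
    """Return the set of clause keywords seen; raise on the first repeat."""
    seen = set()
    for position, token in enumerate(tokens):
        keyword = token.upper()
        if keyword in CLAUSE_KEYWORDS:
            if keyword in seen:
                # A specific message instead of a generic syntax error.
                raise DuplicateClauseError(
                    f"duplicate {keyword} clause at token {position}"
                )
            seen.add(keyword)
    return seen


ok = "SELECT * FROM table WHERE x > 5 ORDER desc LIMIT 5".split()
print(sorted(check_clauses(ok)))

bad = "SELECT * FROM table WHERE X > 5 WHERE X > 5".split()
try:
    check_clauses(bad)
except DuplicateClauseError as error:
    print("rejected:", error)
```

The grammar itself can then stay as permissive as the one in the question, with the clauses accepted in any order and the duplicate check done outside the grammar.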

rici
  • Thank you very much for your answer. So, if I wanted to describe the above problem using a context-free grammar, the only option I have would be to use something like `S -> W | WO | WOL | WL | WLO | "SELECT statement"`, right? So I would have to list all combinations of statements. Of course just throwing a syntax error is the better option! Thank you very much for that hint. – User12547645 Apr 07 '19 at 09:46
  • Yes, the only solution with a pure CFG is to enumerate all valid possibilities. – rici Apr 07 '19 at 17:23
  • In general this isn't true though, right? I can write down CFGs that either don't allow or that require repetition in the strings. It might not be possible in specific cases, but my goodness, regular expressions can enforce or prohibit repetition to an extent. – Patrick87 Apr 08 '19 at 13:08
  • @patrick87: Yes, you're right. "To an extent" it's possible. So I rewrote the answer in an attempt to be more precise. I think when I wrote the original, I had assumed that OP was trying to avoid precise duplication. But the permutation problem also cannot be solved without context-sensitivity except by using an exponential number of non-terminals, which is impractical unless there are very few options. – rici Apr 08 '19 at 15:16

I am not sure I understand your problem based on the grammar. Perhaps you mean for statement and S to be the same symbol. If that's the case, I would argue that your grammar is simply not right for the language you intend to describe. If we ignore ORDER and LIMIT, then your grammar is

S -> SW | "SELECT S" | foo
W -> "WHERE S"

Then yes, you can derive nonsense like

S -> SW -> SWW -> SWWW -> "SELECT foo WHERE foo WHERE foo WHERE foo"

But this is just your first attempt at a grammar; it does not prove that no grammar works. Consider this:

<S> -> <A><B>
<A> -> SELECT <C>
<B> -> epsilon | WHERE <D>
<C> -> (rules for select lists)
<D> -> (rules for WHERE condition)

The rules for <C> and <D> can refer back to <S>, <A> and <B> where appropriate, perhaps using parentheses, as required to produce strings that work for you. You can no longer get the bad strings.
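As a quick sanity check, the grammar above can be turned into a small hand-written recognizer. This is only a sketch: `accepts` and `parse_operand` are made-up names, input is split on whitespace, and <C> and <D> are replaced by a stand-in that consumes one or more non-keyword tokens, since their real rules are not given here.

```python
KEYWORDS = {"SELECT", "WHERE"}


def parse_operand(tokens, i):
    """Stand-in for <C> and <D>: consume one or more non-keyword tokens."""
    start = i
    while i < len(tokens) and tokens[i] not in KEYWORDS:
        i += 1
    if i == start:
        raise SyntaxError(f"expected an operand at token {start}")
    return i


def accepts(text):
    """Return True if text matches <S> -> <A><B>."""
    tokens = text.split()
    try:
        # <A> -> SELECT <C>
        if not tokens or tokens[0] != "SELECT":
            raise SyntaxError("expected SELECT")
        i = parse_operand(tokens, 1)
        # <B> -> epsilon | WHERE <D>
        if i < len(tokens) and tokens[i] == "WHERE":
            i = parse_operand(tokens, i + 1)
        # Anything left over, such as a second WHERE, is an error.
        if i != len(tokens):
            raise SyntaxError(f"unexpected token {tokens[i]!r}")
    except SyntaxError:
        return False
    return True


print(accepts("SELECT foo"))
print(accepts("SELECT foo WHERE foo"))
print(accepts("SELECT foo WHERE foo WHERE foo"))
```

Because <B> is used exactly once in <S>, there is no derivation that produces a second WHERE at the top level. The price is that the clause order is fixed by the grammar.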

This is not really a problem that CFGs cannot overcome by themselves. To do things like enforce that only declared variables can be used, yes, context-sensitive or better machinery is needed, but we are just talking about repeating keywords and phrases. This is well within the bounds of what CFGs can do. Now, if you want to support aliases and enforce correct alias referencing in the query, that is impossible in context-free languages. But that's not what we're discussing here. The reason it's impossible is that the language L = {ww | w in E*} is not a context-free language, and that's essentially what is involved in enforcing variable names or table aliases.

Patrick87