Is it appropriate for a parser DCG to not be deterministic?

Question

I am writing a parser for a query engine. My parser DCG, `query`, is not deterministic.

I will be using the parser in a relational manner, to both check and synthesize queries.

In code:

If I want to be able to use query/2 both ways, does it require that

?- phrase(query, [q,u,e,r,y]).
true;
false.

or should I be able to obtain

?- phrase(query, [q,u,e,r,y]).
true.

nevertheless, given that the first snippet would require me to use it as such

?- bagof(X, phrase(query, [q,u,e,r,y]), [true]).
true.

when using it to check a formula?

Your `bagof/3` trick would check that you get exactly one result, in that it will fail if you get two. If you want to check that a parse succeeds at least once, `once(phrase(query, [q,u,e,r,y]))` would be less opaque. (I would be tempted to use `bagof(_, phrase(query, [q,u,e,r,y]), [_])` instead anyway, since the `true` might as well be `eels` and you'll get a singleton warning on the `X`.) — Daniel Lyons, Jul 31 '19 at 22:23
In practice, it is quite difficult to write code that efficiently parses and generates. Not impossible but just too much work? So basically, write a deterministic parsing code path and a deterministic generating code path. You won't get as many style points, that's true. — User9213, Aug 01 '19 at 03:32
Of interest: [This](https://swi-prolog.discourse.group/t/controlling-backtracking-in-a-dcg-grammar/938/14) talks about taking a typical DCG parser and enhancing it so that it can still parse and also work as a generator. A typical problem with generators is that because Prolog is [DFS](https://en.wikipedia.org/wiki/Depth-first_search) by default, with recursive DCG rules you run out of stack space before an answer is returned. So a guard needs to be introduced to limit the depth. Another way to solve the problem is to use `length/2` as noted in the referenced example. — Guy Coder, Aug 01 '19 at 13:18

Guy Coder · Accepted Answer · 2019-08-01T11:02:49.323

The first question to ask yourself is whether your grammar is deterministic or, in the terminology of grammars, unambiguous. This is not asking whether your DCG is deterministic, but whether the grammar is unambiguous. That can be answered with basic parsing concepts; no DCG is needed to answer it. In other words, is there only one way to parse a valid input? The standard book for this is "Compilers: Principles, Techniques, & Tools" (WorldCat).

Now you are actually asking about three different uses for parsing.

A recognizer.
A parser.
A generator.

If your grammar is unambiguous then

For a recognizer the answer should only be true for valid input that can be parsed and false for invalid input.
For the parser it should be deterministic as there is only one way to parse the input. The difference between a parser and a recognizer is that a recognizer only returns true or false and a parser will return something more, typically an abstract syntax tree.
For the generator, it should be non-deterministic so that it can generate multiple results on backtracking.

Can all of this be done with one DCG? Yes. The three different uses depend on how you use the input and output of the DCG.

Here is an example with a very simple grammar.

The grammar is just an infix binary expression with one operator and two possible operands. The operator is (+) and the operands are either (1) or (2).

expr(expr(Operand_1,Operator,Operand_2)) -->
    operand(Operand_1),
    operator(Operator),
    operand(Operand_2).

operand(operand(1)) --> "1".
operand(operand(2)) --> "2".

operator(operator(+)) --> "+".

recognizer(Input) :-
    string_codes(Input,Codes),
    DCG = expr(_),
    phrase(DCG,Codes,[]).

parser(Input,Ast) :-
    string_codes(Input,Codes),
    DCG = expr(Ast),
    phrase(DCG,Codes,[]).

generator(Generated) :-
    DCG = expr(_),
    phrase(DCG,Codes,[]),
    string_codes(Generated,Codes).

:- begin_tests(expr).

recognizer_test_case_success("1+1").
recognizer_test_case_success("1+2").
recognizer_test_case_success("2+1").
recognizer_test_case_success("2+2").

test(recognizer,[ forall(recognizer_test_case_success(Input)) ] ) :-
    recognizer(Input).

recognizer_test_case_fail("2+3").

test(recognizer,[ forall(recognizer_test_case_fail(Input)), fail ] ) :-
    recognizer(Input).

parser_test_case_success("1+1",expr(operand(1),operator(+),operand(1))).
parser_test_case_success("1+2",expr(operand(1),operator(+),operand(2))).
parser_test_case_success("2+1",expr(operand(2),operator(+),operand(1))).
parser_test_case_success("2+2",expr(operand(2),operator(+),operand(2))).

test(parser,[ forall(parser_test_case_success(Input,Expected_ast)) ] ) :-
    parser(Input,Ast),
    assertion( Ast == Expected_ast).

parser_test_case_fail("2+3").

test(parser,[ forall(parser_test_case_fail(Input)), fail ] ) :-
    parser(Input,_).

test(generator,all(Generated == ["1+1","1+2","2+1","2+2"]) ) :-
    generator(Generated).

:- end_tests(expr).

The grammar is unambiguous and has only 4 valid strings which are all unique.

The recognizer is deterministic and only returns true or false.
The parser is deterministic and returns a unique AST.
The generator is non-deterministic and returns all 4 valid unique strings.

Example run of the test cases.

?- run_tests.
% PL-Unit: expr ........... done
% All 11 tests passed
true.

To expand a little on the comment by Daniel

As Daniel notes

1 + 2 + 3

can be parsed as

(1 + 2) + 3

or

1 + (2 + 3)

So 1+2+3 is an example of input that, as you said, is specified by a recursive DCG and, as I noted, a common way out of the problem is to use parentheses to start a new context. Starting a new context is like getting a clean slate to start over again. If you are creating an AST, you just put the new context, the items between the parentheses, as a new subtree at the current node.
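
A minimal sketch of that idea, assuming SWI-Prolog 7 or later (where a double-quoted string in a DCG body is a literal) and a hypothetical grammar that is not part of the example above: a chain of `+` is folded to the left, and a parenthesized group is parsed as a fresh expression whose tree becomes a subtree.

```prolog
% Hypothetical grammar for illustration only.
% sum//1 parses Operand ("+" Operand)* and folds it to the left.
sum(Ast) --> operand(Left), sum_rest(Left, Ast).

% sum_rest(+Acc, -Ast): Acc is the tree built so far.
sum_rest(Acc, Ast) --> "+", operand(Right), sum_rest(Acc+Right, Ast).
sum_rest(Ast, Ast) --> [].

% An operand is a digit or a parenthesized sum (the "new context").
operand(N)   --> digit(N).
operand(Ast) --> "(", sum(Ast), ")".

digit(1) --> "1".
digit(2) --> "2".
digit(3) --> "3".

parse_sum(Input, Ast) :-
    string_codes(Input, Codes),
    phrase(sum(Ast), Codes).
```

With these clauses, `parse_sum("1+2+3", Ast)` should give `Ast = 1+2+3`, that is `(1+2)+3`, while `parse_sum("1+(2+3)", Ast)` gives `Ast = 1+(2+3)`. Each input has exactly one tree, although Prolog may still leave a choice point behind. Bounding the length first, as in `length(Codes, _), phrase(sum(Ast), Codes)`, is one way to run the same rules as a generator without running out of stack.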

With regard to write_canonical/1, this is also helpful, but be aware of the left and right associativity of operators. See Associative property.

e.g.

+ is left associative

?- write_canonical(1+2+3).
+(+(1,2),3)
true.

^ is right associative

?- write_canonical(2^3^4).
^(2,^(3,4))
true.

i.e.

2^3^4 = 2^(3^4) = 2^81 = 2417851639229258349412352

2^3^4 != (2^3)^4 = 8^4 = 4096
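
Another way to check the associativity, sketched under the assumption of the standard operator table (`+` declared as `yfx`, `^` as `xfy`; the predicate names are made up for this illustration), is to unify the term against a pattern or to ask the operator table directly:

```prolog
:- use_module(library(lists)).   % memberchk/2

% Hypothetical helpers, not part of any library.
% Where did the reader put the brackets in 1+2+3?
split_sum(Left, Right) :-
    1+2+3 = Left+Right.          % left associative: Left = 1+2, Right = 3

% And in 2^3^4?
split_power(Base, Exponent) :-
    2^3^4 = Base^Exponent.       % right associative: Base = 2, Exponent = 3^4

% The operator table records the same information.
operator_type(Op, Priority, Type) :-
    current_op(Priority, Type, Op),
    memberchk(Type, [xfx, xfy, yfx]).   % keep only infix declarations
```

Here `operator_type(+, P, T)` should report `yfx` (left associative) and `operator_type(^, P, T)` should report `xfy` (right associative).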

The point of this added info is to warn you that grammar design is full of hidden pitfalls. If you have not had a rigorous class in it and done some of it, you could easily create a grammar that looks great and works great, and then years later is found to have a serious problem. While Python was not ambiguous AFAIK, it did have grammar issues, enough of them that many were fixed when Python 3 was created. So Python 3 is not backward compatible with Python 2 (differences). Yes, they have made changes and libraries to make it easier to use Python 2 code with Python 3, but the point is that the grammar could have used a bit more analysis when designed.

Thanks for the nice answer. A question, though: for the grammar to be deterministic, must the set of valid inputs be bounded? Mine is unbounded because the input string has no length bounds and is specified by a recursive DCG. — Raoul, Aug 01 '19 at 00:04
The real word I should have used instead of `deterministic grammar` is `unambiguous grammar`, see [Ambiguous grammar](https://en.wikipedia.org/wiki/Ambiguous_grammar), but the idea is the same; there should be only one way to parse an input. So the question you are asking is, `is my grammar unambiguous?` I don't know and can't say. Also, checking a grammar for ambiguities takes a long time to do correctly. I have never done it for a large grammar, but after a while of writing parsers you tend to get a feeling for when something will be ambiguous. — Guy Coder, Aug 01 '19 at 00:16
What you said does not have me concerned because there are grammars that allow recursive expressions and are unambiguous, but many use parentheses to start a new context. Also, just because the grammar is ambiguous doesn't mean all is lost; there are also ways to solve the problem so that you wind up with a valid AST if that is the goal. C++ has this problem. Anyway, what you are starting to ask is getting beyond the scope of the original question and really would need an entire course just to answer without a specific grammar. — Guy Coder, Aug 01 '19 at 00:22
@Raoul What makes a grammar ambiguous is if there are multiple valid parsings of the same input. `1 + 2 + 3` can be parsed as (1 + 2) + 3 or 1 + (2 + 3), for instance. Prolog resolves the potential ambiguity by _always_ producing the first tree (look with `write_canonical/1` to see it). But note that `1 + 2 + 3` is _not_ ambiguous because ambiguity is a property of _the grammar_, not the parser, the input text, or the implementation of the grammar. It has nothing to do with the cardinality of the set. — Daniel Lyons, Aug 01 '19 at 04:32
Of interest: [Viral math problem baffles mathematicians, physicists](https://nypost.com/2019/08/01/viral-math-problem-baffles-mathematicians-physicists/) — Guy Coder, Aug 01 '19 at 12:20

User9213 · Answer 2 · 2019-08-01T05:10:39.737

The only reason why code should be non-deterministic is that your question has multiple answers. In that case, you'd of course want your query to have multiple solutions. Even then, however, you'd like it to not leave a choice point after the last solution, if at all possible.

Here is what I mean:

"What is the smaller of two numbers?"

min_a(A, B, B) :- B < A.
min_a(A, B, A) :- A =< B.

So now you ask, "what is the smaller of 1 and 2" and the answer you expect is "1":

?- min_a(1, 2, Min).
Min = 1.

?- min_a(2, 1, Min).
Min = 1 ; % crap...
false.

?- min_a(2, 1, 2).
false.

?- min_a(2, 1, 1).
true ; % crap...
false.

So that's not bad code but I think it's still crap. This is why, for the smaller of two numbers, you'd use something like the min() function in SWI-Prolog.
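
Two common ways to get rid of that choice point, sketched here with the made-up names `min_b/3` and `min_c/3` and assuming both inputs are numbers at call time: evaluate the `min` arithmetic function with `is/2`, or commit to one branch with if-then-else.

```prolog
% min_b/3: let the arithmetic function do the work; is/2 is deterministic.
min_b(A, B, Min) :-
    Min is min(A, B).

% min_c/3: if-then-else commits to one branch, so no choice point remains.
min_c(A, B, Min) :-
    (   A =< B
    ->  Min = A
    ;   Min = B
    ).
```

With either version, `?- min_b(2, 1, Min).` and `?- min_c(2, 1, 1).` should succeed without offering a further (failing) alternative.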

Similarly, say you want to ask, "What are the even numbers between 1 and 10"; you write the query:

?- between(1, 10, X), X rem 2 =:= 0.
X = 2 ;
X = 4 ;
X = 6 ;
X = 8 ;
X = 10.

... and that's fine, but if you then ask for the numbers that are multiples of 3, you get:

?- between(1, 10, X), X rem 3 =:= 0.
X = 3 ;
X = 6 ;
X = 9 ;
false. % crap...

The "low-hanging fruit" are the cases where you as a programmer would see that there cannot be non-determinism, but for some reason your Prolog is not able to deduce that from the code you wrote. In most cases, you can do something about it.
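
For the multiples of 3 above, one thing you can do is enumerate the multipliers instead of filtering. This is only a sketch: it assumes positive integers, it relies on `between/3` in SWI-Prolog leaving no choice point once it reaches its upper bound, and `multiple_of/4` is a made-up name.

```prolog
% multiple_of(+N, +Low, +High, -X): X is a multiple of N with Low =< X =< High.
% Assumes N, Low and High are positive integers.
multiple_of(N, Low, High, X) :-
    KLow  is (Low + N - 1) // N,    % smallest K with N*K >= Low
    KHigh is High // N,             % largest K with N*K =< High
    between(KLow, KHigh, K),
    X is N * K.
```

`?- multiple_of(3, 1, 10, X).` then yields 3, 6 and 9, and the last answer should come back without the trailing `false`.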

On to your actual question. If you can, write your code so that there is non-determinism only if there are multiple answers to the question you'll be asking. When you use a DCG for both parsing and generating, this sometimes means you end up with two code paths. It feels clumsy but it is easier to write, to read, to understand, and probably to make efficient. As a word of caution, take a look at this question. I can't know for sure, but the problems that OP is running into are almost certainly caused by unnecessary non-determinism. What probably happens with larger inputs is that a lot of choice points are left behind, a lot of memory cannot be reclaimed, a lot of processing time goes into bookkeeping, and huge solution trees are traversed only to get (as expected) no solutions... you get the point.
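
A sketch of what two code paths can look like for a natural number, assuming SWI-Prolog and hypothetical nonterminal names (`nat//1`, `nat_digits//1` and friends): the first clause generates when the number is known, the second parses when it is not.

```prolog
% nat(?N)//: natural number in decimal notation.
nat(N) -->                       % generating path: N is already known
    { integer(N), N >= 0 }, !,
    { number_codes(N, Codes) },
    Codes.
nat(N) -->                       % parsing path: N is still unbound
    { var(N) },
    nat_digits(Codes),
    { number_codes(N, Codes) }.

% One or more digit codes, greedy, no choice point left behind.
nat_digits([D|Ds]) --> nat_digit(D), nat_digits_rest(Ds).

nat_digits_rest([D|Ds]) --> nat_digit(D), !, nat_digits_rest(Ds).
nat_digits_rest([])     --> [].

nat_digit(D) --> [D], { nonvar(D), D >= 0'0, D =< 0'9 }.
```

Here `phrase(nat(N), Codes)` with `Codes` holding the codes of `42` binds `N = 42`, and `phrase(nat(42), Codes)` produces those codes again. The price for the cuts is that this version does not enumerate numbers when both arguments are unbound.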

For examples of what I mean, you can take a look at the implementation of library(dcg/basics) in SWI-Prolog. Pay attention to several things:

The documentation is very explicit about what is deterministic, what isn't, and how non-determinism is supposed to be useful to the client code;
The use of cuts, where necessary, to get rid of choice points that are useless;
The implementation of number//1 (towards the bottom) that can "generate extract a number".

(Hint: use the primitives in this library when you write your own parser!)
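
As a small illustration of that hint, here is a sketch that builds on `number//1` and `blanks//0` from `library(dcg/basics)`. The nonterminal `addition//1` and the wrapper `parse_addition/2` are made up for this example.

```prolog
:- use_module(library(dcg/basics)).

% addition(-Ast)//: two numbers around a "+", with optional white space.
addition(A+B) -->
    blanks, number(A), blanks,
    "+",
    blanks, number(B), blanks.

parse_addition(Input, Ast) :-
    string_codes(Input, Codes),
    phrase(addition(Ast), Codes).
```

`?- parse_addition("1 + 2", Ast).` should give `Ast = 1+2`, and incomplete input such as `"1 +"` simply fails.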

I hope you find this unnecessarily long answer useful.
