
There have been similar questions, like this or this, but mine is different:

Let's say I have a probabilistic context-free grammar (PCFG):

S --> A [1/2] | B [1/2]
A --> eps [p] | AA [q] | x [r]
B --> y [1]

where eps is the empty string and p + q + r = 1 (with q <= 1/2, so that the generation process terminates with probability 1).
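To see where the bound q <= 1/2 comes from, here is a minimal sketch that tracks only the number of pending A symbols. Each A is replaced by no A's with probability p + r = 1 - q and by two A's with probability q, so the termination probability t should be the smallest non-negative solution of t = (1 - q) + q t^2, which is min(1, (1 - q) / q). The function name is hypothetical, and q > 0 is assumed for the closed form.

```python
def termination_probability(q, iterations=10_000):
    """Smallest fixed point of t = (1 - q) + q * t**2, found by iterating from t = 0."""
    t = 0.0
    for _ in range(iterations):
        t = (1.0 - q) + q * t * t
    return t

# Compare the iteration with the closed form min(1, (1 - q) / q).
for q in (0.2, 0.4, 0.6, 0.8):
    closed_form = min(1.0, (1.0 - q) / q)
    print(q, termination_probability(q), closed_form)
```

For q <= 1/2 the value is 1; for q > 1/2 it drops to (1 - q) / q, so some probability mass is lost to derivations that never terminate.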

  1. What is the Chomsky normal form (CNF) of this PCFG?

I found it very difficult to get rid of the null production A --> eps while keeping the probability of generating every string intact. For example, the probability that A derives eps, a := P(eps) = (1 - sqrt(1 - 4 p q)) / (2 q), and the probability that A derives x, P(x) = r / (1 - 2 a q), must not change when passing to CNF(PCFG), and the same holds for all other strings.
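These two closed forms can be sanity-checked numerically. A minimal sketch, assuming a and b denote the probabilities that A derives eps and x respectively: they satisfy a = p + q a^2 (either A --> eps directly, or A --> AA with both sides deriving eps) and b = r + 2 q a b (either A --> x directly, or A --> AA with one side deriving eps and the other x). The function names and the example values p = 0.3, q = 0.3, r = 0.4 are illustrative only, and q > 0 is assumed.

```python
import math

def closed_forms(p, q, r):
    """Return (a, b) from the closed-form expressions; requires q > 0."""
    a = (1 - math.sqrt(1 - 4 * p * q)) / (2 * q)
    b = r / (1 - 2 * a * q)
    return a, b

def fixed_point(p, q, r, iterations=1000):
    """Iterate a = p + q*a^2 and b = r + 2*q*a*b, starting from zero."""
    a = b = 0.0
    for _ in range(iterations):
        a, b = p + q * a * a, r + 2 * q * a * b
    return a, b

p, q, r = 0.3, 0.3, 0.4
print(closed_forms(p, q, r))
print(fixed_point(p, q, r))
```

For these values both approaches agree on a of about 1/3 and b of about 1/2. Any candidate CNF(PCFG) would have to reproduce these numbers (scaled by the 1/2 from S --> A).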

  2. Is there a source for an algorithm that performs the conversion PCFG --> CNF(PCFG) for any PCFG (or a proof that this is not possible)?

Searching for this, I found numerous sources claiming that this is possible, but none of them gave a proof. Also, following the standard procedure for CNF(CFG) (where no probabilities are assigned to rules) does not work, or at least I do not see how it could be generalized to an arbitrary PCFG.


EDIT: This pdf (page 112) claims: "It turns out that every epsilon-free PCFG G has a corresponding binarized PCFG G′ that generates the same language as G."

The binarized form of a PCFG is slightly less restrictive than CNF. Again, the pdf provides no source or proof for this claim.

Epsilon-free means that there are no rules of the form X --> eps (which is not the case for the toy grammar above).

Antoine

1 Answer


The answer is much harder than one might think: this paper contains a 50-page proof of an algorithm that does this.

Antoine