There have been similar questions, like this or this, but mine is different:
Suppose I have the following probabilistic context-free grammar (PCFG):
S --> A [1/2] | B [1/2]
A --> eps [p] | AA [q] | x [r]
B --> y [1]
where eps is the empty string and p + q + r = 1 (with q <= 1/2, so that the generating process terminates with probability 1).
- What is Chomsky normal form for this PCFG?
I found it very difficult to get rid of the null production rule A --> eps while keeping the probability of generating each string intact. For example, for strings derived from A, a := P(eps) = (1 - sqrt(1 - 4 p q)) / (2 q) and P(x) = r / (1 - 2 a q) (as well as all the other string probabilities) should not change after converting to CNF(PCFG).
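To make the two quantities above concrete, here is a small numerical check. It is only a sketch under my reading of the formulas: both are taken as probabilities for derivations starting from A (from S they would be halved), with a satisfying a = p + q a^2 (either A --> eps directly, or A --> AA with both children deriving eps) and P(x) satisfying b = r + 2 q a b (either A --> x directly, or A --> AA with one child deriving x and the other eps). The function names and the sample values of p, q, r are made up for illustration.

```python
import math

def closed_form(p, q, r):
    """Closed-form values from the question, for derivations starting at A.
    Assumes q > 0 (the formula for a divides by q)."""
    a = (1 - math.sqrt(1 - 4 * p * q)) / (2 * q)  # P(A =>* eps)
    b = r / (1 - 2 * a * q)                       # P(A =>* x)
    return a, b

def fixed_point(p, q, r, iterations=10_000):
    """Iterate a = p + q*a^2 and b = r + 2*q*a*b starting from 0,
    which converges to the least fixed point of the two equations."""
    a = b = 0.0
    for _ in range(iterations):
        a, b = p + q * a * a, r + 2 * q * a * b
    return a, b

# Illustrative values only: p + q + r = 1 and q <= 1/2.
p, q, r = 0.3, 0.2, 0.5
a_cf, b_cf = closed_form(p, q, r)
a_fp, b_fp = fixed_point(p, q, r)
print(a_cf, a_fp)
print(b_cf, b_fp)
```

Any candidate CNF(PCFG) would have to reproduce these values (and the corresponding ones for every longer string), which is what makes a check like this useful as a sanity test.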
- Is there a source for an algorithm that converts any PCFG into CNF(PCFG), or a proof that this is not possible in general?
Searching for this, I found numerous sources claiming that it is possible, but none of them gave a proof. Also, following the procedure for CNF(CFG) (where no probabilities are assigned to rules) does not work, or at least I do not see how to generalize it to an arbitrary PCFG.
EDIT: This pdf (page 112) claims: "It turns out that every epsilon-free PCFG G has a corresponding binarized PCFG G′ that generates the same language as G."
The binarized form of a PCFG is slightly less restrictive than CNF. Again, the pdf provides no sources or proofs for this claim.
Epsilon-free means that there are no rules X --> eps (so the toy grammar above is not epsilon-free).