Here is the link to the original paper, Denoising Diffusion Implicit Models (DDIM), that I will be referring to: https://arxiv.org/abs/2010.02502
For a while, I have been stuck on the definition and derivations of the accelerated sampling process described in Appendix C.1 of the original DDIM paper (by Jiaming Song et al.). In particular, it is the way the "generative process" is defined in (55) that bugs me a lot. Here is the formula:
Definition of the accelerated generative process:

$$p_\theta(x_{0:T}) := \underbrace{p_\theta(x_T) \prod_{i=1}^{S} p_\theta^{(\tau_i)}(x_{\tau_{i-1}} \mid x_{\tau_i})}_{\text{"use to produce samples"}} \times \underbrace{\prod_{t \in \bar{\tau}} p_\theta^{(t)}(x_0 \mid x_t)}_{\text{"in variational objective"}} \tag{55}$$

where $\tau$ is an increasing sub-sequence of $[1, \dots, T]$ of length $S$ with $\tau_S = T$ and $\tau_0 := 0$, and $\bar{\tau} := \{1, \dots, T\} \setminus \{\tau_1, \dots, \tau_S\}$.
From my understanding, $p_{\theta}(x_{0:T})$ describes the joint distribution of the latents $x_{1:T}$ and the clean image $x_0$. The notation uses $q$ for the ground-truth density and $p$ for the predicted density. The "use to produce samples" factor is the non-Markovian chain used to accelerate the generation of $x_0$: it predicts the intermediate latents $x_{\tau_i}$ in sequence, going from $x_{\tau_S}$ down to $x_{\tau_1}$ and then to $x_0$, where $x_{\tau_S} = x_T$ (a sketch of this chain follows this paragraph). My question is: how does the "variational objective" part take care of the remaining latents $x_t$ for $t \in \bar{\tau}$? Read literally, that factor is "the density of the clean image conditioned on $x_t$ for $t \in \bar{\tau}$", not the density of the latents in $\bar{\tau}$ themselves. I do notice that the authors use the definition notation ":=", so they are free to formulate whatever best describes the generative process, and clearly the "use to produce samples" part is the key idea behind accelerated sampling. Still, I am looking for a better interpretation of the latter factor of the product formula.
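To make the "use to produce samples" chain concrete, here is a minimal NumPy sketch of the accelerated sampler (deterministic case, $\sigma = 0$) as I understand it; `eps_model`, `alpha`, and `tau` are placeholder names of my own, not from the paper:

```python
import numpy as np

def ddim_sample(eps_model, alpha, tau, shape, rng):
    """Accelerated DDIM sampling along a sub-sequence tau of [1..T].

    eps_model(x, t): predicted noise eps_theta^{(t)}(x_t)  (placeholder)
    alpha[t]:        cumulative-product alpha_t from the paper, alpha[0] = 1
    tau:             increasing sub-sequence of timesteps with tau[-1] = T
    """
    x = rng.standard_normal(shape)              # x_{tau_S} = x_T ~ N(0, I)
    steps = [0] + list(tau)                     # prepend tau_0 := 0
    # Walk the chain x_{tau_S} -> ... -> x_{tau_1} -> x_0.
    for i in range(len(steps) - 1, 0, -1):
        t, t_prev = steps[i], steps[i - 1]
        eps = eps_model(x, t)
        # Predicted clean image f_theta^{(t)}(x_t), cf. Eq. (9).
        x0_pred = (x - np.sqrt(1.0 - alpha[t]) * eps) / np.sqrt(alpha[t])
        # Deterministic (sigma = 0) update; since alpha[0] = 1, the last
        # step reduces to x = x0_pred, i.e. the final x_0.
        x = np.sqrt(alpha[t_prev]) * x0_pred + np.sqrt(1.0 - alpha[t_prev]) * eps
    return x
```

Note that the model is only ever evaluated at the $S$ timesteps in $\tau$; the latents $x_t$ for $t \in \bar{\tau}$ never appear during sampling, which is exactly why their role in (55) puzzles me.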
A further question arises regarding the variational objective derived in (58) and (59). Derivation of the variational objective, accelerated version:

$$J(\epsilon_\theta) = \mathbb{E}_{x_{0:T} \sim q_{\sigma,\tau}(x_{0:T})}\left[\log q_{\sigma,\tau}(x_{1:T} \mid x_0) - \log p_\theta(x_{0:T})\right] \tag{58}$$

$$\equiv \mathbb{E}\left[\sum_{t \in \bar{\tau}} D_{KL}\!\left(q_{\sigma,\tau}(x_t \mid x_0) \,\big\|\, p_\theta^{(t)}(x_0 \mid x_t)\right) + \sum_{i=1}^{S} D_{KL}\!\left(q_{\sigma,\tau}(x_{\tau_{i-1}} \mid x_{\tau_i}, x_0) \,\big\|\, p_\theta^{(\tau_i)}(x_{\tau_{i-1}} \mid x_{\tau_i})\right)\right] \tag{59}$$
In (59), the first KL term is effectively a divergence between two Gaussians: (1) the ground-truth density of a latent $x_t$ for $t \in \bar{\tau}$ (i.e., a latent not needed for generating $x_0$) given $x_0$, and (2) the model's predicted density of $x_0$ given that unimportant latent. This confused me when I tried to interpret it: it reads as if the model's prediction of the clean image should match noisy latents!
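To spell out where my confusion lies: if I take both factors at face value as isotropic Gaussians over the same $d$-dimensional space, with $q_{\sigma,\tau}(x_t \mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\,x_0, (1-\alpha_t) I)$ and $p_\theta^{(t)}(x_0 \mid x_t)$ a Gaussian $\mathcal{N}(f_\theta^{(t)}(x_t), \sigma_t^2 I)$ centered at the clean-image prediction (as the appendix defines it), then the standard closed-form Gaussian KL (my own expansion, not from the paper) gives

$$D_{KL} = \frac{1}{2}\left[\frac{d\,(1-\alpha_t)}{\sigma_t^2} + \frac{\big\|f_\theta^{(t)}(x_t) - \sqrt{\alpha_t}\,x_0\big\|^2}{\sigma_t^2} - d + d \log\frac{\sigma_t^2}{1-\alpha_t}\right],$$

so minimizing it would push the clean-image prediction $f_\theta^{(t)}(x_t)$ toward the noisy mean $\sqrt{\alpha_t}\,x_0$, which is exactly the mismatch I am asking about.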
I tried to compare the "unified inference process" described in (9) and (10),
Unified inference process:

$$f_\theta^{(t)}(x_t) := \left(x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)\right)/\sqrt{\alpha_t} \tag{9}$$

$$p_\theta^{(t)}(x_{t-1} \mid x_t) = \begin{cases} \mathcal{N}\!\left(f_\theta^{(1)}(x_1), \sigma_1^2 I\right) & \text{if } t = 1 \\ q_\sigma(x_{t-1} \mid x_t, f_\theta^{(t)}(x_t)) & \text{otherwise} \end{cases} \tag{10}$$
with the "accelerated generation process", which I have a problem with described above. The unified inference process leads to a variational objective that is almost identical to the one used in previous DDPM. In the DDIM paper, the authors introduce the $\sigma$ family for ground truths, yet it does not change the training process from that in DDPM. It turns out the model's parameters are independent to what $\sigma$ effectively does to the inference process.