Deep Generative Model: VAE and Diffusion


Problem Formulation

For a given set of images $x$, we want the generative model to find the underlying distribution $p(x)$ of all images. We then sample from $p(x)$ to get a new image $x^{\prime}$ that is unseen in the training dataset but conforms to the “overall nature” of the training images. So the whole generation process can be thought of as modeling a distribution and then sampling from it.

Conditional generation aims to map a distribution over latent features $p(z)$ to the image distribution $p(x)$. The model itself can be thought of as a function $p(x|z)$. Another question is how to measure the resemblance between distributions. One way is to use the KL divergence:

$$\mathcal{D}_{\mathrm{KL}}(p\Vert q) = \int_x p(x) \log\frac{p(x)}{q(x)}\,\mathrm{d}x$$

With some simple math, we can reduce the KL divergence to a “likelihood” term:

$$\begin{aligned} \mathcal{D}_{\mathrm{KL}} (p_{\mathrm{data}}\Vert p_{\theta}) &= \sum_x p_{\mathrm{data}}(x) \log\frac{p_{\mathrm{data}}(x)}{p_{\theta}(x)} \\ &= \sum_x -p_{\mathrm{data}}(x)\log{p_{\theta}(x)} + C \\ &= -\mathbb{E}_{x\sim p_{\mathrm{data}}}\log{p_\theta(x)} + C\end{aligned}$$

We want to minimize $\mathcal{D}_{\mathrm{KL}}(p_{\mathrm{data}}\Vert p_{\theta})$ over the parameter space $\theta$, which is equivalent to maximizing $\mathbb{E}_{x\sim p_{\mathrm{data}}}\log{p_\theta(x)}$. We can interpret this term as “maximum likelihood”, which has the intuitive explanation of maximizing the probability of observing the ground-truth data under the model distribution.
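In practice, the expectation over $p_{\mathrm{data}}$ is approximated by an average over the training set, so the objective is the familiar maximum-likelihood estimate:

$$\theta^* = \arg\max_\theta\ \mathbb{E}_{x\sim p_{\mathrm{data}}}\log p_\theta(x) \approx \arg\max_\theta\ \frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i)$$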

Variational Autoencoder

It is very hard to directly model the distribution of images, and even if we could model such a distribution directly with a neural network, we would have no idea how to sample from it. The variational autoencoder (VAE) models the problem by introducing a latent space. If we impose a fixed prior $p(z)$ on the latent space (e.g. a normal distribution) and somehow find the mapping from this latent distribution to the image distribution, we only need to sample from this simple distribution and map the sample to image space to get the final generated image. We can think of the mapping as a function to be optimized, $p_\theta(x|z)$.

Evidence Lower Bound

So now we can rephrase our objective function and see how we can optimize this mapping.

$$\begin{aligned} \log{p_\theta(x)} &= \int q(z)\log p_{\theta}(x)\,\mathrm{d}z &&\text{(since } \textstyle\int q(z)\,\mathrm{d}z = 1\text{)}\\ &= \int q(z)\log \frac{p_{\theta}(x|z)\, p(z)}{p_{\theta}(z|x)}\,\mathrm{d}z &&\text{(Bayes' rule)} \\ &= \int q(z)\log \left(\frac{p_{\theta}(x|z)\, p(z)}{p_{\theta}(z|x)}\cdot \frac{q(z)}{q(z)}\right)\mathrm{d}z \\ &= \mathbb{E}_{z\sim q(z)} \left[\log p_\theta(x|z)\right]-\mathcal{D}_{\mathrm{KL}}(q(z)\Vert p(z)) +\mathcal{D}_{\mathrm{KL}}(q(z)\Vert p_\theta(z|x)) \end{aligned}$$

Notice that $q(z)$ here is an arbitrary distribution over $z$, and $p(z)$ is the pre-determined latent prior. To make this expression more interpretable, we let $q(z)$ be a parameterized posterior $q_\phi(z|x)$ over the latent variable.

But why do we do this? It is because we must have some training objective. The VAE is a deterministic probability transition model (different from the diffusion method DDPM that we will introduce later), which means that a sample from the latent space must correspond to a sample in the image space. If we directly assigned $q(z)$ to be a Gaussian, we would have to manually assign the correspondence between Gaussian noise and generated images. Instead, we choose to model $q(z)$ as the posterior and let a neural network learn the correspondence itself.

So $\log{p_\theta(x)}$ can be expressed as:

$$\mathbb{E}_{z\sim q_\phi(z|x)} \left[\log p_\theta(x|z)\right]-\mathcal{D}_{\mathrm{KL}}(q_\phi(z|x)\Vert p(z)) +\mathcal{D}_{\mathrm{KL}}(q_\phi(z|x)\Vert p_\theta(z|x))$$

Notice that the third term measures the difference between the ground-truth posterior and the predicted posterior, which is intractable because we know nothing about the ground-truth posterior $p_\theta(z|x)$. But a KL divergence is always non-negative, so we instead optimize the following quantity, which is called the ELBO (Evidence Lower BOund):

$$\mathbb{E}_{z\sim q_\phi(z|x)}\left[\log\frac{p_\theta(x,z)}{q_\phi(z|x)}\right]= \mathbb{E}_{z\sim q_\phi(z|x)} \left[\log p_\theta(x|z)\right]-\mathcal{D}_{\mathrm{KL}}(q_\phi(z|x)\Vert p(z))$$
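To see why the two sides of this identity agree, expand the joint $p_\theta(x,z) = p_\theta(x|z)\,p(z)$ inside the expectation:

$$\mathbb{E}_{z\sim q_\phi(z|x)}\left[\log\frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)}\right] = \mathbb{E}_{z\sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \mathbb{E}_{z\sim q_\phi(z|x)}\left[\log\frac{p(z)}{q_\phi(z|x)}\right] = \mathbb{E}_{z\sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathcal{D}_{\mathrm{KL}}(q_\phi(z|x)\Vert p(z))$$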
You can actually derive the ELBO in a much simpler way:
$$\begin{aligned}\log p(x) &= \log \int p(x,z)\,\mathrm{d}z \\ &= \log \int \frac{p(x,z)}{q(z|x)}\,q(z|x)\,\mathrm{d}z \\ &= \log \mathbb{E}_{z\sim q(z|x)}\left[\frac{p(x,z)}{q(z|x)} \right] \\ &\geq \mathbb{E}_{z\sim q(z|x)}\left[\log\frac{p(x,z)}{q(z|x)} \right] \quad \text{(Jensen's inequality)}\end{aligned}$$

The three-term decomposition above is more interpretable, though.

The term we dropped, $\mathcal{D}_{\mathrm{KL}}(q_\phi(z|x)\Vert p_\theta(z|x))$, is exactly the gap between $\log p_\theta(x)$ and the ELBO: it tells us that we are compromising on the resemblance between the predicted posterior and the ground-truth posterior when we optimize the ELBO.

The remaining two terms are also very meaningful. The $\mathbb{E}_{z\sim q_\phi(z|x)} \left[\log p_\theta(x|z)\right]$ term is a reconstruction loss: maximizing it means maximizing the likelihood of the observed training data. The $\mathcal{D}_{\mathrm{KL}}(q_\phi(z|x)\Vert p(z))$ term is a regularization loss: minimizing it brings the predicted latent distribution closer to our pre-assigned normal latent prior. We will see in a later section why this is useful.

Architecture & Optimization

VAE designs a clever network structure that allows us to jointly optimize $\theta$ and $\phi$. The neural encoder outputs the mean and diagonal covariance of the latent normal, and the latent prior is usually chosen to be a standard multivariate Gaussian:

$$q_\phi(z|x) = \mathcal{N}(z;\mu_\phi(x), \sigma^2_\phi(x)\mathbf{I}), \qquad p(z) = \mathcal{N}(z;0, \mathbf{I})$$

Network design of the VAE. (Image source: https://mit-6s978.github.io/assets/pdfs/lec2_vae.pdf)

The objective can be easily turned into a training loss.

For the reconstruction loss $\mathbb{E}_{z\sim q_\phi(z|x)} \left[\log p_\theta(x|z)\right]$, we take its negative so that we can minimize it. First, we use one-step Monte Carlo sampling to get rid of the expectation. Then we model the predicted $p_\theta(x|z)$ as a Gaussian with fixed variance, $\mathcal{N}(x; x^{\prime},\sigma_0^2\mathbf{I})$. The loss term is then simply $\frac{1}{2\sigma_0^2} \Vert x - x^{\prime} \Vert^2 + C$, so we can substitute it with an L2 loss.

The regularization loss $\mathcal{D}_{\mathrm{KL}}(q_\phi(z|x)\Vert p(z))$ can be calculated in closed form (per latent dimension) with simple math:

$$\begin{aligned}&\quad\ \mathcal{D}_{\mathrm{KL}}(q_\phi(z|x)\Vert p(z)) \\ &= -\int_z \frac{(z - \mu)^2}{2\sigma^2}\, \mathcal{N}(\mu, \sigma^2) \,\mathrm{d}z + \int_z \frac{z^2}{2}\, \mathcal{N}(\mu, \sigma^2) \,\mathrm{d}z - \int_z \log \sigma\, \mathcal{N}(\mu, \sigma^2) \,\mathrm{d}z \\ &= -\frac{\mathbb{E}[(z - \mu)^2]}{2\sigma^2} + \frac{\mathbb{E}[z^2]}{2} - \log \sigma \\ &= \frac{1}{2} \left( -1 + \sigma^2 + \mu^2 - \log \sigma^2 \right)\end{aligned}$$
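As a concrete illustration, here is a minimal PyTorch-style sketch of the reparameterization trick and the two loss terms above. The names `mu`, `log_var`, and `x_recon` stand for the encoder and decoder outputs; they are placeholders for this sketch, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # z = mu + sigma * eps keeps the sampling step differentiable w.r.t. phi.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: with a fixed-variance Gaussian decoder,
    # -E_q[log p_theta(x|z)] reduces to an L2 loss up to a constant scale.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Regularization term: closed-form KL between N(mu, sigma^2 I) and N(0, I),
    # i.e. 0.5 * sum(-1 - log sigma^2 + sigma^2 + mu^2).
    kl = 0.5 * torch.sum(-1.0 - log_var + log_var.exp() + mu.pow(2))
    return recon + kl
```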

Diffusion Models

In the VAE, we choose to model the transition from $p(z)$ to $p(x)$ with a single neural network. But this transition is too hard for a single network to learn directly. Diffusion models adopt a more progressive transition. One famous interpretation of diffusion models is as a hierarchical VAE (HVAE).

We split the whole transition into many intermediate stages. At the same time, we fix the distribution of each intermediate state (to be precise, each intermediate state conditioned on the previous one). The HVAE we get from this assumption is DDPM.

DDPM

In essence, we want to model the process of “slowly” turning a Gaussian distribution $p(z)$ into an image distribution $p(x)$. But first, let us take a step back and consider how to turn an image distribution $p(x)$ into a Gaussian distribution $p(z)$.

We achieve this by gradually adding noise to the image until it becomes pure Gaussian noise, and we model this as a Markov process.

Noising and denoising process of DDPM. (Image source: Luo et al., 2022)

For simplicity and uniformity, we let $p(x)$ be $p(x_0)$ and the latent distribution $p(z)$ be $p(x_T)$. We define the transition function $q(x_t | x_{t-1})$ by:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1}, \qquad \epsilon_{t-1} \sim \mathcal{N}(0,\mathbf{I})$$

The reason we choose such strange coefficients is that they preserve the variance of the random variable.
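Concretely, if $x_{t-1}$ has unit variance and is independent of $\epsilon_{t-1}$, then:

$$\mathrm{Var}(x_t) = \alpha_t\,\mathrm{Var}(x_{t-1}) + (1-\alpha_t)\,\mathrm{Var}(\epsilon_{t-1}) = \alpha_t + (1-\alpha_t) = 1$$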

With simple math, we can easily get the distribution of each intermediate state conditioned on the initial image, $p(x_t|x_0)$:

$$x_t \sim \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$$
Expand to see all the math
Iteratively apply the probability transition function:

$$\begin{aligned}x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\ &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon_{t-2}\right) + \sqrt{1-\alpha_t}\,\epsilon_{t-1}\\ &= \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2}+\sqrt{1-\alpha_t \alpha_{t-1}}\,\epsilon \\ &\ \ \vdots\\ &= \sqrt{\prod_{i=1}^t \alpha_i}\,x_0 + \sqrt{1-\prod_{i=1}^t \alpha_i}\,\epsilon\\&=\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\end{aligned}$$

Notice that $\alpha_t$ is a pre-assigned coefficient, hence $\bar{\alpha}_t$ is also pre-assigned. Letting $\bar{\alpha}_T$ approach 0, we get the final Gaussian latent distribution $x_T \sim \mathcal{N}(0, \mathbf{I})$.
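As a minimal sketch of this forward (noising) process, assuming a simple linear schedule for the pre-assigned $\alpha_t$ (the schedule itself is a design choice, not prescribed by the derivation):

```python
import torch

T = 1000
# A hypothetical linear beta schedule; alpha_t = 1 - beta_t.
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0, t, eps):
    # Sample x_t ~ N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) in one step.
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
```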

Now that we know how to turn $p(x_0)$ into $p(x_T)$, we simply need to reverse the process to turn $p(x_T)$ back into $p(x_0)$. This is where Bayes' rule comes to the rescue. For each intermediate step:

$$p(x_{t-1}|x_t) = \frac{q(x_t|x_{t-1})\,p(x_{t-1})}{p(x_t)}$$

This equation is intractable, because we know neither $p(x_{t-1})$ nor $p(x_t)$. But we do know $p(x_{t-1}|x_0)$ and $p(x_t|x_0)$, so we can instead calculate:

$$p(x_{t-1}|x_t,x_0) = \frac{q(x_t|x_{t-1},x_0)\,p(x_{t-1}|x_0)}{p(x_t|x_0)}$$

We have closed-form expressions for all three terms on the right-hand side of the equation, so we can directly calculate the conditional posterior $p(x_{t-1}|x_t,x_0)$:

$$\mathcal{N}\left(\underbrace{\frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1}) \, x_t + \sqrt{\bar{\alpha}_{t-1}} (1 -\alpha_t) \, x_0}{1 - \bar{\alpha}_t}}_{\mu_{t-1}(x_t,x_0)},\ \underbrace{\frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \, \mathbf{I}}_{\Sigma_{t-1}}\right)$$
Expand to see all the math

This is just three normal distributions combined together; here we adopt an easier approach from this blog. The exponents of the three distributions combine into:

$$-\frac{1}{2}\left(\frac{\left\|x_t-\sqrt{\alpha_t}\,x_{t-1}\right\|^2}{1-\alpha_t} + \frac{\left\|x_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\,x_{0}\right\|^2}{1-\bar{\alpha}_{t-1}} - \frac{\left\|x_t-\sqrt{\bar{\alpha}_t}\,x_0\right\|^2}{1-\bar{\alpha}_t}\right)$$

We can see that this expression is clearly quadratic with respect to $x_{t-1}$, so the final distribution must be a normal distribution. The coefficient of the quadratic term is $-\frac{1}{2}\frac{1 - \bar{\alpha}_t}{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}$, so we know the variance of the distribution is:

$$\boldsymbol{\Sigma} = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \, \mathbf{I}$$

Writing the exponent as $-\frac{1}{2}\left(A\,x_{t-1}^2 - 2B\,x_{t-1} + \dots\right)$ with $A = \frac{1 - \bar{\alpha}_t}{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}$ and $B = \frac{\sqrt{\alpha_t}\, x_t}{1-\alpha_{t}} + \frac{\sqrt{\bar{\alpha}_{t-1}}\, x_0}{1-\bar{\alpha}_{t-1}}$, completing the square gives the mean $B/A$:

$$\boldsymbol{\mu} = \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1}) \, x_t + \sqrt{\bar{\alpha}_{t-1}} (1 -\alpha_t) \, x_0}{1 - \bar{\alpha}_t}$$

We obtained this distribution through rigorous math; by sampling from it and iteratively stepping backwards, we can accomplish the task!

But notice that what we get is $p(x_{t-1}|x_t,x_0)$, and during inference we do not know $x_0$ (it is exactly the image we want to generate). This is where a neural network comes to help: we take a giant leap and train a neural network that predicts $x_0$ from $x_t$.

Notice that $p(x_t | x_0)$ is a Gaussian, so we can express $x_0$ in terms of $x_t$ with a simple reparameterization.

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_0\right)$$

Previous work has empirically shown that predicting the noise $\epsilon_0$ works better than predicting $x_0$. We parameterize $\epsilon_0$ as $\epsilon_\theta(x_t, t)$, and the distribution $p(x_{t-1}|x_t,x_0)$ becomes:

$$\mathcal{N}\left(\underbrace{\frac{1}{\sqrt{\alpha_t}}x_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\epsilon_\theta(x_t, t)}_{\mu_{t-1}(x_t, t)},\ \underbrace{\frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \, \mathbf{I}}_{\Sigma_{t-1}}\right)$$

So the overall training and inference procedure is:

Pseudo-code for the training and sampling procedures. (Image source: Ho et al., 2020)
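Below is a minimal PyTorch-style sketch of the two procedures in the figure, reusing the `alpha_bars` schedule and `q_sample` from the earlier sketch and assuming a hypothetical noise-prediction network `eps_model(x_t, t)`:

```python
import torch

def train_step(eps_model, x0):
    # Training: pick a random t, noisify x0 in one step, regress the injected noise
    # (the "simplified" DDPM loss, i.e. the coefficients are dropped).
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return ((eps - eps_model(x_t, t)) ** 2).mean()

@torch.no_grad()
def ddpm_sample(eps_model, shape):
    # Sampling: start from pure noise and iteratively draw x_{t-1} ~ N(mu, Sigma).
    x = torch.randn(shape)
    for t in reversed(range(T)):
        a_t, a_bar_t = alphas[t], alpha_bars[t]
        eps_hat = eps_model(x, torch.full((shape[0],), t))
        # mu = (x_t - (1 - a_t) / sqrt(1 - a_bar_t) * eps_hat) / sqrt(a_t)
        mean = (x - (1 - a_t) / (1 - a_bar_t).sqrt() * eps_hat) / a_t.sqrt()
        if t > 0:
            a_bar_prev = alpha_bars[t - 1]
            var = (1 - a_t) * (1 - a_bar_prev) / (1 - a_bar_t)
            x = mean + var.sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x
```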

ELBO for HVAE

We know from the last section that we can approximate the posterior transition distribution with a neural network that predicts the original image $x_0$ from $x_t$ and $t$. But does this really help with our ultimate goal of maximizing the log-likelihood $\mathbb{E}_{x\sim p(x)}[\log{p_\theta(x)}]$? We can actually derive a closed-form ELBO under the assumptions we made. (Please refer to this paper for a detailed derivation.)

$$\log{p(\boldsymbol{x})} \geq \underbrace{\mathbb{E}_{q(\boldsymbol{x}_1|\boldsymbol{x}_0)}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_0|\boldsymbol{x}_1)]}_{\text{reconstruction term}}-\underbrace{D_{\mathrm{KL}}(q(\boldsymbol{x}_T|\boldsymbol{x}_0)\parallel p(\boldsymbol{x}_T))}_{\text{prior matching term}}-\sum_{t=2}^T\underbrace{\mathbb{E}_{q(\boldsymbol{x}_t|\boldsymbol{x}_0)}\left[D_{\mathrm{KL}}(p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t,\boldsymbol{x}_0)\parallel p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t))\right]}_{\text{denoising matching term}}$$

Each term of the ELBO has its own specific meaning.

  • The reconstruction term measures the expected likelihood of the reconstructed image. This term is optimized with a Monte Carlo estimate.
  • The prior matching term keeps the final noisified distribution $q(x_T|x_0)$ as close as possible to the latent prior $p(x_T)$. It contains no trainable parameters, and by carefully selecting the noising schedule $\alpha$, we can diminish this term to zero.
  • The denoising matching term is the primary part of the ELBO. It aims to align the predicted denoising distribution with the ground-truth conditional posterior as closely as possible.

We proved in the last section that $p(x_{t-1}|x_t,x_0)$ is a normal distribution. Now suppose $p_\theta(x_{t-1}|x_t)$ is a normal distribution $\mathcal{N}(\mu_\theta(x_t,t), \sigma_{t-1}^2\mathbf{I})$. Then we have:

$$\begin{aligned}&\underset{\theta}{\arg\min}\ D_{\mathrm{KL}}(p(x_{t-1}|x_t,x_0)\parallel p_{\theta}(x_{t-1}|x_t))\\ =\ &\underset{\theta}{\arg\min}\ \frac{1}{2\sigma_{t-1}^2}\frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t)\alpha_t}\left\|\epsilon_0-\epsilon_{\theta}(x_t,t)\right\|_2^2\end{aligned}$$
Expand to see all the math
$$\begin{aligned}&\underset{\theta}{\arg\min}\ D_{\mathrm{KL}}(p(x_{t-1}|x_t,x_0)\parallel p_{\theta}(x_{t-1}|x_t))\\ &=\underset{\theta}{\arg\min}\ D_{\mathrm{KL}}(\mathcal{N}(x_{t-1};\boldsymbol{\mu}_{t-1},\boldsymbol{\Sigma}_{t-1})\parallel\mathcal{N}(x_{t-1};\boldsymbol{\mu}_{\theta},\sigma_{t-1}^2\mathbf{I}))\\ &=\underset{\theta}{\arg\min}\ \frac{1}{2}\left[\log\frac{|\sigma_{t-1}^2\mathbf{I}|}{|\boldsymbol{\Sigma}_{t-1}|}-d+\mathrm{tr}\left((\sigma_{t-1}^2\mathbf{I})^{-1}\boldsymbol{\Sigma}_{t-1}\right)+(\boldsymbol{\mu}_{\theta}-\boldsymbol{\mu}_{t-1})^T(\sigma_{t-1}^2\mathbf{I})^{-1}(\boldsymbol{\mu}_{\theta}-\boldsymbol{\mu}_{t-1})\right]\\ &=\underset{\theta}{\arg\min}\ \frac{1}{2}\left[(\boldsymbol{\mu}_{\theta}-\boldsymbol{\mu}_{t-1})^T(\sigma_{t-1}^2\mathbf{I})^{-1}(\boldsymbol{\mu}_{\theta}-\boldsymbol{\mu}_{t-1})\right]+C\\ &=\underset{\theta}{\arg\min}\ \frac{1}{2\sigma_{t-1}^2}\left\|\boldsymbol{\mu}_{\theta}-\boldsymbol{\mu}_{t-1}\right\|_2^2\end{aligned}$$

Recall that we have shown:

$$\boldsymbol{\mu}_{t-1}(x_t, t) = \frac{1}{\sqrt{\alpha_t}}x_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\epsilon_0$$

We let $\boldsymbol{\mu}_\theta$ take the same form with $\epsilon_\theta(x_t,t)$ in place of $\epsilon_0$; substituting into the objective above, we get:

$$\underset{\theta}{\arg\min}\ \frac{1}{2\sigma_{t-1}^2}\frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t)\alpha_t}\left\|\epsilon_0-\epsilon_{\theta}(x_t,t)\right\|_2^2$$

This objective aligns with the loss function we chose in the last section (after discarding all the coefficients, which was empirically shown to work better in the original paper).

DDIM

Now let's talk about DDIM. Notice that during the training process, we never directly use the transition $p(x_t | x_{t-1})$. This means we never actually assume the forward (noise-adding) process to be Markovian (although we did derive most of the equations from a Markovian forward process). The only condition the forward process must satisfy is that the distribution of $x_t$ conditioned on $x_0$ is a normal distribution:

$$x_t \sim \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$$

We can also notice that during the backward (denoising) process, the only distribution we sample from is $p(x_{t-1}|x_t,x_0)$. Under the Markovian assumption, $p(x_{t-1}|x_t,x_0) = p(x_{t-1}|x_t)$, but this is not necessarily true for a non-Markovian forward process. We can still derive a family of posterior distributions $p_\sigma(x_{t-1}|x_t,x_0)$ indexed by a vector $\sigma$ of dimension $T$. (This can be proved with the method of undetermined coefficients.)

$$p_\sigma(x_{t-1} | x_t, x_0) = \mathcal{N} \left( \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{\sqrt{1 - \bar{\alpha}_t}},\ \sigma_t^2 \mathbf{I} \right)$$

Notice that if we let $\sigma_t^2 = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$, the forward process becomes Markovian and the posterior distribution becomes that of DDPM.

Following the same steps, we train a neural network to predict $x_0$ (equivalently, $\epsilon_0$) from $x_t$ and $t$. Plugging in the reparameterization of $x_0$, we get the following form of $p_\sigma(x_{t-1}|x_t,x_0)$ to sample from:

$$\mathcal{N} \left( \frac{\sqrt{\bar{\alpha}_{t-1}}}{\sqrt{\bar{\alpha}_{t}}}x_t + \left( \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} - \frac{\sqrt{\bar{\alpha}_{t-1}}\sqrt{1-\bar{\alpha}_{t}}}{\sqrt{\bar{\alpha}_{t}}} \right)\epsilon_0,\ \sigma_t^2 \mathbf{I}\right)$$

Another good property of DDIM is that it lets us drastically accelerate the inference process:

As the denoising objective $L_1$ does not depend on the specific forward procedure, as long as $q_\sigma(x_t|x_0)$ is fixed, we may also consider forward processes with lengths smaller than $T$.

Since any forward process with this conditional normal distribution property can be used in the training procedure, it certainly includes denoising processes with fewer time-steps. We can choose such a sub-sequence as our inference sequence to accelerate inference. (Notice that this property also applies to the DDPM inference algorithm, since DDPM is just an instance of DDIM.)

Specifically, given a sequence of intermediate states $(x_{\tau_1}, \dots, x_{\tau_S})$, where $\tau$ is a sub-sequence of the original time-steps $[1,\dots, T]$, the posterior transition distribution $p_\sigma(x_{\tau_{t-1}}| x_{\tau_t},x_0)$ can be modeled as:

$$\mathcal{N} \left( \frac{\sqrt{\bar{\alpha}_{\tau_{t-1}}}}{\sqrt{\bar{\alpha}_{\tau_t}}}x_{\tau_t} + \left( \sqrt{1-\bar{\alpha}_{\tau_{t-1}} - \tilde{\sigma}_{\tau_t}^2 } - \frac{\sqrt{\bar{\alpha}_{\tau_{t-1}}}\sqrt{1-\bar{\alpha}_{\tau_t}}}{\sqrt{\bar{\alpha}_{\tau_t}}} \right)\epsilon_0,\ \tilde{\sigma}_{\tau_t}^2 \mathbf{I}\right)$$

Notice that in DDIM, only $\bar{\alpha}_t$ is pre-assigned; every other quantity must be derived from $\bar{\alpha}_t$. For DDPM we have $\sigma_t^2 = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} = \frac{(1 - \bar{\alpha}_t/\bar{\alpha}_{t-1})(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$, so the variance $\tilde{\sigma}_{\tau_t}^2$ should take the form:

$$\tilde{\sigma}_{\tau_t}^2 = \frac{(1 - \bar{\alpha}_{\tau_t}/\bar{\alpha}_{\tau_{t-1}})(1 - \bar{\alpha}_{\tau_{t-1}})}{1 - \bar{\alpha}_{\tau_t}}$$
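As a minimal sketch, accelerated sampling over such a sub-sequence $\tau$ could look like the following, reusing the `alpha_bars` schedule and the hypothetical `eps_model` from the earlier sketches and using the DDPM-like variance $\tilde{\sigma}_{\tau_t}^2$ derived above:

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, taus):
    # taus: an increasing sub-sequence of [0, ..., T-1], e.g. list(range(0, T, 20)).
    x = torch.randn(shape)
    for i in reversed(range(1, len(taus))):
        t, t_prev = taus[i], taus[i - 1]
        a_bar, a_bar_prev = alpha_bars[t], alpha_bars[t_prev]
        eps_hat = eps_model(x, torch.full((shape[0],), t))
        # DDPM-like choice of the variance within the family p_sigma.
        sigma2 = (1 - a_bar / a_bar_prev) * (1 - a_bar_prev) / (1 - a_bar)
        # Predicted x0 via x0 = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar).
        x0_hat = (x - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
        # Mean of p_sigma(x_{tau_{t-1}} | x_{tau_t}, x0_hat), then add noise.
        mean = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev - sigma2).sqrt() * eps_hat
        x = mean + sigma2.sqrt() * torch.randn_like(x)
    # Final step: return the predicted x0 at the first kept time-step (a simplification).
    t = taus[0]
    a_bar = alpha_bars[t]
    eps_hat = eps_model(x, torch.full((shape[0],), t))
    return (x - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```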