Advanced Data Analysis

Nguyen, Mike

42 Generative Models

Imagine you have a folder of ten thousand photographs of faces and you want a computer to produce a brand new face that no one has ever seen, yet looks completely real. A classifier cannot do this. A classifier reads a photo and returns a label such as “smiling” or “not smiling”. It never has to understand what makes a face a face. To invent a new one, the computer needs a model of what faces look like in general, not just a rule for telling two kinds apart. That distinction, between a model that labels data and a model that can produce data, is the subject of this chapter.

Most of this book studies discriminative models. A discriminative model learns $p(y \mid x)$ directly: given an input $x$, it returns a label or a number. A generative model instead learns the data distribution itself, either the joint $p(x, y)$ or the marginal $p(x)$. Once you have a model of $p(x)$, you can do more than classify. You can draw new samples that look like the training data, score how likely a point is, fill in missing parts of an input, and compress data into a compact code.

Intuition

A discriminative model learns where to draw the line between categories. A generative model learns the whole landscape the data live on, so it can walk around and produce new points anywhere on that landscape.

This chapter covers three families that dominate modern generative modeling: variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models. For each one you will see the intuition first, then the math that makes it precise, then a sketch of the code, and finally what the result means. The three families reach the same goal, a model you can sample from, by very different routes, and the closing comparison lays out the trade-offs side by side. The runnable demonstration near the end uses base R and ggplot2 only, so every number in it reproduces without a deep learning backend.¹

42.1 Generative versus Discriminative

Before meeting the three model families, it is worth pinning down precisely what “generative” means and why it is a different job from the classification and regression that occupy much of this book. The cleanest way in is to write the joint distribution two ways:

\[ p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y). \]

A discriminative model targets the first factor $p(y \mid x)$ and ignores $p(x)$. Logistic regression, random forests, and the neural network classifiers from earlier chapters all sit here. They tend to predict well because they spend all of their capacity on the boundary between classes.

A generative model targets $p(x \mid y)$ and $p(y)$, or just $p(x)$ when there are no labels. Classification then comes from Bayes rule,

\[ p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}. \]

Naive Bayes (Chapter 18) and Gaussian discriminant analysis (Chapter 20) are classic examples. Modeling $p(x)$ is harder than modeling a boundary, because the model has to account for everything in the data, not just the part that separates classes. So why pay that price? Classification is only one of several payoffs. A model of $p(x)$ lets you:

sample new data, for example images, audio, or molecules,
detect outliers as points with low $p(x)$ (Chapter 87),
impute missing features,
learn a useful low dimensional representation,
augment a small training set with synthetic examples.

Note

None of the three families below require the labels $y$. They learn the shape of $p(x)$ from inputs alone, which is why they are usually trained on unlabeled data, of which the world has far more than labeled data.

The three families take different routes to a model you can sample from. We start with the one closest to ideas already in this book, the autoencoder.

42.2 Variational Autoencoders

The first family stays closest to familiar territory. A variational autoencoder (VAE) is a latent variable model: it assumes each observed point $x$ is generated from an unobserved, low dimensional code $z$ that we never see directly. The generative story is

\[ z \sim p(z) = \mathcal{N}(0, I), \qquad x \sim p_\theta(x \mid z), \]

so the marginal likelihood of a data point integrates over the latent variable,

\[ p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz . \]

Here $z$ is the latent code, a short vector that captures the essence of $x$ (say, a handful of numbers summarizing a face), and $p_\theta(x \mid z)$ is the decoder, the network with parameters $\theta$ that turns a code back into a data point.

This connects to the autoencoders chapter (Chapter 41). A plain autoencoder maps $x$ to a code and back with no probabilistic meaning. A VAE keeps the encoder and decoder shape but makes both stochastic, which lets us sample new $x$ by drawing $z \sim \mathcal{N}(0, I)$ and passing it through the decoder.

Key idea

A plain autoencoder compresses and reconstructs. A VAE turns that same architecture into a generator by insisting the codes follow a known distribution, the standard normal, so you can sample a fresh code and decode it into a new, plausible $x$.

42.2.1 The problem with the integral

There is a snag. To train the model we would like to maximize $p_\theta(x)$, but the integral over $z$ has no closed form for a neural network decoder, and the posterior $p_\theta(z \mid x) = p_\theta(x \mid z)\, p(z) / p_\theta(x)$ (which codes are likely to have produced a given $x$) is equally hard. The VAE works around both by introducing a second network, an encoder $q_\phi(z \mid x)$, that approximates the true posterior. The encoder outputs a mean and a variance,

\[ q_\phi(z \mid x) = \mathcal{N}\!\big(\mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\big). \]

42.2.2 The evidence lower bound

Because we cannot maximize $\log p_\theta(x)$ directly, we maximize a quantity that is guaranteed to sit just below it and is computable. Start from the log marginal likelihood and insert $q_\phi$:

\[ \log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]. \]

Jensen inequality moves the log inside the expectation and gives a lower bound, the evidence lower bound or ELBO:

\[ \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big) =: \mathcal{L}(\theta, \phi; x). \]

It is worth deriving the same bound a second way, by an exact identity rather than an inequality, because the algebra reveals precisely what is lost. For any $q_\phi(z \mid x)$ with the same support as the posterior, multiply and divide inside the log and split the expectation:

\[ \begin{aligned} \log p_\theta(x) &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x)\big] \\ &= \mathbb{E}_{q_\phi}\!\left[\log \frac{p_\theta(x, z)}{p_\theta(z \mid x)}\right] \\ &= \mathbb{E}_{q_\phi}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] + \mathbb{E}_{q_\phi}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] \\ &= \mathcal{L}(\theta, \phi; x) + D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big), \end{aligned} \tag{42.1}\]

where the first line uses that $\log p_\theta(x)$ does not depend on $z$, the second uses $p_\theta(x) = p_\theta(x, z) / p_\theta(z \mid x)$, and the last line recognizes $\mathbb{E}_{q_\phi}[\log p_\theta(x, z) - \log q_\phi] = \mathcal{L}$ once $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ is expanded. Equation Equation 42.1 is exact: the log evidence equals the ELBO plus a non negative KL term, so the ELBO is a lower bound, and the inequality is tight exactly when $q_\phi$ equals the true posterior.

The gap between $\log p_\theta(x)$ and the ELBO is exactly $D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x))$, which is non negative,² so maximizing $\mathcal{L}$ does two good things at once: it pushes the bound up and it makes the approximate posterior close to the true one.

Read the two terms as a loss. The first term is a reconstruction term: it rewards a decoder that assigns high probability to the actual $x$ given its code. The second term is a Kullback Leibler penalty that pulls the encoded distribution toward the prior $\mathcal{N}(0, I)$. The KL term acts as a regularizer. Without it the encoder could place every input in its own tiny region of latent space, which would reconstruct well but leave gaps that decode to nothing. The penalty keeps the latent space packed near the prior so that random draws decode into plausible samples.

Intuition

Picture the reconstruction term as “remember the data” and the KL term as “stay tidy”. Pull only on the first and the model memorizes inputs into scattered, unusable codes. Pull only on the second and every input collapses to the same blurry average. The ELBO balances the two.

When both $q_\phi(z \mid x)$ and $p(z)$ are Gaussian the KL term has a closed form, so there is nothing to estimate by sampling. For a $d$ dimensional diagonal Gaussian,

\[ D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\big) = \tfrac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big). \tag{42.2}\]

The closed form is worth deriving once, because it explains why each term has the sign it does. The KL divergence between two univariate Gaussians $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$ is

\[ D_{\mathrm{KL}} = \mathbb{E}_{q}\!\left[\log \frac{q(z)}{p(z)}\right] = \mathbb{E}_q\!\left[\log\frac{\sigma_2}{\sigma_1} - \frac{(z - \mu_1)^2}{2\sigma_1^2} + \frac{(z - \mu_2)^2}{2\sigma_2^2}\right]. \]

Under $q$ we have $\mathbb{E}_q[(z - \mu_1)^2] = \sigma_1^2$ and $\mathbb{E}_q[(z - \mu_2)^2] = \sigma_1^2 + (\mu_1 - \mu_2)^2$, so the expectation collapses to

\[ D_{\mathrm{KL}} = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac12 . \]

Setting the prior to the standard normal, $\mu_2 = 0$ and $\sigma_2 = 1$, and summing over the $d$ independent coordinates of the diagonal posterior gives Equation 42.2. The quadratic terms $\mu_j^2$ and $\sigma_j^2$ penalize codes that drift from the origin or spread too wide, while the $-\log \sigma_j^2$ term forbids the variance from collapsing to zero, which would make the posterior a point mass and the latent space unusable for sampling.

Posterior collapse

A characteristic VAE failure is posterior collapse, where for some latent coordinates the encoder ignores the input and returns $q_\phi(z_j \mid x) \approx p(z_j) = \mathcal{N}(0, 1)$, driving the KL term in Equation 42.2 to zero. The decoder then learns to reconstruct without using those coordinates, so they carry no information. This happens most often when the decoder is powerful enough to model $x$ on its own (for example an autoregressive decoder) and early in training when the reconstruction term is weak. Standard remedies are KL annealing (warm up the KL weight from zero), free bits (clamp each coordinate’s KL at a small floor $\lambda$ so the optimizer stops shrinking it once it is informative enough), or the $\beta$-VAE weighting below.

42.2.3 Amortized inference and the $\beta$-VAE

The encoder $q_\phi(z \mid x)$ deserves a second look. Classical variational inference fits a separate set of variational parameters $(\mu_i, \sigma_i)$ for every data point $x_i$, solving an optimization per example. The VAE instead trains a single network that maps any $x$ to its $(\mu_\phi(x), \sigma_\phi(x))$, sharing parameters across the whole dataset. This is amortized inference: the cost of inference is amortized over the dataset by paying once to train the encoder, after which a new point’s approximate posterior is a single forward pass. The price is the amortization gap, the extra slack between the ELBO achieved by the shared encoder and the best ELBO a per example optimizer could reach. Together with the approximation gap from $q_\phi$ being Gaussian rather than the true posterior, this is why the ELBO can sit well below $\log p_\theta(x)$.

The full training objective is the ELBO averaged over the data, and the gradient with respect to $\theta$ and $\phi$ is estimated by drawing a minibatch and one reparameterized $\epsilon$ per example. A widely used generalization scales the KL term by a constant $\beta$,

\[ \mathcal{L}_\beta(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big). \tag{42.3}\]

With $\beta = 1$ this is the exact ELBO. Setting $\beta > 1$ presses the aggregated posterior harder toward the factorized prior, which tends to produce more disentangled latent coordinates at the cost of reconstruction fidelity, while $\beta < 1$ favors sharper reconstructions and risks the scattered, gap riddled latent space the KL term is meant to prevent. The reconstruction term has its own implicit weight: with a Gaussian decoder of fixed variance $\sigma_x^2$, the negative log likelihood is $\|x - \hat x\|^2 / (2\sigma_x^2)$ up to a constant, so $\sigma_x^2$ acts as an inverse weight on reconstruction relative to the KL term. This is why a plain mean squared error decoder with no explicit variance implicitly fixes a particular trade-off.

42.2.4 The reparameterization trick

To train with gradient descent we need gradients of $\mathcal{L}$ with respect to the encoder parameters $\phi$. The trouble is that the expectation is taken over a distribution that depends on $\phi$, so we cannot just sample $z$ and backpropagate through the sampling step. The reparameterization trick rewrites the sample as a deterministic function of $\phi$ plus external noise,

\[ z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \]

where $\odot$ is elementwise multiplication. Now the randomness lives in $\epsilon$, which does not depend on $\phi$, and the path from $\phi$ to $z$ is differentiable. The Monte Carlo estimate of the ELBO using one or a few draws of $\epsilon$ then has a well defined gradient.

To see precisely why this matters, compare two ways of differentiating an expectation $\mathbb{E}_{q_\phi(z)}[f(z)]$ whose distribution depends on $\phi$. The score function or REINFORCE estimator uses the identity $\nabla_\phi q_\phi = q_\phi\, \nabla_\phi \log q_\phi$ to write

\[ \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{q_\phi}\!\big[f(z)\, \nabla_\phi \log q_\phi(z)\big], \]

which is unbiased and needs only $\log q_\phi$, not a differentiable $f$, but is notoriously high variance because $f(z)$ multiplies the score with no cancellation. The reparameterization estimator instead pushes the parameter inside a deterministic map $z = g_\phi(\epsilon)$ with $\epsilon \sim p(\epsilon)$ fixed, giving

\[ \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\big[\nabla_\phi f(g_\phi(\epsilon))\big] = \mathbb{E}_{p(\epsilon)}\!\big[f'(z)\, \nabla_\phi g_\phi(\epsilon)\big]. \]

The Jacobian $\nabla_\phi g_\phi$ couples the gradient of $f$ directly to the parameter, so the estimator exploits the smoothness of the decoder and typically has far lower variance, often by orders of magnitude. The cost is that it requires $f$ (here the integrand of the ELBO) to be differentiable in $z$, which holds for continuous data and a Gaussian $q_\phi$ but fails for discrete $z$, where one falls back on the score function estimator or a continuous relaxation such as the Gumbel-Softmax.

Key idea

You cannot differentiate through “draw a random sample”, but you can differentiate through “scale and shift a fixed random sample”. The reparameterization trick is exactly that move, and it is what makes the VAE trainable by ordinary backpropagation.

The next block sketches a VAE in Keras. It is shown for reference and is marked eval = FALSE because it needs the Keras/TensorFlow backend, which is not available in this build.

Show code

library(keras)

latent_dim <- 2

# Encoder outputs mean and log variance of q(z | x)
encoder_in <- layer_input(shape = 784)
h <- encoder_in %>% layer_dense(256, activation = "relu")
z_mean <- h %>% layer_dense(latent_dim)
z_log_var <- h %>% layer_dense(latent_dim)

# Reparameterization: z = mean + exp(log_var / 2) * epsilon
sampling <- function(args) {
    z_mean <- args[[1]]
    z_log_var <- args[[2]]
    eps <- k_random_normal(shape = k_shape(z_mean))
    z_mean + k_exp(0.5 * z_log_var) * eps
}
z <- list(z_mean, z_log_var) %>% layer_lambda(sampling)

# Decoder maps z back to x
decoder <- keras_model_sequential() %>%
    layer_dense(256, activation = "relu", input_shape = latent_dim) %>%
    layer_dense(784, activation = "sigmoid")
x_decoded <- decoder(z)

vae <- keras_model(encoder_in, x_decoded)

# Loss = reconstruction + KL regularizer
recon <- 784 * loss_binary_crossentropy(encoder_in, x_decoded)
kl <- -0.5 * k_sum(1 + z_log_var - k_square(z_mean) - k_exp(z_log_var), axis = -1L)
vae$add_loss(k_mean(recon + kl))
vae %>% compile(optimizer = "adam")

42.3 Generative Adversarial Networks

The VAE works hard to keep an explicit, if approximate, handle on the likelihood. A generative adversarial network (GAN) throws that away and asks a different question: instead of “how probable is this data point?”, it asks “can anyone tell my fakes from the real thing?” Think of an art forger and a detective. The forger (the generator) paints fakes; the detective (the discriminator) tries to spot them. Each makes the other better. When the forger finally wins, the fakes are indistinguishable from genuine work.

Concretely, two networks compete. A generator $G$ maps noise $z \sim p(z)$ to a fake sample $G(z)$. A discriminator $D$ takes a sample and outputs the probability that it is real. Training is a minimax game,

\[ \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]. \]

The discriminator wants to push $D(x)$ toward one on real data and toward zero on fakes. The generator wants the opposite, to make $D(G(z))$ close to one. The $\min_G \max_D$ notation reads literally: the inner $\max_D$ trains the best detective it can, and the outer $\min_G$ trains the forger to defeat that best detective.

Note

A GAN never writes down or evaluates $p(x)$. It is an implicit generative model: it can hand you samples but cannot tell you how likely any particular point is. That is the price of skipping the likelihood.

42.3.1 The optimal discriminator

It helps to ask what the perfect detective would do, holding the forger fixed. Fix the generator, so the distribution of fakes $p_g$ is fixed. Write the value function as a single integral over $x$,

\[ V(D) = \int_x \Big( p_{\mathrm{data}}(x) \log D(x) + p_g(x) \log\big(1 - D(x)\big) \Big)\, dx . \]

For each $x$ the integrand has the form $a \log d + b \log(1 - d)$, which is maximized at $d = a / (a + b)$. The optimal discriminator is therefore the density ratio

\[ D^*(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)} . \]

Substitute $D^*$ back into the value function. After collecting terms it becomes

\[ V(G, D^*) = -\log 4 + 2\, D_{\mathrm{JS}}\!\big(p_{\mathrm{data}} \,\|\, p_g\big), \tag{42.4}\]

The collecting of terms is short enough to show. Write $m = \tfrac12(p_{\mathrm{data}} + p_g)$ for the mixture. Substituting $D^* = p_{\mathrm{data}} / (2m)$ and $1 - D^* = p_g / (2m)$ gives

\[ \begin{aligned} V(G, D^*) &= \mathbb{E}_{p_{\mathrm{data}}}\!\left[\log \frac{p_{\mathrm{data}}}{2m}\right] + \mathbb{E}_{p_g}\!\left[\log \frac{p_g}{2m}\right] \\ &= \mathbb{E}_{p_{\mathrm{data}}}\!\left[\log \frac{p_{\mathrm{data}}}{m}\right] + \mathbb{E}_{p_g}\!\left[\log \frac{p_g}{m}\right] - 2\log 2 \\ &= D_{\mathrm{KL}}(p_{\mathrm{data}} \,\|\, m) + D_{\mathrm{KL}}(p_g \,\|\, m) - 2\log 2 \\ &= 2\, D_{\mathrm{JS}}(p_{\mathrm{data}} \,\|\, p_g) - \log 4, \end{aligned} \]

using $D_{\mathrm{JS}}(p \| q) = \tfrac12 D_{\mathrm{KL}}(p \| m) + \tfrac12 D_{\mathrm{KL}}(q \| m)$ in the last line. This is Equation 42.4. Because $D_{\mathrm{JS}} \ge 0$ with equality iff $p_g = p_{\mathrm{data}}$, the outer minimization over $G$ drives $p_g$ to the data distribution and the value to its floor $-\log 4$.

where $D_{\mathrm{JS}}$ is the Jensen Shannon divergence, a symmetric measure of how far two distributions are apart.³ Since the JS divergence is zero only when $p_g = p_{\mathrm{data}}$, the generator that wins the game matches the data distribution exactly, with the value reaching its minimum of $-\log 4$.

Key idea

Underneath the forger-versus-detective story, the GAN is secretly minimizing a divergence between the fake and real distributions. The adversarial game is just a clever, sample-only way to estimate and descend that divergence without ever computing a density.

The demonstration below evaluates $D^*(x)$ for two Gaussians so the density ratio result is visible as a concrete curve.

42.3.2 Mode collapse and instability

The clean theory above assumes both networks are trained to optimality at each step, which never happens in practice. Two failures are common, and knowing their names is half of diagnosing a misbehaving GAN.

The first is mode collapse: the generator discovers a few outputs that reliably fool the current discriminator and produces only those, ignoring whole regions of the data. The samples look fine one at a time but lack variety, like a forger who only knows how to paint one convincing painting.

The second is training instability: the two networks chase a moving target. If the discriminator gets too strong, the term $\log(1 - D(G(z)))$ saturates and the generator gradient vanishes, so the forger stops learning precisely when it is losing badly. A common fix is to have the generator maximize $\log D(G(z))$ instead, which gives stronger gradients early in training. Balancing the two networks, tuning learning rates, and choosing alternative losses such as the Wasserstein objective are all standard parts of getting a GAN to converge.

The saturation is easy to make precise. Let $u = D(G(z))$ be the discriminator’s probability that a fake is real, small early in training when fakes are obvious. The original generator loss $\log(1 - u)$ has derivative $-1/(1 - u)$, which is near $-1$ when $u \approx 0$: a tiny, nearly constant gradient exactly when the generator most needs a push. The non saturating surrogate maximizes $\log u$, whose derivative $1/u$ blows up as $u \to 0$, supplying a large gradient when the fakes are poor. Both losses share the same fixed point but have opposite gradient behavior in the regime that matters early on, which is why the non saturating form is the practical default.

42.3.3 Why JS divergence is a brittle objective

The JS divergence result Equation 42.4 is elegant but also diagnoses the deeper instability. When $p_{\mathrm{data}}$ and $p_g$ live on low dimensional manifolds in a high dimensional space, as natural images do, their supports are almost surely disjoint or overlap on a set of measure zero early in training. On disjoint supports a perfect discriminator exists, $D_{\mathrm{JS}}$ is pinned at its maximum $\log 2$ regardless of how far apart the two distributions are, and its gradient with respect to the generator is zero almost everywhere. The generator receives no signal about which direction would move $p_g$ toward $p_{\mathrm{data}}$. This is the manifold view of why a too strong discriminator kills learning.

The Wasserstein GAN replaces JS with the earth mover (Wasserstein-1) distance, which by Kantorovich-Rubinstein duality equals

\[ W(p_{\mathrm{data}}, p_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{p_{\mathrm{data}}}[f(x)] - \mathbb{E}_{p_g}[f(x)], \tag{42.5}\]

where the supremum is over all 1-Lipschitz functions $f$. Unlike JS, $W$ varies smoothly with the distance between $p_g$ and $p_{\mathrm{data}}$ even when their supports do not overlap, so it provides a usable gradient everywhere. In practice $f$ (now called a critic, with no sigmoid and a real valued output) is constrained to be Lipschitz by weight clipping or, more reliably, a gradient penalty that pushes $\|\nabla_x f\|$ toward one. The critic loss in Equation 42.5 also tracks sample quality, which the original GAN loss does not, so it partially answers the warning below.

Warning

Unlike the VAE loss, a GAN loss curve does not tell you much. The generator and discriminator losses can stay flat while sample quality improves or collapses. Judge a GAN by looking at its samples, not by watching the loss go down.

A minimal GAN training loop in Keras follows. It is marked eval = FALSE because it needs the Keras/TensorFlow backend.

Show code

library(keras)

generator <- keras_model_sequential() %>%
    layer_dense(128, activation = "relu", input_shape = 100) %>%
    layer_dense(784, activation = "tanh")

discriminator <- keras_model_sequential() %>%
    layer_dense(128, activation = "relu", input_shape = 784) %>%
    layer_dense(1, activation = "sigmoid")
discriminator %>% compile(optimizer = "adam", loss = "binary_crossentropy")

# Combined model trains the generator with the discriminator frozen
freeze_weights(discriminator)
gan_in <- layer_input(shape = 100)
gan_out <- gan_in %>% generator() %>% discriminator()
gan <- keras_model(gan_in, gan_out)
gan %>% compile(optimizer = "adam", loss = "binary_crossentropy")

# Each step: update D on real plus fake, then update G through the frozen D.

42.4 Diffusion Models

Here is a different idea entirely. Take a clear photo and add a tiny bit of static noise, then a little more, then more, until after many steps nothing is left but television snow. That destruction is easy and needs no learning. Now suppose you could run the tape backward, removing a little noise at each step. If you could learn that one denoising step, you could start from pure static and walk all the way back to a brand new clear photo. That is the whole idea of a diffusion model.

Diffusion models generate data by learning to reverse a gradual noising process. The idea goes back to Sohl-Dickstein et al. (2015) and was turned into a practical image generator by Ho et al. (2020). The forward process slowly destroys structure in the data by adding Gaussian noise over $T$ steps. The reverse process, learned by a network, removes noise one step at a time, turning pure noise back into a sample.

Intuition

Noising is trivial and requires no model. The model only has to learn the reverse, and because each reverse step undoes just a sliver of noise, it is a much easier task than producing a whole image in one shot.

42.4.1 Forward noising process

Fix a variance schedule $\beta_1, \dots, \beta_T$ with small positive values. The forward process is a Markov chain that adds a little noise at each step,

\[ q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(\sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\big). \]

The scaling by $\sqrt{1 - \beta_t}$ shrinks the previous value just enough that the variance stays controlled as noise accumulates, rather than growing without bound. After many steps $x_T$ is essentially standard Gaussian noise with all structure gone.

42.4.2 Closed form for any step

Simulating the chain step by step would be slow during training. A convenient property rescues us: you can jump from the clean data $x_0$ to any noised version $x_t$ in a single shot, without running the intermediate steps. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then

\[ q(x_t \mid x_0) = \mathcal{N}\!\big(\sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t) I\big), \]

which can be written with an explicit noise variable,

\[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \]

As $t$ grows, $\bar{\alpha}_t \to 0$, so the signal $\sqrt{\bar{\alpha}_t}\, x_0$ fades and the noise term takes over.

The closed form follows from composing two Gaussian steps. Write $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}$ with independent $\epsilon_{t-1} \sim \mathcal{N}(0, I)$. Substitute the same identity for $x_{t-1}$:

\[ x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})}\, \epsilon_{t-2} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}. \]

The last two terms are independent zero mean Gaussians, so they sum to a single Gaussian whose variance is the sum of variances, $\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$. Hence $x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar\epsilon$ with $\bar\epsilon \sim \mathcal{N}(0, I)$. Iterating this merging of Gaussians all the way down to $x_0$ replaces the product of step coefficients with $\bar\alpha_t = \prod_{s=1}^t \alpha_s$ and leaves total variance $1 - \bar\alpha_t$, which is exactly the stated closed form. The choice of scaling by $\sqrt{1 - \beta_t}$ rather than any other constant is what makes the noised variance telescope to $1 - \bar\alpha_t$ and keeps $\mathrm{Var}(x_t)$ bounded by one when the data are standardized.

Tip

This closed form is also what makes the runnable demonstration cheap. To show data dissolving at step $t$, we never simulate $t$ separate noising steps. We apply the one formula above once.

The simulation later in the chapter uses exactly this closed form to push two dimensional data toward noise.

42.4.3 Reverse denoising model

To generate, we run the chain backward. The reverse step is also modeled as a Gaussian,

\[ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(\mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\big), \]

with the mean produced by a network. Training the reverse process can be derived as a variational bound similar to the ELBO, but Ho et al. (2020) showed a simpler reparameterization. Rather than predicting the mean directly, the network predicts the noise $\epsilon$ that was added. Given the closed form above, a network $\epsilon_\theta(x_t, t)$ is trained to recover that noise, and the training objective collapses to a plain regression on the noise,

\[ \mathcal{L}_{\mathrm{simple}}(\theta) = \mathbb{E}_{x_0, \epsilon, t}\Big[\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\big)\big\|^2\Big]. \tag{42.6}\]

42.4.4 From the variational bound to the simple loss

The simple objective is not an ad hoc choice, it is a reweighted version of a genuine ELBO. The variational bound on the negative log likelihood factors over the chain into three groups of terms,

\[ -\log p_\theta(x_0) \le \mathbb{E}_q\Big[ \underbrace{D_{\mathrm{KL}}(q(x_T \mid x_0) \,\|\, p(x_T))}_{L_T} + \sum_{t > 1} \underbrace{D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))}_{L_{t-1}} \underbrace{- \log p_\theta(x_0 \mid x_1)}_{L_0} \Big], \tag{42.7}\]

obtained by the same multiply-and-divide trick used for the VAE, now applied to the joint over the whole trajectory $x_{1:T}$. The term $L_T$ has no learnable parameters (the forward process is fixed and $x_T$ is essentially the prior $\mathcal{N}(0, I)$), so it is a constant. Each $L_{t-1}$ compares the learned reverse step against the forward posterior $q(x_{t-1} \mid x_t, x_0)$, which, crucially, is Gaussian in closed form. By Bayes rule applied to the Gaussian forward steps,

\[ q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\big(\tilde\mu_t(x_t, x_0),\; \tilde\beta_t I\big), \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t, \]

with posterior mean

\[ \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\, x_t . \]

Because both distributions in $L_{t-1}$ are Gaussian with the same (fixed) covariance, the KL reduces to a scaled squared distance between their means, $L_{t-1} = \tfrac{1}{2\tilde\beta_t}\|\tilde\mu_t - \mu_\theta(x_t, t)\|^2 + \text{const}$. Now substitute the closed form $x_0 = (x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon)/\sqrt{\bar\alpha_t}$ into $\tilde\mu_t$ and parameterize the network’s mean in the matching form

\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\Big). \tag{42.8}\]

The two means then differ only through $\epsilon - \epsilon_\theta$, and the squared difference becomes

\[ L_{t-1} = \frac{\beta_t^2}{2\tilde\beta_t\,\alpha_t (1 - \bar\alpha_t)} \big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2 + \text{const}. \tag{42.9}\]

This is the per timestep noise prediction loss, weighted by a $t$ dependent factor. Ho et al. (2020) found that simply dropping that factor (setting it to one) and averaging over $t$ uniformly, which yields exactly Equation 42.6, trains better and is the form used in practice. The reweighting downweights the very small $t$ terms, which correspond to imperceptible high frequency detail, and lets the network spend capacity on the harder, more visually important denoising at larger $t$.

In words, the network is shown a noised point and the timestep it came from, and its only job is to guess the noise that was mixed in. Sampling then alternates a denoising step using $\epsilon_\theta$ with a small injection of fresh noise, from $t = T$ down to $t = 1$.

The sampling step is the reverse parameterization Equation 42.8 made concrete. Starting from $x_T \sim \mathcal{N}(0, I)$, each step computes

\[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\Big) + \sigma_t\, z, \qquad z \sim \mathcal{N}(0, I), \tag{42.10}\]

with $z = 0$ on the final step. The injected variance is typically $\sigma_t^2 = \tilde\beta_t$ or the simpler $\sigma_t^2 = \beta_t$; both are consistent with the bound and give similar samples. The fresh noise $\sigma_t z$ is essential: without it the chain would collapse to the conditional mean at every step and produce oversmoothed, low diversity samples, the diffusion analogue of the blurry VAE failure.

42.4.5 Connection to score matching

The noise prediction network is, up to scaling, an estimate of the score of the noised data, the gradient of the log density. From the closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$, the score of the marginal satisfies $\nabla_{x_t} \log q(x_t \mid x_0) = -\epsilon / \sqrt{1 - \bar\alpha_t}$, so a network trained to predict $\epsilon$ is equivalently learning $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar\alpha_t}\, \nabla_{x_t} \log q(x_t)$. This is the bridge to the score based view of Song and Ermon, in which sampling is Langevin dynamics that follows the learned score uphill in density, and to the continuous time formulation where the forward chain is a stochastic differential equation and sampling solves its reverse SDE (or a deterministic probability flow ODE). The ODE view is the basis of fast deterministic samplers such as DDIM, which reduce the hundreds of sequential steps of Equation 42.10 to a few dozen by taking larger, non Markovian strides.

Why sampling is the bottleneck

Training cost is one network evaluation per example per step, comparable to other deep models. The expense is at generation: producing a single sample requires $T$ sequential forward passes through $\epsilon_\theta$ (often $T = 1000$), each depending on the previous, with no parallelism across steps. This is the fundamental reason diffusion sampling is orders of magnitude slower than a VAE or GAN, which need one pass, and why distillation and ODE solvers that cut the step count are an active area.

Key idea

The forbidding-looking diffusion objective is, in the end, plain regression: predict the noise with mean squared error. That stability is a big part of why diffusion models train more reliably than GANs, which have no single well-behaved loss to descend.

A sketch of the noise prediction loss in Keras follows. It is marked eval = FALSE because it depends on the Keras backend.

Show code

library(keras)

# eps_model(x_t, t) predicts the noise added to reach x_t.
# Training: sample x0 from data, t uniform in 1..T, epsilon ~ N(0, I),
# form x_t = sqrt(abar[t]) * x0 + sqrt(1 - abar[t]) * epsilon,
# then minimize mean((epsilon - eps_model(x_t, t))^2).

train_step <- function(x0, abar, T) {
    t <- sample.int(T, nrow(x0), replace = TRUE)
    eps <- matrix(rnorm(length(x0)), nrow = nrow(x0))
    a <- sqrt(abar[t])
    b <- sqrt(1 - abar[t])
    x_t <- a * x0 + b * eps
    # loss = mean((eps - eps_model(x_t, t))^2); backprop with optimizer
}

42.5 Runnable Demonstration

The neural network versions above need a GPU and a deep learning backend, but two of the central mathematical facts can be checked exactly with base R and ggplot2 on a laptop. This section does two things. First it runs the forward diffusion process on a two dimensional Gaussian mixture and watches the data dissolve into noise, confirming the closed form from the diffusion section. Second it evaluates the optimal GAN discriminator for two one dimensional Gaussians and confirms the density ratio formula $D^*(x) = p_{\mathrm{data}} / (p_{\mathrm{data}} + p_g)$.

When to use this

Treat the demonstration as a sanity check on the theory, not as a recipe for generating images. It isolates the math, the noising schedule and the optimal discriminator, in a setting small enough to verify by eye and by number.

Show code

library(ggplot2)
set.seed(1)

42.5.1 Forward diffusion on a 2-D mixture

We sample a small mixture of three Gaussians, then apply the closed form $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ at a few timesteps. Figure 42.1 shows the result: the three clear modes at $t = 0$ dissolve toward a single noise cloud as $t$ increases.

Show code

# Sample a 2-D Gaussian mixture as the data distribution p(x0)
n <- 600
centers <- matrix(c(-3, 0, 3, 0, 0, 3), ncol = 2, byrow = TRUE)
comp <- sample(1:3, n, replace = TRUE)
x0 <- centers[comp, ] + matrix(rnorm(2 * n, sd = 0.35), ncol = 2)

# Linear variance schedule and cumulative product alpha_bar
T_steps <- 40
beta <- seq(1e-4, 0.06, length.out = T_steps)
alpha <- 1 - beta
alpha_bar <- cumprod(alpha)

# Apply the closed form at selected steps
steps_to_show <- c(0, 5, 15, 40)
make_frame <- function(t) {
    if (t == 0) {
        xt <- x0
    } else {
        ab <- alpha_bar[t]
        eps <- matrix(rnorm(2 * n), ncol = 2)
        xt <- sqrt(ab) * x0 + sqrt(1 - ab) * eps
    }
    data.frame(x = xt[, 1], y = xt[, 2], step = paste0("t = ", t))
}
frames <- do.call(rbind, lapply(steps_to_show, make_frame))
frames$step <- factor(frames$step, levels = paste0("t = ", steps_to_show))

ggplot(frames, aes(x, y)) +
    geom_point(alpha = 0.4, size = 0.8, color = "steelblue") +
    facet_wrap(~step, nrow = 1) +
    coord_equal() +
    labs(
        title = "Forward diffusion: structured data degrading to noise",
        x = NULL, y = NULL
    )

Figure 42.1: Forward diffusion applied to a two dimensional Gaussian mixture. The three modes visible at step zero degrade into standard Gaussian noise by step forty as the closed form mixes in progressively more noise.

At $t = 0$ the three modes are clear. By $t = 40$ the cumulative product $\bar\alpha_t$ is near zero, so the signal is gone and the points look like standard Gaussian noise.

To quantify the loss of structure, we report two summaries per step: the total variance of the cloud, and a Kullback Leibler divergence between the marginal of the first coordinate at step $t$ and a standard normal, estimated from histograms. Both should move toward the noise reference as $t$ grows. Table Table 42.1 reports these summaries at selected steps.

Show code

# KL(p || q) for two discrete histograms on a common grid
kl_hist <- function(samp, ref_density, breaks) {
    p <- hist(samp, breaks = breaks, plot = FALSE)$density
    mids <- head(breaks, -1) + diff(breaks) / 2
    q <- ref_density(mids)
    p <- p / sum(p)
    q <- q / sum(q)
    keep <- p > 0 & q > 0
    sum(p[keep] * log(p[keep] / q[keep]))
}

grid <- seq(-6, 6, length.out = 41)
ref <- function(m) dnorm(m)

summ <- lapply(0:T_steps, function(t) {
    if (t == 0) {
        xt <- x0
    } else {
        ab <- alpha_bar[t]
        eps <- matrix(rnorm(2 * n), ncol = 2)
        xt <- sqrt(ab) * x0 + sqrt(1 - ab) * eps
    }
    data.frame(
        step = t,
        total_var = round(sum(apply(xt, 2, var)), 3),
        kl_to_normal = round(kl_hist(xt[, 1], ref, grid), 3)
    )
})
diffusion_tab <- do.call(rbind, summ)
diffusion_tab <- diffusion_tab[diffusion_tab$step %in% c(0, 5, 10, 20, 30, 40), ]
knitr::kable(
    diffusion_tab,
    row.names = FALSE,
    col.names = c("Step t", "Total variance", "KL of x1 to N(0,1)"),
    caption = "Structure decays as the forward process adds noise."
)

Table 42.1: Structure decays as the forward process adds noise.

Step t	Total variance	KL of x1 to N(0,1)
0	8.290	2.509
5	8.172	2.423
10	7.810	2.166
20	6.642	1.402
30	5.283	0.810
40	3.857	0.358

The KL divergence to a standard normal falls steeply over the first several steps and approaches zero, confirming that the marginal becomes standard Gaussian as the chain runs.

42.5.2 Verifying the forward posterior mean

The derivation of the simple loss hinged on the forward posterior $q(x_{t-1} \mid x_t, x_0)$ being Gaussian with mean $\tilde\mu_t(x_t, x_0)$. We can confirm that mean numerically without any network. Fix $x_0$ and a step $t$, generate many pairs $(x_{t-1}, x_t)$ from the forward chain, then estimate $\mathbb{E}[x_{t-1} \mid x_t \approx x_t^\star, x_0]$ by averaging the $x_{t-1}$ whose $x_t$ landed near a target value, and compare against the closed form.

Show code

t <- 20
x0_scalar <- 2.0
at <- alpha[t]
abar_t <- alpha_bar[t]
abar_tm1 <- alpha_bar[t - 1]
bt <- beta[t]

# Closed-form forward posterior mean tilde_mu_t at a chosen x_t = xt_star
xt_star <- 1.0
mu_closed <- (sqrt(abar_tm1) * bt) / (1 - abar_t) * x0_scalar +
    (sqrt(at) * (1 - abar_tm1)) / (1 - abar_t) * xt_star

# Monte Carlo: simulate the two-step forward chain from x0, bin on x_t
N <- 4e6
xtm1 <- sqrt(abar_tm1) * x0_scalar + sqrt(1 - abar_tm1) * rnorm(N)
xt <- sqrt(at) * xtm1 + sqrt(bt) * rnorm(N)
near <- abs(xt - xt_star) < 0.02
mu_mc <- mean(xtm1[near])

knitr::kable(
    data.frame(
        quantity = c("closed-form tilde_mu_t", "Monte Carlo estimate"),
        value = round(c(mu_closed, mu_mc), 4)
    ),
    col.names = c("Quantity", "Value"),
    caption = "Closed-form forward posterior mean matches the simulated conditional mean."
)

Closed-form forward posterior mean matches the simulated conditional mean.
Quantity	Value
closed-form tilde_mu_t	1.0988
Monte Carlo estimate	1.0975

The two values agree to within Monte Carlo error, confirming the Gaussian posterior mean that the variational bound relies on.

42.5.3 Optimal GAN discriminator as a density ratio

Take real data as $\mathcal{N}(1, 1)$ and the generator distribution as $\mathcal{N}(-1, 1.5^2)$. The theory says the best discriminator is $D^*(x) = p_{\mathrm{data}}(x) / (p_{\mathrm{data}}(x) + p_g(x))$. We compute that curve directly and also fit a discriminator by logistic regression on real versus fake samples to check that the trained classifier recovers the same shape. Figure 42.2 overlays the closed form curve and the fitted classifier.

Show code

# True densities
p_data <- function(x) dnorm(x, mean = 1, sd = 1.0)
p_g <- function(x) dnorm(x, mean = -1, sd = 1.5)

xs <- seq(-7, 7, length.out = 400)
d_star <- p_data(xs) / (p_data(xs) + p_g(xs))

# Empirical check: label real = 1, fake = 0, fit logistic regression
m <- 4000
real <- rnorm(m, mean = 1, sd = 1.0)
fake <- rnorm(m, mean = -1, sd = 1.5)
df <- data.frame(
    x = c(real, fake),
    label = c(rep(1, m), rep(0, m))
)
fit <- glm(label ~ poly(x, 4), data = df, family = binomial())
d_fit <- predict(fit, newdata = data.frame(x = xs), type = "response")

plot_df <- rbind(
    data.frame(x = xs, D = d_star, source = "optimal D*(x)"),
    data.frame(x = xs, D = d_fit, source = "fitted logistic D")
)

ggplot(plot_df, aes(x, D, color = source, linetype = source)) +
    geom_line(linewidth = 1) +
    ylim(0, 1) +
    labs(
        title = "Optimal GAN discriminator is the density ratio",
        x = "x", y = "D(x)", color = NULL, linetype = NULL
    ) +
    theme(legend.position = "bottom")

Figure 42.2: The optimal GAN discriminator as a density ratio. The closed form curve $D^*(x) = p_{data}/(p_{data}+p_g)$ and a logistic regression fitted on real versus fake samples coincide, sitting near one where the data density dominates and near zero where the generator density dominates.

The closed form $D^*$ and the fitted classifier overlap closely. Where the data density dominates, $D^*$ is near one. Where the generator density dominates, $D^*$ is near zero. At the crossover point the two densities are equal and $D^* = 0.5$, which is the point a perfectly trained generator drives the whole curve toward. Table 42.2 confirms numerically that the value at the optimal discriminator equals $-\log 4 + 2\,D_{\mathrm{JS}}$.

Show code

# Numerically relate the value at D* to Jensen-Shannon divergence
mix <- function(x) 0.5 * (p_data(x) + p_g(x))
kl_int <- function(p, q) {
    integrate(function(x) {
        pp <- p(x); qq <- q(x)
        ifelse(pp > 0, pp * log(pp / qq), 0)
    }, -20, 20)$value
}
jsd <- 0.5 * kl_int(p_data, mix) + 0.5 * kl_int(p_g, mix)
value_at_opt <- -log(4) + 2 * jsd

knitr::kable(
    data.frame(
        quantity = c("JS divergence", "value -log4 + 2*JSD", "-log 4"),
        value = round(c(jsd, value_at_opt, -log(4)), 4)
    ),
    col.names = c("Quantity", "Value"),
    caption = "GAN value at the optimal discriminator equals -log4 + 2 times JS divergence."
)

Table 42.2: GAN value at the optimal discriminator equals -log4 + 2 times JS divergence.

Quantity	Value
JS divergence	0.2532
value -log4 + 2*JSD	-0.8799
-log 4	-1.3863

When the two distributions differ, the JS divergence is positive and the generator pays a penalty above the floor of $-\log 4$. The generator minimizes this penalty by matching $p_{\mathrm{data}}$, at which point the JS divergence is zero and the value reaches $-\log 4$.

42.6 Comparison

Now that each family has been seen on its own, it helps to line them up against each other. They trade off sample quality, training stability, whether they give a likelihood, and sampling speed, and no single one wins on every axis. Table Table 42.3 lays out the trade-offs side by side.

Show code

cmp <- data.frame(
    Property = c(
        "Sample quality",
        "Training stability",
        "Likelihood",
        "Sampling speed",
        "Latent space"
    ),
    VAE = c(
        "moderate, often blurry",
        "stable",
        "lower bound (ELBO)",
        "fast, one decoder pass",
        "smooth and structured"
    ),
    GAN = c(
        "sharp, high quality",
        "unstable, mode collapse",
        "none, implicit",
        "fast, one generator pass",
        "usable but less regular"
    ),
    Diffusion = c(
        "very high quality",
        "stable, simple MSE loss",
        "lower bound, tractable",
        "slow, many denoising steps",
        "no compact latent code"
    )
)
knitr::kable(cmp, caption = "VAE, GAN, and diffusion at a glance.")

Table 42.3: VAE, GAN, and diffusion at a glance.

Property	VAE	GAN	Diffusion
Sample quality	moderate, often blurry	sharp, high quality	very high quality
Training stability	stable	unstable, mode collapse	stable, simple MSE loss
Likelihood	lower bound (ELBO)	none, implicit	lower bound, tractable
Sampling speed	fast, one decoder pass	fast, one generator pass	slow, many denoising steps
Latent space	smooth and structured	usable but less regular	no compact latent code

In short: VAEs give a clean probabilistic latent space and stable training but softer samples; GANs give sharp samples at the cost of fragile training; and diffusion models give the best samples with stable training but pay for it at sampling time because every sample needs many sequential steps.

When to use this

Reach for a VAE when you want a smooth, interpretable latent space or a likelihood-style score (anomaly detection, representation learning). Reach for a GAN when sharp samples matter more than coverage or stability. Reach for a diffusion model when sample quality is the priority and you can afford slower generation. These trade-offs shift as the field moves, so treat the table as a snapshot rather than a verdict.

42.7 Further Reading

Kingma and Welling (2014), Auto-Encoding Variational Bayes. The original VAE paper with the ELBO and the reparameterization trick.
Goodfellow et al. (2014), Generative Adversarial Nets. Introduces the minimax game and proves the optimal discriminator and JS divergence result.
Sohl-Dickstein et al. (2015), Deep Unsupervised Learning using Nonequilibrium Thermodynamics. The first diffusion model for generative modeling.
Ho et al. (2020), Denoising Diffusion Probabilistic Models. The simplified noise prediction objective used in modern diffusion models.

The full neural network versions need a GPU and the Keras/TensorFlow stack, so those code blocks are shown for reference but not executed here. The base R demonstration captures the key mathematical facts, the forward diffusion process and the optimal GAN discriminator, that you can verify on your own machine.↩︎
The Kullback Leibler divergence $D_{\mathrm{KL}}(q \| p)$ measures how far one distribution $q$ is from another $p$. It is zero when $q = p$ and positive otherwise, which is why a lower bound that differs from the target by a KL term can never overshoot.↩︎
The Jensen Shannon divergence is a smoothed, symmetric cousin of the KL divergence: $D_{\mathrm{JS}}(p \| q) = \tfrac12 D_{\mathrm{KL}}(p \| m) + \tfrac12 D_{\mathrm{KL}}(q \| m)$ with $m = \tfrac12(p + q)$. Unlike KL it is always finite and symmetric in its two arguments.↩︎

# Generative Models {#sec-generative-models} ```{r} #| include: false source("_common.R") ``` Imagine you have a folder of ten thousand photographs of faces and you want a computer to produce a brand new face that no one has ever seen, yet looks completely real. A classifier cannot do this. A classifier reads a photo and returns a label such as "smiling" or "not smiling". It never has to understand what makes a face a face. To invent a new one, the computer needs a model of what faces look like in general, not just a rule for telling two kinds apart. That distinction, between a model that *labels* data and a model that can *produce* data, is the subject of this chapter. Most of this book studies discriminative models. A discriminative model learns $p(y \mid x)$ directly: given an input $x$, it returns a label or a number. A generative model instead learns the data distribution itself, either the joint $p(x, y)$ or the marginal $p(x)$. Once you have a model of $p(x)$, you can do more than classify. You can draw new samples that look like the training data, score how likely a point is, fill in missing parts of an input, and compress data into a compact code. ::: {.callout-tip title="Intuition"} A discriminative model learns where to draw the line between categories. A generative model learns the whole landscape the data live on, so it can walk around and produce new points anywhere on that landscape. ::: This chapter covers three families that dominate modern generative modeling: variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models. For each one you will see the intuition first, then the math that makes it precise, then a sketch of the code, and finally what the result means. The three families reach the same goal, a model you can sample from, by very different routes, and the closing comparison lays out the trade-offs side by side. The runnable demonstration near the end uses base R and ggplot2 only, so every number in it reproduces without a deep learning backend.^[The full neural network versions need a GPU and the Keras/TensorFlow stack, so those code blocks are shown for reference but not executed here. The base R demonstration captures the key mathematical facts, the forward diffusion process and the optimal GAN discriminator, that you can verify on your own machine.] ## Generative versus Discriminative Before meeting the three model families, it is worth pinning down precisely what "generative" means and why it is a different job from the classification and regression that occupy much of this book. The cleanest way in is to write the joint distribution two ways: $$ p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y). $$ A discriminative model targets the first factor $p(y \mid x)$ and ignores $p(x)$. Logistic regression, random forests, and the neural network classifiers from earlier chapters all sit here. They tend to predict well because they spend all of their capacity on the boundary between classes. A generative model targets $p(x \mid y)$ and $p(y)$, or just $p(x)$ when there are no labels. Classification then comes from Bayes rule, $$ p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}. $$ Naive Bayes (@sec-naive-bayes) and Gaussian discriminant analysis (@sec-discriminant-analysis) are classic examples. Modeling $p(x)$ is harder than modeling a boundary, because the model has to account for everything in the data, not just the part that separates classes. So why pay that price? Classification is only one of several payoffs. A model of $p(x)$ lets you: - sample new data, for example images, audio, or molecules, - detect outliers as points with low $p(x)$ (@sec-anomaly-detection), - impute missing features, - learn a useful low dimensional representation, - augment a small training set with synthetic examples. ::: {.callout-note} None of the three families below require the labels $y$. They learn the shape of $p(x)$ from inputs alone, which is why they are usually trained on unlabeled data, of which the world has far more than labeled data. ::: The three families take different routes to a model you can sample from. We start with the one closest to ideas already in this book, the autoencoder. ## Variational Autoencoders The first family stays closest to familiar territory. A variational autoencoder (VAE) is a latent variable model: it assumes each observed point $x$ is generated from an unobserved, low dimensional code $z$ that we never see directly. The generative story is $$ z \sim p(z) = \mathcal{N}(0, I), \qquad x \sim p_\theta(x \mid z), $$ so the marginal likelihood of a data point integrates over the latent variable, $$ p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz . $$ Here $z$ is the *latent code*, a short vector that captures the essence of $x$ (say, a handful of numbers summarizing a face), and $p_\theta(x \mid z)$ is the *decoder*, the network with parameters $\theta$ that turns a code back into a data point. This connects to the autoencoders chapter (@sec-autoencoders). A plain autoencoder maps $x$ to a code and back with no probabilistic meaning. A VAE keeps the encoder and decoder shape but makes both stochastic, which lets us sample new $x$ by drawing $z \sim \mathcal{N}(0, I)$ and passing it through the decoder. ::: {.callout-important title="Key idea"} A plain autoencoder compresses and reconstructs. A VAE turns that same architecture into a generator by insisting the codes follow a known distribution, the standard normal, so you can sample a fresh code and decode it into a new, plausible $x$. ::: ### The problem with the integral There is a snag. To train the model we would like to maximize $p_\theta(x)$, but the integral over $z$ has no closed form for a neural network decoder, and the posterior $p_\theta(z \mid x) = p_\theta(x \mid z)\, p(z) / p_\theta(x)$ (which codes are likely to have produced a given $x$) is equally hard. The VAE works around both by introducing a second network, an encoder $q_\phi(z \mid x)$, that *approximates* the true posterior. The encoder outputs a mean and a variance, $$ q_\phi(z \mid x) = \mathcal{N}\!\big(\mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\big). $$ ### The evidence lower bound Because we cannot maximize $\log p_\theta(x)$ directly, we maximize a quantity that is guaranteed to sit just below it and is computable. Start from the log marginal likelihood and insert $q_\phi$: $$ \log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]. $$ Jensen inequality moves the log inside the expectation and gives a lower bound, the evidence lower bound or ELBO: $$ \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big) =: \mathcal{L}(\theta, \phi; x). $$ It is worth deriving the same bound a second way, by an exact identity rather than an inequality, because the algebra reveals precisely what is lost. For any $q_\phi(z \mid x)$ with the same support as the posterior, multiply and divide inside the log and split the expectation: $$ \begin{aligned} \log p_\theta(x) &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x)\big] \\ &= \mathbb{E}_{q_\phi}\!\left[\log \frac{p_\theta(x, z)}{p_\theta(z \mid x)}\right] \\ &= \mathbb{E}_{q_\phi}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] + \mathbb{E}_{q_\phi}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] \\ &= \mathcal{L}(\theta, \phi; x) + D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big), \end{aligned} $$ {#eq-generative-models-elbo-decomp} where the first line uses that $\log p_\theta(x)$ does not depend on $z$, the second uses $p_\theta(x) = p_\theta(x, z) / p_\theta(z \mid x)$, and the last line recognizes $\mathbb{E}_{q_\phi}[\log p_\theta(x, z) - \log q_\phi] = \mathcal{L}$ once $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ is expanded. Equation @eq-generative-models-elbo-decomp is exact: the log evidence equals the ELBO plus a non negative KL term, so the ELBO is a lower bound, and the inequality is tight exactly when $q_\phi$ equals the true posterior. The gap between $\log p_\theta(x)$ and the ELBO is exactly $D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x))$, which is non negative,^[The Kullback Leibler divergence $D_{\mathrm{KL}}(q \| p)$ measures how far one distribution $q$ is from another $p$. It is zero when $q = p$ and positive otherwise, which is why a lower bound that differs from the target by a KL term can never overshoot.] so maximizing $\mathcal{L}$ does two good things at once: it pushes the bound up and it makes the approximate posterior close to the true one. Read the two terms as a loss. The first term is a reconstruction term: it rewards a decoder that assigns high probability to the actual $x$ given its code. The second term is a Kullback Leibler penalty that pulls the encoded distribution toward the prior $\mathcal{N}(0, I)$. The KL term acts as a regularizer. Without it the encoder could place every input in its own tiny region of latent space, which would reconstruct well but leave gaps that decode to nothing. The penalty keeps the latent space packed near the prior so that random draws decode into plausible samples. ::: {.callout-tip title="Intuition"} Picture the reconstruction term as "remember the data" and the KL term as "stay tidy". Pull only on the first and the model memorizes inputs into scattered, unusable codes. Pull only on the second and every input collapses to the same blurry average. The ELBO balances the two. ::: When both $q_\phi(z \mid x)$ and $p(z)$ are Gaussian the KL term has a closed form, so there is nothing to estimate by sampling. For a $d$ dimensional diagonal Gaussian, $$ D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\big) = \tfrac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big). $$ {#eq-generative-models-gauss-kl} The closed form is worth deriving once, because it explains why each term has the sign it does. The KL divergence between two univariate Gaussians $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$ is $$ D_{\mathrm{KL}} = \mathbb{E}_{q}\!\left[\log \frac{q(z)}{p(z)}\right] = \mathbb{E}_q\!\left[\log\frac{\sigma_2}{\sigma_1} - \frac{(z - \mu_1)^2}{2\sigma_1^2} + \frac{(z - \mu_2)^2}{2\sigma_2^2}\right]. $$ Under $q$ we have $\mathbb{E}_q[(z - \mu_1)^2] = \sigma_1^2$ and $\mathbb{E}_q[(z - \mu_2)^2] = \sigma_1^2 + (\mu_1 - \mu_2)^2$, so the expectation collapses to $$ D_{\mathrm{KL}} = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac12 . $$ Setting the prior to the standard normal, $\mu_2 = 0$ and $\sigma_2 = 1$, and summing over the $d$ independent coordinates of the diagonal posterior gives @eq-generative-models-gauss-kl. The quadratic terms $\mu_j^2$ and $\sigma_j^2$ penalize codes that drift from the origin or spread too wide, while the $-\log \sigma_j^2$ term forbids the variance from collapsing to zero, which would make the posterior a point mass and the latent space unusable for sampling. ::: {.callout-warning title="Posterior collapse"} A characteristic VAE failure is posterior collapse, where for some latent coordinates the encoder ignores the input and returns $q_\phi(z_j \mid x) \approx p(z_j) = \mathcal{N}(0, 1)$, driving the KL term in @eq-generative-models-gauss-kl to zero. The decoder then learns to reconstruct without using those coordinates, so they carry no information. This happens most often when the decoder is powerful enough to model $x$ on its own (for example an autoregressive decoder) and early in training when the reconstruction term is weak. Standard remedies are KL annealing (warm up the KL weight from zero), free bits (clamp each coordinate's KL at a small floor $\lambda$ so the optimizer stops shrinking it once it is informative enough), or the $\beta$-VAE weighting below. ::: ### Amortized inference and the $\beta$-VAE The encoder $q_\phi(z \mid x)$ deserves a second look. Classical variational inference fits a separate set of variational parameters $(\mu_i, \sigma_i)$ for every data point $x_i$, solving an optimization per example. The VAE instead trains a single network that maps any $x$ to its $(\mu_\phi(x), \sigma_\phi(x))$, sharing parameters across the whole dataset. This is amortized inference: the cost of inference is amortized over the dataset by paying once to train the encoder, after which a new point's approximate posterior is a single forward pass. The price is the amortization gap, the extra slack between the ELBO achieved by the shared encoder and the best ELBO a per example optimizer could reach. Together with the approximation gap from $q_\phi$ being Gaussian rather than the true posterior, this is why the ELBO can sit well below $\log p_\theta(x)$. The full training objective is the ELBO averaged over the data, and the gradient with respect to $\theta$ and $\phi$ is estimated by drawing a minibatch and one reparameterized $\epsilon$ per example. A widely used generalization scales the KL term by a constant $\beta$, $$ \mathcal{L}_\beta(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big). $$ {#eq-generative-models-beta-vae} With $\beta = 1$ this is the exact ELBO. Setting $\beta > 1$ presses the aggregated posterior harder toward the factorized prior, which tends to produce more disentangled latent coordinates at the cost of reconstruction fidelity, while $\beta < 1$ favors sharper reconstructions and risks the scattered, gap riddled latent space the KL term is meant to prevent. The reconstruction term has its own implicit weight: with a Gaussian decoder of fixed variance $\sigma_x^2$, the negative log likelihood is $\|x - \hat x\|^2 / (2\sigma_x^2)$ up to a constant, so $\sigma_x^2$ acts as an inverse weight on reconstruction relative to the KL term. This is why a plain mean squared error decoder with no explicit variance implicitly fixes a particular trade-off. ### The reparameterization trick To train with gradient descent we need gradients of $\mathcal{L}$ with respect to the encoder parameters $\phi$. The trouble is that the expectation is taken over a distribution that depends on $\phi$, so we cannot just sample $z$ and backpropagate through the sampling step. The reparameterization trick rewrites the sample as a deterministic function of $\phi$ plus external noise, $$ z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), $$ where $\odot$ is elementwise multiplication. Now the randomness lives in $\epsilon$, which does not depend on $\phi$, and the path from $\phi$ to $z$ is differentiable. The Monte Carlo estimate of the ELBO using one or a few draws of $\epsilon$ then has a well defined gradient. To see precisely why this matters, compare two ways of differentiating an expectation $\mathbb{E}_{q_\phi(z)}[f(z)]$ whose distribution depends on $\phi$. The score function or REINFORCE estimator uses the identity $\nabla_\phi q_\phi = q_\phi\, \nabla_\phi \log q_\phi$ to write $$ \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{q_\phi}\!\big[f(z)\, \nabla_\phi \log q_\phi(z)\big], $$ which is unbiased and needs only $\log q_\phi$, not a differentiable $f$, but is notoriously high variance because $f(z)$ multiplies the score with no cancellation. The reparameterization estimator instead pushes the parameter inside a deterministic map $z = g_\phi(\epsilon)$ with $\epsilon \sim p(\epsilon)$ fixed, giving $$ \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\big[\nabla_\phi f(g_\phi(\epsilon))\big] = \mathbb{E}_{p(\epsilon)}\!\big[f'(z)\, \nabla_\phi g_\phi(\epsilon)\big]. $$ The Jacobian $\nabla_\phi g_\phi$ couples the gradient of $f$ directly to the parameter, so the estimator exploits the smoothness of the decoder and typically has far lower variance, often by orders of magnitude. The cost is that it requires $f$ (here the integrand of the ELBO) to be differentiable in $z$, which holds for continuous data and a Gaussian $q_\phi$ but fails for discrete $z$, where one falls back on the score function estimator or a continuous relaxation such as the Gumbel-Softmax. ::: {.callout-important title="Key idea"} You cannot differentiate through "draw a random sample", but you can differentiate through "scale and shift a fixed random sample". The reparameterization trick is exactly that move, and it is what makes the VAE trainable by ordinary backpropagation. ::: The next block sketches a VAE in Keras. It is shown for reference and is marked `eval = FALSE` because it needs the Keras/TensorFlow backend, which is not available in this build. ```{r vae-keras, eval = FALSE} library(keras) latent_dim <- 2 # Encoder outputs mean and log variance of q(z | x) encoder_in <- layer_input(shape = 784) h <- encoder_in %>% layer_dense(256, activation = "relu") z_mean <- h %>% layer_dense(latent_dim) z_log_var <- h %>% layer_dense(latent_dim) # Reparameterization: z = mean + exp(log_var / 2) * epsilon sampling <- function(args) { z_mean <- args[[1]] z_log_var <- args[[2]] eps <- k_random_normal(shape = k_shape(z_mean)) z_mean + k_exp(0.5 * z_log_var) * eps } z <- list(z_mean, z_log_var) %>% layer_lambda(sampling) # Decoder maps z back to x decoder <- keras_model_sequential() %>% layer_dense(256, activation = "relu", input_shape = latent_dim) %>% layer_dense(784, activation = "sigmoid") x_decoded <- decoder(z) vae <- keras_model(encoder_in, x_decoded) # Loss = reconstruction + KL regularizer recon <- 784 * loss_binary_crossentropy(encoder_in, x_decoded) kl <- -0.5 * k_sum(1 + z_log_var - k_square(z_mean) - k_exp(z_log_var), axis = -1L) vae$add_loss(k_mean(recon + kl)) vae %>% compile(optimizer = "adam") ``` ## Generative Adversarial Networks The VAE works hard to keep an explicit, if approximate, handle on the likelihood. A generative adversarial network (GAN) throws that away and asks a different question: instead of "how probable is this data point?", it asks "can anyone tell my fakes from the real thing?" Think of an art forger and a detective. The forger (the generator) paints fakes; the detective (the discriminator) tries to spot them. Each makes the other better. When the forger finally wins, the fakes are indistinguishable from genuine work. Concretely, two networks compete. A generator $G$ maps noise $z \sim p(z)$ to a fake sample $G(z)$. A discriminator $D$ takes a sample and outputs the probability that it is real. Training is a minimax game, $$ \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]. $$ The discriminator wants to push $D(x)$ toward one on real data and toward zero on fakes. The generator wants the opposite, to make $D(G(z))$ close to one. The $\min_G \max_D$ notation reads literally: the inner $\max_D$ trains the best detective it can, and the outer $\min_G$ trains the forger to defeat that best detective. ::: {.callout-note} A GAN never writes down or evaluates $p(x)$. It is an *implicit* generative model: it can hand you samples but cannot tell you how likely any particular point is. That is the price of skipping the likelihood. ::: ### The optimal discriminator It helps to ask what the perfect detective would do, holding the forger fixed. Fix the generator, so the distribution of fakes $p_g$ is fixed. Write the value function as a single integral over $x$, $$ V(D) = \int_x \Big( p_{\mathrm{data}}(x) \log D(x) + p_g(x) \log\big(1 - D(x)\big) \Big)\, dx . $$ For each $x$ the integrand has the form $a \log d + b \log(1 - d)$, which is maximized at $d = a / (a + b)$. The optimal discriminator is therefore the density ratio $$ D^*(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)} . $$ Substitute $D^*$ back into the value function. After collecting terms it becomes $$ V(G, D^*) = -\log 4 + 2\, D_{\mathrm{JS}}\!\big(p_{\mathrm{data}} \,\|\, p_g\big), $$ {#eq-generative-models-gan-js} The collecting of terms is short enough to show. Write $m = \tfrac12(p_{\mathrm{data}} + p_g)$ for the mixture. Substituting $D^* = p_{\mathrm{data}} / (2m)$ and $1 - D^* = p_g / (2m)$ gives $$ \begin{aligned} V(G, D^*) &= \mathbb{E}_{p_{\mathrm{data}}}\!\left[\log \frac{p_{\mathrm{data}}}{2m}\right] + \mathbb{E}_{p_g}\!\left[\log \frac{p_g}{2m}\right] \\ &= \mathbb{E}_{p_{\mathrm{data}}}\!\left[\log \frac{p_{\mathrm{data}}}{m}\right] + \mathbb{E}_{p_g}\!\left[\log \frac{p_g}{m}\right] - 2\log 2 \\ &= D_{\mathrm{KL}}(p_{\mathrm{data}} \,\|\, m) + D_{\mathrm{KL}}(p_g \,\|\, m) - 2\log 2 \\ &= 2\, D_{\mathrm{JS}}(p_{\mathrm{data}} \,\|\, p_g) - \log 4, \end{aligned} $$ using $D_{\mathrm{JS}}(p \| q) = \tfrac12 D_{\mathrm{KL}}(p \| m) + \tfrac12 D_{\mathrm{KL}}(q \| m)$ in the last line. This is @eq-generative-models-gan-js. Because $D_{\mathrm{JS}} \ge 0$ with equality iff $p_g = p_{\mathrm{data}}$, the outer minimization over $G$ drives $p_g$ to the data distribution and the value to its floor $-\log 4$. where $D_{\mathrm{JS}}$ is the Jensen Shannon divergence, a symmetric measure of how far two distributions are apart.^[The Jensen Shannon divergence is a smoothed, symmetric cousin of the KL divergence: $D_{\mathrm{JS}}(p \| q) = \tfrac12 D_{\mathrm{KL}}(p \| m) + \tfrac12 D_{\mathrm{KL}}(q \| m)$ with $m = \tfrac12(p + q)$. Unlike KL it is always finite and symmetric in its two arguments.] Since the JS divergence is zero only when $p_g = p_{\mathrm{data}}$, the generator that wins the game matches the data distribution exactly, with the value reaching its minimum of $-\log 4$. ::: {.callout-important title="Key idea"} Underneath the forger-versus-detective story, the GAN is secretly minimizing a divergence between the fake and real distributions. The adversarial game is just a clever, sample-only way to estimate and descend that divergence without ever computing a density. ::: The demonstration below evaluates $D^*(x)$ for two Gaussians so the density ratio result is visible as a concrete curve. ### Mode collapse and instability The clean theory above assumes both networks are trained to optimality at each step, which never happens in practice. Two failures are common, and knowing their names is half of diagnosing a misbehaving GAN. The first is *mode collapse*: the generator discovers a few outputs that reliably fool the current discriminator and produces only those, ignoring whole regions of the data. The samples look fine one at a time but lack variety, like a forger who only knows how to paint one convincing painting. The second is *training instability*: the two networks chase a moving target. If the discriminator gets too strong, the term $\log(1 - D(G(z)))$ saturates and the generator gradient vanishes, so the forger stops learning precisely when it is losing badly. A common fix is to have the generator maximize $\log D(G(z))$ instead, which gives stronger gradients early in training. Balancing the two networks, tuning learning rates, and choosing alternative losses such as the Wasserstein objective are all standard parts of getting a GAN to converge. The saturation is easy to make precise. Let $u = D(G(z))$ be the discriminator's probability that a fake is real, small early in training when fakes are obvious. The original generator loss $\log(1 - u)$ has derivative $-1/(1 - u)$, which is near $-1$ when $u \approx 0$: a tiny, nearly constant gradient exactly when the generator most needs a push. The non saturating surrogate maximizes $\log u$, whose derivative $1/u$ blows up as $u \to 0$, supplying a large gradient when the fakes are poor. Both losses share the same fixed point but have opposite gradient behavior in the regime that matters early on, which is why the non saturating form is the practical default. ### Why JS divergence is a brittle objective The JS divergence result @eq-generative-models-gan-js is elegant but also diagnoses the deeper instability. When $p_{\mathrm{data}}$ and $p_g$ live on low dimensional manifolds in a high dimensional space, as natural images do, their supports are almost surely disjoint or overlap on a set of measure zero early in training. On disjoint supports a perfect discriminator exists, $D_{\mathrm{JS}}$ is pinned at its maximum $\log 2$ regardless of how far apart the two distributions are, and its gradient with respect to the generator is zero almost everywhere. The generator receives no signal about which direction would move $p_g$ toward $p_{\mathrm{data}}$. This is the manifold view of why a too strong discriminator kills learning. The Wasserstein GAN replaces JS with the earth mover (Wasserstein-1) distance, which by Kantorovich-Rubinstein duality equals $$ W(p_{\mathrm{data}}, p_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{p_{\mathrm{data}}}[f(x)] - \mathbb{E}_{p_g}[f(x)], $$ {#eq-generative-models-wgan} where the supremum is over all 1-Lipschitz functions $f$. Unlike JS, $W$ varies smoothly with the distance between $p_g$ and $p_{\mathrm{data}}$ even when their supports do not overlap, so it provides a usable gradient everywhere. In practice $f$ (now called a critic, with no sigmoid and a real valued output) is constrained to be Lipschitz by weight clipping or, more reliably, a gradient penalty that pushes $\|\nabla_x f\|$ toward one. The critic loss in @eq-generative-models-wgan also tracks sample quality, which the original GAN loss does not, so it partially answers the warning below. ::: {.callout-warning} Unlike the VAE loss, a GAN loss curve does not tell you much. The generator and discriminator losses can stay flat while sample quality improves or collapses. Judge a GAN by looking at its samples, not by watching the loss go down. ::: A minimal GAN training loop in Keras follows. It is marked `eval = FALSE` because it needs the Keras/TensorFlow backend. ```{r gan-keras, eval = FALSE} library(keras) generator <- keras_model_sequential() %>% layer_dense(128, activation = "relu", input_shape = 100) %>% layer_dense(784, activation = "tanh") discriminator <- keras_model_sequential() %>% layer_dense(128, activation = "relu", input_shape = 784) %>% layer_dense(1, activation = "sigmoid") discriminator %>% compile(optimizer = "adam", loss = "binary_crossentropy") # Combined model trains the generator with the discriminator frozen freeze_weights(discriminator) gan_in <- layer_input(shape = 100) gan_out <- gan_in %>% generator() %>% discriminator() gan <- keras_model(gan_in, gan_out) gan %>% compile(optimizer = "adam", loss = "binary_crossentropy") # Each step: update D on real plus fake, then update G through the frozen D. ``` ## Diffusion Models Here is a different idea entirely. Take a clear photo and add a tiny bit of static noise, then a little more, then more, until after many steps nothing is left but television snow. That destruction is easy and needs no learning. Now suppose you could run the tape backward, removing a little noise at each step. If you could learn that one denoising step, you could start from pure static and walk all the way back to a brand new clear photo. That is the whole idea of a diffusion model. Diffusion models generate data by learning to reverse a gradual noising process. The idea goes back to Sohl-Dickstein et al. (2015) and was turned into a practical image generator by Ho et al. (2020). The forward process slowly destroys structure in the data by adding Gaussian noise over $T$ steps. The reverse process, learned by a network, removes noise one step at a time, turning pure noise back into a sample. ::: {.callout-tip title="Intuition"} Noising is trivial and requires no model. The model only has to learn the *reverse*, and because each reverse step undoes just a sliver of noise, it is a much easier task than producing a whole image in one shot. ::: ### Forward noising process Fix a variance schedule $\beta_1, \dots, \beta_T$ with small positive values. The forward process is a Markov chain that adds a little noise at each step, $$ q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(\sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\big). $$ The scaling by $\sqrt{1 - \beta_t}$ shrinks the previous value just enough that the variance stays controlled as noise accumulates, rather than growing without bound. After many steps $x_T$ is essentially standard Gaussian noise with all structure gone. ### Closed form for any step Simulating the chain step by step would be slow during training. A convenient property rescues us: you can jump from the clean data $x_0$ to any noised version $x_t$ in a single shot, without running the intermediate steps. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then $$ q(x_t \mid x_0) = \mathcal{N}\!\big(\sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t) I\big), $$ which can be written with an explicit noise variable, $$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). $$ As $t$ grows, $\bar{\alpha}_t \to 0$, so the signal $\sqrt{\bar{\alpha}_t}\, x_0$ fades and the noise term takes over. The closed form follows from composing two Gaussian steps. Write $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}$ with independent $\epsilon_{t-1} \sim \mathcal{N}(0, I)$. Substitute the same identity for $x_{t-1}$: $$ x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})}\, \epsilon_{t-2} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}. $$ The last two terms are independent zero mean Gaussians, so they sum to a single Gaussian whose variance is the sum of variances, $\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$. Hence $x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar\epsilon$ with $\bar\epsilon \sim \mathcal{N}(0, I)$. Iterating this merging of Gaussians all the way down to $x_0$ replaces the product of step coefficients with $\bar\alpha_t = \prod_{s=1}^t \alpha_s$ and leaves total variance $1 - \bar\alpha_t$, which is exactly the stated closed form. The choice of scaling by $\sqrt{1 - \beta_t}$ rather than any other constant is what makes the noised variance telescope to $1 - \bar\alpha_t$ and keeps $\mathrm{Var}(x_t)$ bounded by one when the data are standardized. ::: {.callout-tip} This closed form is also what makes the runnable demonstration cheap. To show data dissolving at step $t$, we never simulate $t$ separate noising steps. We apply the one formula above once. ::: The simulation later in the chapter uses exactly this closed form to push two dimensional data toward noise. ### Reverse denoising model To generate, we run the chain backward. The reverse step is also modeled as a Gaussian, $$ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(\mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\big), $$ with the mean produced by a network. Training the reverse process can be derived as a variational bound similar to the ELBO, but Ho et al. (2020) showed a simpler reparameterization. Rather than predicting the mean directly, the network predicts the noise $\epsilon$ that was added. Given the closed form above, a network $\epsilon_\theta(x_t, t)$ is trained to recover that noise, and the training objective collapses to a plain regression on the noise, $$ \mathcal{L}_{\mathrm{simple}}(\theta) = \mathbb{E}_{x_0, \epsilon, t}\Big[\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\big)\big\|^2\Big]. $$ {#eq-generative-models-simple-loss} ### From the variational bound to the simple loss The simple objective is not an ad hoc choice, it is a reweighted version of a genuine ELBO. The variational bound on the negative log likelihood factors over the chain into three groups of terms, $$ -\log p_\theta(x_0) \le \mathbb{E}_q\Big[ \underbrace{D_{\mathrm{KL}}(q(x_T \mid x_0) \,\|\, p(x_T))}_{L_T} + \sum_{t > 1} \underbrace{D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))}_{L_{t-1}} \underbrace{- \log p_\theta(x_0 \mid x_1)}_{L_0} \Big], $$ {#eq-generative-models-diff-elbo} obtained by the same multiply-and-divide trick used for the VAE, now applied to the joint over the whole trajectory $x_{1:T}$. The term $L_T$ has no learnable parameters (the forward process is fixed and $x_T$ is essentially the prior $\mathcal{N}(0, I)$), so it is a constant. Each $L_{t-1}$ compares the learned reverse step against the forward posterior $q(x_{t-1} \mid x_t, x_0)$, which, crucially, is Gaussian in closed form. By Bayes rule applied to the Gaussian forward steps, $$ q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\big(\tilde\mu_t(x_t, x_0),\; \tilde\beta_t I\big), \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t, $$ with posterior mean $$ \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\, x_t . $$ Because both distributions in $L_{t-1}$ are Gaussian with the same (fixed) covariance, the KL reduces to a scaled squared distance between their means, $L_{t-1} = \tfrac{1}{2\tilde\beta_t}\|\tilde\mu_t - \mu_\theta(x_t, t)\|^2 + \text{const}$. Now substitute the closed form $x_0 = (x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon)/\sqrt{\bar\alpha_t}$ into $\tilde\mu_t$ and parameterize the network's mean in the matching form $$ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\Big). $$ {#eq-generative-models-mu-eps} The two means then differ only through $\epsilon - \epsilon_\theta$, and the squared difference becomes $$ L_{t-1} = \frac{\beta_t^2}{2\tilde\beta_t\,\alpha_t (1 - \bar\alpha_t)} \big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2 + \text{const}. $$ {#eq-generative-models-Lt-weighted} This is the per timestep noise prediction loss, weighted by a $t$ dependent factor. Ho et al. (2020) found that simply dropping that factor (setting it to one) and averaging over $t$ uniformly, which yields exactly @eq-generative-models-simple-loss, trains better and is the form used in practice. The reweighting downweights the very small $t$ terms, which correspond to imperceptible high frequency detail, and lets the network spend capacity on the harder, more visually important denoising at larger $t$. In words, the network is shown a noised point and the timestep it came from, and its only job is to guess the noise that was mixed in. Sampling then alternates a denoising step using $\epsilon_\theta$ with a small injection of fresh noise, from $t = T$ down to $t = 1$. The sampling step is the reverse parameterization @eq-generative-models-mu-eps made concrete. Starting from $x_T \sim \mathcal{N}(0, I)$, each step computes $$ x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\Big) + \sigma_t\, z, \qquad z \sim \mathcal{N}(0, I), $$ {#eq-generative-models-sampler} with $z = 0$ on the final step. The injected variance is typically $\sigma_t^2 = \tilde\beta_t$ or the simpler $\sigma_t^2 = \beta_t$; both are consistent with the bound and give similar samples. The fresh noise $\sigma_t z$ is essential: without it the chain would collapse to the conditional mean at every step and produce oversmoothed, low diversity samples, the diffusion analogue of the blurry VAE failure. ### Connection to score matching The noise prediction network is, up to scaling, an estimate of the score of the noised data, the gradient of the log density. From the closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$, the score of the marginal satisfies $\nabla_{x_t} \log q(x_t \mid x_0) = -\epsilon / \sqrt{1 - \bar\alpha_t}$, so a network trained to predict $\epsilon$ is equivalently learning $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar\alpha_t}\, \nabla_{x_t} \log q(x_t)$. This is the bridge to the score based view of Song and Ermon, in which sampling is Langevin dynamics that follows the learned score uphill in density, and to the continuous time formulation where the forward chain is a stochastic differential equation and sampling solves its reverse SDE (or a deterministic probability flow ODE). The ODE view is the basis of fast deterministic samplers such as DDIM, which reduce the hundreds of sequential steps of @eq-generative-models-sampler to a few dozen by taking larger, non Markovian strides. ::: {.callout-note title="Why sampling is the bottleneck"} Training cost is one network evaluation per example per step, comparable to other deep models. The expense is at generation: producing a single sample requires $T$ sequential forward passes through $\epsilon_\theta$ (often $T = 1000$), each depending on the previous, with no parallelism across steps. This is the fundamental reason diffusion sampling is orders of magnitude slower than a VAE or GAN, which need one pass, and why distillation and ODE solvers that cut the step count are an active area. ::: ::: {.callout-important title="Key idea"} The forbidding-looking diffusion objective is, in the end, plain regression: predict the noise with mean squared error. That stability is a big part of why diffusion models train more reliably than GANs, which have no single well-behaved loss to descend. ::: A sketch of the noise prediction loss in Keras follows. It is marked `eval = FALSE` because it depends on the Keras backend. ```{r diffusion-keras, eval = FALSE} library(keras) # eps_model(x_t, t) predicts the noise added to reach x_t. # Training: sample x0 from data, t uniform in 1..T, epsilon ~ N(0, I), # form x_t = sqrt(abar[t]) * x0 + sqrt(1 - abar[t]) * epsilon, # then minimize mean((epsilon - eps_model(x_t, t))^2). train_step <- function(x0, abar, T) { t <- sample.int(T, nrow(x0), replace = TRUE) eps <- matrix(rnorm(length(x0)), nrow = nrow(x0)) a <- sqrt(abar[t]) b <- sqrt(1 - abar[t]) x_t <- a * x0 + b * eps # loss = mean((eps - eps_model(x_t, t))^2); backprop with optimizer } ``` ## Runnable Demonstration The neural network versions above need a GPU and a deep learning backend, but two of the central mathematical facts can be checked exactly with base R and ggplot2 on a laptop. This section does two things. First it runs the forward diffusion process on a two dimensional Gaussian mixture and watches the data dissolve into noise, confirming the closed form from the diffusion section. Second it evaluates the optimal GAN discriminator for two one dimensional Gaussians and confirms the density ratio formula $D^*(x) = p_{\mathrm{data}} / (p_{\mathrm{data}} + p_g)$. ::: {.callout-tip title="When to use this"} Treat the demonstration as a sanity check on the theory, not as a recipe for generating images. It isolates the math, the noising schedule and the optimal discriminator, in a setting small enough to verify by eye and by number. ::: ```{r gm-setup} library(ggplot2) set.seed(1) ``` ### Forward diffusion on a 2-D mixture We sample a small mixture of three Gaussians, then apply the closed form $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ at a few timesteps. @fig-generative-models-diffusion-forward shows the result: the three clear modes at $t = 0$ dissolve toward a single noise cloud as $t$ increases. ```{r fig-generative-models-diffusion-forward, fig.width = 7, fig.height = 5, fig.cap = "Forward diffusion applied to a two dimensional Gaussian mixture. The three modes visible at step zero degrade into standard Gaussian noise by step forty as the closed form mixes in progressively more noise."} # Sample a 2-D Gaussian mixture as the data distribution p(x0) n <- 600 centers <- matrix(c(-3, 0, 3, 0, 0, 3), ncol = 2, byrow = TRUE) comp <- sample(1:3, n, replace = TRUE) x0 <- centers[comp, ] + matrix(rnorm(2 * n, sd = 0.35), ncol = 2) # Linear variance schedule and cumulative product alpha_bar T_steps <- 40 beta <- seq(1e-4, 0.06, length.out = T_steps) alpha <- 1 - beta alpha_bar <- cumprod(alpha) # Apply the closed form at selected steps steps_to_show <- c(0, 5, 15, 40) make_frame <- function(t) { if (t == 0) { xt <- x0 } else { ab <- alpha_bar[t] eps <- matrix(rnorm(2 * n), ncol = 2) xt <- sqrt(ab) * x0 + sqrt(1 - ab) * eps } data.frame(x = xt[, 1], y = xt[, 2], step = paste0("t = ", t)) } frames <- do.call(rbind, lapply(steps_to_show, make_frame)) frames$step <- factor(frames$step, levels = paste0("t = ", steps_to_show)) ggplot(frames, aes(x, y)) + geom_point(alpha = 0.4, size = 0.8, color = "steelblue") + facet_wrap(~step, nrow = 1) + coord_equal() + labs( title = "Forward diffusion: structured data degrading to noise", x = NULL, y = NULL ) ``` At $t = 0$ the three modes are clear. By $t = 40$ the cumulative product $\bar\alpha_t$ is near zero, so the signal is gone and the points look like standard Gaussian noise. To quantify the loss of structure, we report two summaries per step: the total variance of the cloud, and a Kullback Leibler divergence between the marginal of the first coordinate at step $t$ and a standard normal, estimated from histograms. Both should move toward the noise reference as $t$ grows. Table @tbl-generative-models-diffusion-decay reports these summaries at selected steps. ```{r tbl-generative-models-diffusion-decay} # KL(p || q) for two discrete histograms on a common grid kl_hist <- function(samp, ref_density, breaks) { p <- hist(samp, breaks = breaks, plot = FALSE)$density mids <- head(breaks, -1) + diff(breaks) / 2 q <- ref_density(mids) p <- p / sum(p) q <- q / sum(q) keep <- p > 0 & q > 0 sum(p[keep] * log(p[keep] / q[keep])) } grid <- seq(-6, 6, length.out = 41) ref <- function(m) dnorm(m) summ <- lapply(0:T_steps, function(t) { if (t == 0) { xt <- x0 } else { ab <- alpha_bar[t] eps <- matrix(rnorm(2 * n), ncol = 2) xt <- sqrt(ab) * x0 + sqrt(1 - ab) * eps } data.frame( step = t, total_var = round(sum(apply(xt, 2, var)), 3), kl_to_normal = round(kl_hist(xt[, 1], ref, grid), 3) ) }) diffusion_tab <- do.call(rbind, summ) diffusion_tab <- diffusion_tab[diffusion_tab$step %in% c(0, 5, 10, 20, 30, 40), ] knitr::kable( diffusion_tab, row.names = FALSE, col.names = c("Step t", "Total variance", "KL of x1 to N(0,1)"), caption = "Structure decays as the forward process adds noise." ) ``` The KL divergence to a standard normal falls steeply over the first several steps and approaches zero, confirming that the marginal becomes standard Gaussian as the chain runs. ### Verifying the forward posterior mean The derivation of the simple loss hinged on the forward posterior $q(x_{t-1} \mid x_t, x_0)$ being Gaussian with mean $\tilde\mu_t(x_t, x_0)$. We can confirm that mean numerically without any network. Fix $x_0$ and a step $t$, generate many pairs $(x_{t-1}, x_t)$ from the forward chain, then estimate $\mathbb{E}[x_{t-1} \mid x_t \approx x_t^\star, x_0]$ by averaging the $x_{t-1}$ whose $x_t$ landed near a target value, and compare against the closed form. ```{r gm-posterior-mean} t <- 20 x0_scalar <- 2.0 at <- alpha[t] abar_t <- alpha_bar[t] abar_tm1 <- alpha_bar[t - 1] bt <- beta[t] # Closed-form forward posterior mean tilde_mu_t at a chosen x_t = xt_star xt_star <- 1.0 mu_closed <- (sqrt(abar_tm1) * bt) / (1 - abar_t) * x0_scalar + (sqrt(at) * (1 - abar_tm1)) / (1 - abar_t) * xt_star # Monte Carlo: simulate the two-step forward chain from x0, bin on x_t N <- 4e6 xtm1 <- sqrt(abar_tm1) * x0_scalar + sqrt(1 - abar_tm1) * rnorm(N) xt <- sqrt(at) * xtm1 + sqrt(bt) * rnorm(N) near <- abs(xt - xt_star) < 0.02 mu_mc <- mean(xtm1[near]) knitr::kable( data.frame( quantity = c("closed-form tilde_mu_t", "Monte Carlo estimate"), value = round(c(mu_closed, mu_mc), 4) ), col.names = c("Quantity", "Value"), caption = "Closed-form forward posterior mean matches the simulated conditional mean." ) ``` The two values agree to within Monte Carlo error, confirming the Gaussian posterior mean that the variational bound relies on. ### Optimal GAN discriminator as a density ratio Take real data as $\mathcal{N}(1, 1)$ and the generator distribution as $\mathcal{N}(-1, 1.5^2)$. The theory says the best discriminator is $D^*(x) = p_{\mathrm{data}}(x) / (p_{\mathrm{data}}(x) + p_g(x))$. We compute that curve directly and also fit a discriminator by logistic regression on real versus fake samples to check that the trained classifier recovers the same shape. @fig-generative-models-gan-discriminator overlays the closed form curve and the fitted classifier. ```{r fig-generative-models-gan-discriminator, fig.width = 7, fig.height = 4, fig.cap = "The optimal GAN discriminator as a density ratio. The closed form curve $D^*(x) = p_{data}/(p_{data}+p_g)$ and a logistic regression fitted on real versus fake samples coincide, sitting near one where the data density dominates and near zero where the generator density dominates."} # True densities p_data <- function(x) dnorm(x, mean = 1, sd = 1.0) p_g <- function(x) dnorm(x, mean = -1, sd = 1.5) xs <- seq(-7, 7, length.out = 400) d_star <- p_data(xs) / (p_data(xs) + p_g(xs)) # Empirical check: label real = 1, fake = 0, fit logistic regression m <- 4000 real <- rnorm(m, mean = 1, sd = 1.0) fake <- rnorm(m, mean = -1, sd = 1.5) df <- data.frame( x = c(real, fake), label = c(rep(1, m), rep(0, m)) ) fit <- glm(label ~ poly(x, 4), data = df, family = binomial()) d_fit <- predict(fit, newdata = data.frame(x = xs), type = "response") plot_df <- rbind( data.frame(x = xs, D = d_star, source = "optimal D*(x)"), data.frame(x = xs, D = d_fit, source = "fitted logistic D") ) ggplot(plot_df, aes(x, D, color = source, linetype = source)) + geom_line(linewidth = 1) + ylim(0, 1) + labs( title = "Optimal GAN discriminator is the density ratio", x = "x", y = "D(x)", color = NULL, linetype = NULL ) + theme(legend.position = "bottom") ``` The closed form $D^*$ and the fitted classifier overlap closely. Where the data density dominates, $D^*$ is near one. Where the generator density dominates, $D^*$ is near zero. At the crossover point the two densities are equal and $D^* = 0.5$, which is the point a perfectly trained generator drives the whole curve toward. @tbl-generative-models-gan-jsd confirms numerically that the value at the optimal discriminator equals $-\log 4 + 2\,D_{\mathrm{JS}}$. ```{r tbl-generative-models-gan-jsd} # Numerically relate the value at D* to Jensen-Shannon divergence mix <- function(x) 0.5 * (p_data(x) + p_g(x)) kl_int <- function(p, q) { integrate(function(x) { pp <- p(x); qq <- q(x) ifelse(pp > 0, pp * log(pp / qq), 0) }, -20, 20)$value } jsd <- 0.5 * kl_int(p_data, mix) + 0.5 * kl_int(p_g, mix) value_at_opt <- -log(4) + 2 * jsd knitr::kable( data.frame( quantity = c("JS divergence", "value -log4 + 2*JSD", "-log 4"), value = round(c(jsd, value_at_opt, -log(4)), 4) ), col.names = c("Quantity", "Value"), caption = "GAN value at the optimal discriminator equals -log4 + 2 times JS divergence." ) ``` When the two distributions differ, the JS divergence is positive and the generator pays a penalty above the floor of $-\log 4$. The generator minimizes this penalty by matching $p_{\mathrm{data}}$, at which point the JS divergence is zero and the value reaches $-\log 4$. ## Comparison Now that each family has been seen on its own, it helps to line them up against each other. They trade off sample quality, training stability, whether they give a likelihood, and sampling speed, and no single one wins on every axis. Table @tbl-generative-models-comparison lays out the trade-offs side by side. ```{r tbl-generative-models-comparison} cmp <- data.frame( Property = c( "Sample quality", "Training stability", "Likelihood", "Sampling speed", "Latent space" ), VAE = c( "moderate, often blurry", "stable", "lower bound (ELBO)", "fast, one decoder pass", "smooth and structured" ), GAN = c( "sharp, high quality", "unstable, mode collapse", "none, implicit", "fast, one generator pass", "usable but less regular" ), Diffusion = c( "very high quality", "stable, simple MSE loss", "lower bound, tractable", "slow, many denoising steps", "no compact latent code" ) ) knitr::kable(cmp, caption = "VAE, GAN, and diffusion at a glance.") ``` In short: VAEs give a clean probabilistic latent space and stable training but softer samples; GANs give sharp samples at the cost of fragile training; and diffusion models give the best samples with stable training but pay for it at sampling time because every sample needs many sequential steps. ::: {.callout-tip title="When to use this"} Reach for a VAE when you want a smooth, interpretable latent space or a likelihood-style score (anomaly detection, representation learning). Reach for a GAN when sharp samples matter more than coverage or stability. Reach for a diffusion model when sample quality is the priority and you can afford slower generation. These trade-offs shift as the field moves, so treat the table as a snapshot rather than a verdict. ::: ## Further Reading - Kingma and Welling (2014), Auto-Encoding Variational Bayes. The original VAE paper with the ELBO and the reparameterization trick. - Goodfellow et al. (2014), Generative Adversarial Nets. Introduces the minimax game and proves the optimal discriminator and JS divergence result. - Sohl-Dickstein et al. (2015), Deep Unsupervised Learning using Nonequilibrium Thermodynamics. The first diffusion model for generative modeling. - Ho et al. (2020), Denoising Diffusion Probabilistic Models. The simplified noise prediction objective used in modern diffusion models.

42.1 Generative versus Discriminative

42.2 Variational Autoencoders

42.2.1 The problem with the integral

42.2.2 The evidence lower bound

42.2.3 Amortized inference and the \(\beta\)-VAE

42.2.4 The reparameterization trick

42.3 Generative Adversarial Networks

42.3.1 The optimal discriminator

42.3.2 Mode collapse and instability

42.3.3 Why JS divergence is a brittle objective

42.4 Diffusion Models

42.4.1 Forward noising process

42.4.2 Closed form for any step

42.4.3 Reverse denoising model

42.4.4 From the variational bound to the simple loss

42.4.5 Connection to score matching

42.5 Runnable Demonstration

42.5.1 Forward diffusion on a 2-D mixture

42.5.2 Verifying the forward posterior mean

42.5.3 Optimal GAN discriminator as a density ratio

42.6 Comparison

42.7 Further Reading