Advanced Data Analysis

Nguyen, Mike

61 Transductive Learning

61.1 Intuition

Most supervised learning quietly asks for more than the task at hand needs. You fit a model on training data, and the model is a rule: hand it any future input and it returns a prediction. That rule is reusable, which sounds like a virtue. But building a rule that works everywhere is harder than answering the specific question you were actually asked. Often the question is narrow: here is a fixed batch of records, please label these and only these.

Consider a few concrete shapes of that narrow question. A genomics lab has sequenced a particular cohort of patients and wants disease-risk calls for those patients, not a general clinical tool. A news desk has scraped the articles published today and wants each one tagged by topic, not a classifier for all articles ever written. A fraud team has a snapshot of this morning’s pending transactions and wants a verdict on each before the batch clears. In every case the set of points needing a prediction is known, in full, at training time, including their feature vectors. The labels are missing; the inputs are not.

Transductive learning is the discipline of exploiting exactly that situation. Instead of learning a general function $f$ and then evaluating it at the test points, it predicts the labels of the specific test points directly, using their features as part of the fitting process. The unlabeled test inputs are not held back, they are placed on the table alongside the labeled examples and allowed to shape the answer.

Key idea

Inductive learning answers “what is the rule for any future point?” Transductive learning answers “what are the labels of these particular points I already hold?” The second is a smaller question, and you should not solve a harder problem than the one you have.

That last sentence is Vapnik’s principle, and it is the philosophical engine of this chapter. Vapnik (1998) put it sharply: when solving a problem of interest, do not solve a more general problem as an intermediate step. Estimating a function over the whole input space, only to read off its values at a handful of points, is a detour. If you know the points in advance, aim straight at them.

This chapter develops that idea in three settings. First the formal distinction between induction and transduction, with the notation pinned down. Then transductive support vector machines, which push a margin through low-density regions defined partly by the test points. Then graph-based label propagation and label spreading, where labels diffuse from a few labeled seeds across a similarity graph built from labeled and unlabeled points together. We connect all of this to semi-supervised learning (Chapter 48), since transduction is one of its two faces, and we are honest about the limits: a purely transductive method gives you labels for the batch and nothing for genuinely new arrivals. We close with a runnable label-propagation demo on a two-moons dataset, with a figure of the propagated labels and an accuracy table.

61.2 Induction versus Transduction

We observe a labeled set $L = \{(x_1, y_1), \dots, (x_l, y_l)\}$ and an unlabeled set $U = \{x_{l+1}, \dots, x_{l+u}\}$, with inputs $x_i \in \mathbb{R}^p$ and labels $y_i \in \{1, \dots, C\}$. Write $n = l + u$ for the total. In the transductive setting the unlabeled points $U$ are exactly the test points: there is no separate, unseen test set, and the deliverable is the labeling $(\hat y_{l+1}, \dots, \hat y_{l+u})$ of $U$.

The two learning modes differ in what they produce.

Note

Inductive learning produces a function $\hat f : \mathbb{R}^p \to \{1, \dots, C\}$ defined on the whole input space, then sets $\hat y_i = \hat f(x_i)$ for any point of interest, seen or unseen. Transductive learning produces only the vector of labels on $U$. There is no $\hat f$ to carry forward.

The gap matters statistically. Inductive learning controls the risk of $\hat f$ over the entire distribution $p(x)$, a uniform guarantee across all possible inputs. Transduction needs only that the labeling be good on the finite set $U$, which is a weaker requirement and, by Vapnik’s argument, can be met with tighter sample-dependent bounds. Vapnik (1998) and Vapnik (2006) derive transductive generalization bounds that depend on the actual configuration of the $n$ points rather than on the capacity of a function class over all of $\mathbb{R}^p$.

A small but important subtlety: a transductive method can always be turned into an inductive one after the fact. Once $U$ has been labeled, the now fully-labeled set $L \cup U$ can train an ordinary supervised model, or you can attach an out-of-sample extension that maps a new point to a prediction by its similarity to the labeled batch. The reverse is cheap too: any inductive $\hat f$ yields transductive labels by evaluation at $U$. The distinction is therefore about what the method optimizes and what guarantees it carries, not an impassable wall.

61.2.1 Why the transductive bound can be tighter

The statistical advantage of transduction is concrete, not merely philosophical, and it is worth making precise. Fix the full pool of $n = l + u$ points and their inputs. In the transductive protocol the labeled set $L$ is a random subset of size $l$ drawn without replacement from this finite pool, and the test set is the complementary $u$ points. Because the pool is fixed, the relevant complexity is not the capacity of a function class over all of $\mathbb{R}^p$ but the number of distinct labelings the hypothesis class can realize on these $n$ points, a quantity bounded by the empirical growth function $\mathcal{N}(\mathcal{H}, n)$.

The driving probabilistic fact is a concentration inequality for sampling without replacement. Let $a_1, \dots, a_n$ be fixed numbers (here, the per-point losses of a fixed labeling) with mean $\bar a = \tfrac1n \sum_i a_i$, and let $\bar a_L = \tfrac1l \sum_{i \in L} a_i$ be the average over a random size-$l$ subset. Hoeffding’s inequality for sampling without replacement gives, for $a_i \in [0,1]$,

\[ \Pr\!\left( \bar a_L - \bar a \ge \epsilon \right) \le \exp\!\left( -\frac{2\, l\, n}{u}\, \epsilon^2 \right), \]

which is sharper than the with-replacement bound $\exp(-2 l \epsilon^2)$ by the finite-population correction factor $1/(1 - l/n) = n/u > 1$, giving the larger exponent $2 l n / u$ and hence a tighter bound (reflecting the reduced variance of sampling without replacement). The gap between the labeled (empirical) error and the unlabeled (test) error of any fixed labeling is therefore controlled by $n/(l u) = (1/l)/(1 - l/n)$ rather than $1/l$. Taking a union bound over the $\mathcal{N}(\mathcal{H}, n)$ distinct labelings yields a bound of the schematic form

\[ R_u(h) \;\le\; R_l(h) \;+\; O\!\left( \sqrt{ \frac{ \log \mathcal{N}(\mathcal{H}, n) }{ l\,(1 - l/n) } } \right) \;=\; R_l(h) \;+\; O\!\left( \sqrt{ \frac{ n\,\log \mathcal{N}(\mathcal{H}, n) }{ l\, u } } \right), \qquad \tag{61.1}\]

where $R_l$ and $R_u$ are the error rates on the labeled and unlabeled portions (Vapnik 1998; Vapnik 2006). Two features distinguish Equation 61.1 from a standard inductive bound. First, $\log \mathcal{N}(\mathcal{H}, n)$ is data-dependent: it counts only the labelings achievable on the actual configuration of points, which can be far smaller than a worst-case VC term when the points cluster. Second, the finite-population factor $1/(1 - l/n)$ is close to $1$ in the typical transductive regime $l \ll n$ (so $u \approx n$), where the bound is at its tightest; it grows as $l \to n$ because then only a tiny unlabeled set $u \to 0$ remains to be predicted, which is intrinsically high-variance. This is the formal content of “do not solve a harder problem than the one you have”: the inductive bound must hold uniformly over $p(x)$, while Equation 61.1 only governs the $n$ points on the table.

Intuition

Picture a connect-the-dots puzzle. Induction is drawing the full underlying curve so you could place a dot anywhere. Transduction is filling in only the numbered dots you were given. If all you need are those dots, drawing the entire curve first is wasted effort, and worse, you might draw the curve badly in regions you were never asked about and let that error leak back into the dots you cared about.

61.3 Transductive Support Vector Machines

The first concrete transductive method extends the support vector machine. Recall the standard (inductive) soft-margin SVM for a linear decision function $f(x) = w^\top x + b$ with labels $y_i \in \{-1, +1\}$. It minimizes

\[ \min_{w, b, \xi}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0 , \]

where $\xi_i$ are slack variables and $C$ trades margin width against training error. This uses only the labeled points $L$. (See the support vector machines chapter, Chapter 19, for the full development.)

The transductive SVM (TSVM) of Vapnik (1998), made practical by Joachims (1999), brings the unlabeled test points into the objective. The idea rests on the cluster assumption: the decision boundary should fall in a low-density region of $p(x)$, so it should not slice through a dense clump of unlabeled points. We therefore treat the unknown labels $y^*_{l+1}, \dots, y^*_{l+u}$ of $U$ as additional optimization variables and ask the margin to be large for the unlabeled points too:

\[ \min_{w, b, \xi, \xi^*, \, y^*_{j}}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{l} \xi_i + C^* \sum_{j=l+1}^{l+u} \xi^*_j \]

subject to, for the labeled points,

\[ y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \qquad i = 1, \dots, l , \]

and for the unlabeled points with their assigned labels $y^*_j \in \{-1, +1\}$,

\[ y^*_j (w^\top x_j + b) \ge 1 - \xi^*_j, \quad \xi^*_j \ge 0, \qquad j = l+1, \dots, l+u . \]

The new term $C^* \sum_j \xi^*_j$ penalizes unlabeled points that land inside the margin, so the optimizer prefers a hyperplane that keeps the whole unlabeled cloud clear of the boundary. The penalty weight $C^*$ governs how much the test points are allowed to bend the solution.

Intuition

A plain SVM positions a fence using only the flagged animals on each side. A TSVM also sees the unflagged herd milling around and refuses to run the fence straight through the middle of a tight group, because a good boundary should sit in the empty corridor between herds, not split one in two.

61.3.1 The effective loss and low-density separation

The discrete labels can be eliminated to expose what the TSVM really penalizes on the unlabeled points. For a single unlabeled point $x_j$ the inner optimization over its label is

\[ \min_{y^*_j \in \{-1,+1\}}\; \min_{\xi^*_j \ge 0}\; \xi^*_j \;\;\text{s.t.}\;\; y^*_j (w^\top x_j + b) \ge 1 - \xi^*_j . \]

For a fixed sign of $y^*_j$ the optimal slack is the hinge $\xi^*_j = \max(0,\, 1 - y^*_j f(x_j))$ with $f(x) = w^\top x + b$. Minimizing over the two label choices gives the closed form

\[ \min_{y^*_j} \max\!\big(0,\, 1 - y^*_j f(x_j)\big) \;=\; \max\!\big(0,\, 1 - |f(x_j)|\big) , \qquad \tag{61.2}\]

since the better of the two labels is $y^*_j = \operatorname{sign} f(x_j)$. The unlabeled term in the TSVM objective is therefore $C^* \sum_j \max(0, 1 - |f(x_j)|)$. This function, often called the hat loss or symmetric hinge, is plotted as a function of the signed output $f(x_j)$ as an inverted “$\wedge$”: it is zero when $|f(x_j)| \ge 1$ (the point sits outside the margin, either side) and grows as the point approaches the boundary $f = 0$.

Note

Equation 61.2 makes the cluster assumption operational. The hat loss is large exactly for unlabeled points near the decision boundary, so minimizing it pushes the boundary away from where unlabeled points are dense, that is, into a low-density region of $p(x)$. This is the precise sense in which a TSVM implements low-density separation. The non-convexity of the whole problem is now visible too: the hat loss is non-convex in $f(x_j)$ (it decreases then increases), so summing it over unlabeled points creates the many local minima that plague TSVM training.

Two further structural points. First, without a constraint the optimizer can make the unlabeled term vanish trivially by sending all unlabeled points to one side (giving them all the same label and a large margin), so practical TSVMs impose a balancing constraint

\[ \frac{1}{u} \sum_{j=l+1}^{l+u} \max\!\big(0,\, \operatorname{sign} f(x_j)\big) \;=\; r , \]

forcing the predicted positive fraction on $U$ to match an assumed ratio $r$ (often the labeled-set positive fraction). Second, Equation 61.2 connects the TSVM to entropy regularization: penalizing $\max(0, 1 - |f|)$ discourages confident-margin-violating outputs in the same spirit that an entropy penalty $-\sum_c p_c \log p_c$ on a probabilistic model pushes posteriors toward $0/1$ in low-density regions (Grandvalet and Bengio 2005). Both are surrogates for the same prior: the boundary belongs where data is sparse.

The catch is that the labels $y^*_j$ are discrete unknowns, so the problem is no longer a convex quadratic program; it is a mixed-integer, non-convex optimization, NP-hard in general. Joachims (1999) gives a label-switching heuristic ($\mathrm{SVM}^{light}$): start from the labels an inductive SVM would assign, anneal $C^*$ upward from a small value, and iteratively flip pairs of unlabeled labels whenever doing so lowers the objective. Other approaches relax the integer labels continuously (Chapelle, Sindhwani, and Keerthi, 2008). All of them inherit sensitivity to initialization and to the assumed class balance on $U$.

Warning

TSVMs are notoriously fiddly. The non-convex objective has many local minima, and if the cluster assumption fails (classes overlap heavily, or the test cloud is not cleanly separable) the unlabeled term can drag the boundary into a worse place than the labeled-only SVM. Treat $C^*$ and the assumed positive fraction on $U$ as hyperparameters to validate, not constants to trust.

We do not run a TSVM in the demo below because the off-the-shelf R tooling for it is thin and the optimization is finicky; instead we use graph-based propagation, which is convex, transparent, and easy to implement in base R. The TSVM remains worth knowing as the canonical low-density-separation transductive method.

61.4 Graph-Based Label Propagation

Graph-based methods take a different route to the same goal. They lean on the manifold assumption: the data lie near a low-dimensional manifold, and labels vary smoothly along it, so two points close on the manifold should share a label even if other measures of distance disagree. We encode “closeness” in a graph and let labels diffuse along its edges.

61.4.1 Building the graph

Put all $n = l + u$ points (labeled and unlabeled together) as nodes of a weighted graph. The edge weight between points $i$ and $j$ measures similarity, commonly a Gaussian (radial basis function) kernel on Euclidean distance:

\[ W_{ij} = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2 \sigma^2} \right), \qquad W_{ii} = 0 , \]

where the bandwidth $\sigma$ sets the scale of “near.” Large $\sigma$ connects almost everything; small $\sigma$ keeps only the closest neighbors linked. In practice one often sparsifies $W$ to a $k$-nearest-neighbor graph, but the dense kernel is enough to see the mechanism. Define the diagonal degree matrix $D$ with $D_{ii} = \sum_j W_{ij}$.

61.4.2 The label-propagation algorithm

Zhu and Ghahramani (2002) propose the following. Encode labels as an $n \times C$ matrix $F$, where row $i$ is a distribution over the $C$ classes; for labeled points fix that row to the one-hot indicator of the true class. Form the row-normalized transition matrix

\[ P = D^{-1} W, \qquad P_{ij} = \frac{W_{ij}}{\sum_k W_{ik}} , \]

so $P_{ij}$ is the probability of stepping from $i$ to $j$ in one move of a random walk on the graph. Then iterate two steps to convergence:

Propagate: $F \leftarrow P F$. Each node averages its neighbors’ current label distributions, weighted by similarity.
Clamp: reset the rows of $F$ for labeled points back to their known one-hot labels.

Clamping is what makes the labeled seeds act as permanent sources: every sweep lets their influence flow one more hop into the unlabeled mass, and resetting them prevents that influence from being diluted away. Iterating drives $F$ to a fixed point. The final label of an unlabeled point is $\hat y_i = \arg\max_c F_{ic}$.

There is a clean closed form. Partition $W$, $P$, and $F$ into labeled ($\ell$) and unlabeled ($u$) blocks,

\[ P = \begin{pmatrix} P_{\ell\ell} & P_{\ell u} \\ P_{u\ell} & P_{uu} \end{pmatrix}, \qquad F = \begin{pmatrix} F_\ell \\ F_u \end{pmatrix}, \]

with $F_\ell$ held fixed at the known labels $Y_\ell$. At the fixed point the unlabeled rows satisfy $F_u = P_{u\ell} Y_\ell + P_{uu} F_u$, which solves to

\[ F_u = (I - P_{uu})^{-1} P_{u\ell}\, Y_\ell . \]

This is the harmonic solution of Zhu, Ghahramani, and Lafferty (2003): the propagated label scores are a harmonic function on the graph, minimizing the energy $\tfrac{1}{2} \sum_{i,j} W_{ij} (F_i - F_j)^2$ subject to matching the labeled values. Smoothness over the graph and fixed seeds together pin down a unique answer.

61.4.3 Deriving the harmonic solution

It is worth deriving the closed form from the energy directly, because the same machinery (the graph Laplacian) recurs throughout graph-based learning. Work with one label column $f \in \mathbb{R}^n$ at a time (the columns of $F$ separate). The smoothness energy is

\[ E(f) \;=\; \frac{1}{2} \sum_{i,j} W_{ij}\,(f_i - f_j)^2 . \]

Expanding the square and using $D_{ii} = \sum_j W_{ij}$,

\[ E(f) = \frac{1}{2}\sum_{i,j} W_{ij}\,(f_i^2 - 2 f_i f_j + f_j^2) = \sum_i D_{ii} f_i^2 - \sum_{i,j} W_{ij} f_i f_j = f^\top (D - W)\, f = f^\top L f , \]

where $L = D - W$ is the (unnormalized) graph Laplacian. The energy is a quadratic form in the Laplacian, and since $E(f) \ge 0$ for all $f$, $L$ is positive semidefinite. Minimizing $E$ subject to clamping the labeled entries $f_\ell = y_\ell$ is a constrained quadratic problem. Partition $L$ in the labeled/unlabeled blocks,

\[ L = \begin{pmatrix} L_{\ell\ell} & L_{\ell u} \\ L_{u\ell} & L_{uu} \end{pmatrix}, \qquad f = \begin{pmatrix} y_\ell \\ f_u \end{pmatrix} . \]

Only $f_u$ is free, so we set the gradient of $E$ with respect to $f_u$ to zero. Writing $E = f^\top L f$ and differentiating the block form,

\[ \frac{\partial E}{\partial f_u} = 2\,\big(L_{u\ell}\, y_\ell + L_{uu}\, f_u\big) = 0 \quad\Longrightarrow\quad f_u = -\,L_{uu}^{-1} L_{u\ell}\, y_\ell . \qquad \tag{61.3}\]

The minimizer exists and is unique whenever $L_{uu}$ is invertible, which holds exactly when every connected component of the graph contains at least one labeled point (otherwise an unlabeled component is free to take any constant value at zero energy, and the solution is undetermined). This is the algebraic reason disconnected, seed-free components are a silent failure mode.

To see that Equation 61.3 is the same fixed point as propagate-and-clamp, recall $P = D^{-1} W$ so that $I - P_{uu} = D_u^{-1}(D_u - W_{uu}) = D_u^{-1} L_{uu}$ and $P_{u\ell} = D_u^{-1} W_{u\ell} = -D_u^{-1} L_{u\ell}$ (using $L_{u\ell} = -W_{u\ell}$ since the off-diagonal labeled/unlabeled block of $D$ is zero). Hence

\[ (I - P_{uu})^{-1} P_{u\ell} = (D_u^{-1} L_{uu})^{-1}(-D_u^{-1} L_{u\ell}) = -L_{uu}^{-1} L_{u\ell} , \]

so $(I - P_{uu})^{-1} P_{u\ell} y_\ell = -L_{uu}^{-1} L_{u\ell} y_\ell = f_u$. The iterative algorithm and the energy minimizer coincide.

The name “harmonic” is literal. At the optimum each unlabeled coordinate satisfies the discrete mean-value property: from $L_{uu} f_u + L_{u\ell} y_\ell = 0$, the $i$-th unlabeled equation reads $D_{ii} f_i = \sum_j W_{ij} f_j$, that is,

\[ f_i = \frac{1}{D_{ii}} \sum_{j} W_{ij}\, f_j = \sum_j P_{ij}\, f_j , \]

so every unlabeled value is the weighted average of its neighbors. This is the discrete analogue of $\nabla^2 f = 0$, and it also gives a probabilistic reading: $f_i$ equals the probability that a random walk started at $i$, with transition matrix $P$, reaches a positively-labeled seed before a negatively-labeled one (the absorbing random walk interpretation of Zhu, Ghahramani, and Lafferty 2003).

61.4.4 Convergence of propagate-and-clamp

The iteration $f_u \leftarrow P_{uu} f_u + P_{u\ell} y_\ell$ converges geometrically. Its iteration matrix is $P_{uu}$, a submatrix of the stochastic matrix $P$ with some row sums strictly less than one (any unlabeled node adjacent to a seed leaks probability mass into the clamped block). For a connected graph with at least one seed, the spectral radius satisfies $\rho(P_{uu}) < 1$, so the affine map is a contraction and

\[ \lVert f_u^{(t)} - f_u^\star \rVert \le \rho(P_{uu})^{\,t}\, \lVert f_u^{(0)} - f_u^\star \rVert . \]

The error contracts by a factor $\rho(P_{uu})$ per sweep, so reaching tolerance $\varepsilon$ takes $O\!\big(\log(1/\varepsilon) / \log(1/\rho(P_{uu}))\big)$ sweeps. The gap $1 - \rho(P_{uu})$ shrinks when seeds are sparse or poorly connected, which is why few, scattered labels both slow convergence and weaken the solution.

Intuition

Drop ink (the labeled colors) into a few cups of a tray of connected water cells; thicker tubes (larger $W_{ij}$) between cells let color flow faster. Keep re-inking the source cups every round so they never fade. After things settle, each cell’s tint is a weighted blend of the sources reachable through the tubing, and the dominant tint is its predicted label.

61.4.5 Label spreading

Label propagation clamps the labeled points hard, which assumes the given labels are exactly right. Zhou et al. (2004) propose label spreading, a softer variant that tolerates label noise and uses a symmetric normalization. Define the symmetrically normalized weight matrix

\[ S = D^{-1/2} W D^{-1/2} , \]

and iterate

\[ F \leftarrow \alpha S F + (1 - \alpha) Y , \]

where $Y$ is the initial label matrix (one-hot on labeled rows, zero on unlabeled rows) and $\alpha \in (0, 1)$ balances trusting the graph ($\alpha$ near $1$) against trusting the original labels ($\alpha$ near $0$). This iteration is a contraction and converges to the closed form

\[ F^* = (1 - \alpha)\,(I - \alpha S)^{-1} Y . \]

61.4.6 Deriving label spreading from a regularized objective

Both the closed form and the convergence proof fall out of a single regularization functional. Zhou et al. (2004) define, for each label column $f$,

\[ Q(f) = \underbrace{\frac{1}{2} \sum_{i,j} W_{ij} \left( \frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}} \right)^2}_{\text{smoothness}} \;+\; \underbrace{\frac{\mu}{2} \sum_i (f_i - y_i)^2}_{\text{fitting}} , \qquad \tag{61.4}\]

where $y$ is the initial label vector and $\mu > 0$ trades smoothness against fidelity to the given labels. The smoothness term is the quadratic form of the normalized Laplacian: expanding it gives $f^\top (I - S) f$ with $S = D^{-1/2} W D^{-1/2}$, because the degree normalization $f_i / \sqrt{D_{ii}}$ converts $D - W$ into $I - S$. So

\[ Q(f) = f^\top (I - S) f + \frac{\mu}{2}\,\lVert f - y \rVert^2 . \]

Setting $\nabla Q(f) = 0$,

\[ 2(I - S) f + \mu (f - y) = 0 \;\Longrightarrow\; f^\star = \big(I - S + \tfrac{\mu}{2} I\big)^{-1} \tfrac{\mu}{2}\, y . \]

Define $\alpha = \dfrac{1}{1 + \mu/2}$, equivalently $\mu/2 = (1-\alpha)/\alpha$. Multiplying numerator and denominator by $\alpha$,

\[ f^\star = \alpha\big(\alpha(I - S) + (1-\alpha) I\big)^{-1} \tfrac{1-\alpha}{\alpha}\, y = (1 - \alpha)\,(I - \alpha S)^{-1} y , \]

which is exactly the stated closed form. The parameter $\alpha \in (0,1)$ is thus a reparameterization of the smoothness/fidelity weight $\mu$: $\alpha \to 1$ means $\mu \to 0$ (trust the graph) and $\alpha \to 0$ means $\mu \to \infty$ (trust the labels, recovering hard clamping in the limit).

The contraction claim is now immediate. The iteration $F \leftarrow \alpha S F + (1-\alpha) Y$ has iteration matrix $\alpha S$. The symmetric matrix $S = D^{-1/2} W D^{-1/2}$ is similar to the stochastic matrix $P = D^{-1} W$ (since $S = D^{1/2} P D^{-1/2}$), so they share eigenvalues, all lying in $[-1, 1]$; hence $\rho(S) \le 1$ and $\rho(\alpha S) = \alpha\, \rho(S) \le \alpha < 1$. The map is a contraction with factor at most $\alpha$, so the Neumann series converges,

\[ F^{(t)} = (\alpha S)^t F^{(0)} + (1-\alpha) \sum_{k=0}^{t-1} (\alpha S)^k Y \;\xrightarrow[t\to\infty]{}\; (1-\alpha)\sum_{k=0}^{\infty} (\alpha S)^k Y = (1-\alpha)(I - \alpha S)^{-1} Y , \]

independently of the initialization $F^{(0)}$. The series form also exposes the mechanism: the contribution of $Y$ propagated over $k$ hops is geometrically discounted by $\alpha^k$, so label spreading is a soft, distance-decayed diffusion in contrast to label propagation’s hard absorbing walk.

Note

The two methods differ in two knobs. Propagation uses the random-walk normalization $P = D^{-1}W$ and clamps seeds exactly; spreading uses the symmetric normalization $S = D^{-1/2} W D^{-1/2}$ and only pulls toward seeds with strength $1 - \alpha$. When labels are clean and few, propagation is the natural default; when some seed labels may be wrong, spreading’s soft clamping is safer.

Both methods are inherently transductive: they return labels for the $n$ nodes that were in the graph and nothing else. A point that arrives after the graph is built has no row in $F$. To label it you must either rebuild the graph with the new point included and re-solve (back to transduction on a larger batch) or add an out-of-sample extension, for example a Nadaraya-Watson average of neighbor labels (Chapter 4) using the same kernel.

61.5 Connection to Semi-Supervised Learning

Transduction and semi-supervised learning (SSL) overlap heavily but are not identical, and keeping them straight avoids confusion.

SSL (Chapter 48) is defined by its data: a little labeled, a lot unlabeled. Transduction versus induction is a distinction about the goal: label this fixed batch, or learn a reusable rule. Every transductive method is doing semi-supervised learning, since it uses unlabeled points during fitting. But not every SSL method is transductive: self-training and co-training (covered in the SSL chapter) produce a reusable inductive classifier, while label propagation and TSVMs produce a batch labeling.

Key idea

SSL describes the data regime (few labels, many unlabeled). Transduction describes the target (a fixed set of points to label, no model for the future). They co-occur because the cheapest way to exploit unlabeled data is often to label it directly, which is exactly transduction. The two assumptions that make SSL work, the cluster assumption and the manifold assumption, are the same two assumptions that justify TSVMs and graph propagation respectively.

The practical upshot: if your unlabeled data is the test set and you will never see those exact points again as a population worth modeling, reach for a transductive method and skip the detour of fitting a general $\hat f$. If you need to score a stream of future points, you want induction, and SSL’s inductive methods (or a transductive method followed by retraining a supervised model on the filled-in labels) are the right tools.

61.6 Runnable Demo: Label Propagation on Two Moons

We now implement label propagation from scratch in base R on a two-moons dataset, the standard stress test for the manifold assumption: two interleaving crescents that no linear boundary separates but that are obvious to a human eye. We reveal only a couple of labeled seeds per moon and let the algorithm propagate labels to every remaining point, treating those points as the known transductive test set.

The first chunk generates the data and reveals four seeds (two per class).

Show code

set.seed(1301)

# Generate a two-moons dataset: two interleaving half-circles.
make_two_moons <- function(n_per, noise = 0.12) {
  t_top <- seq(0, pi, length.out = n_per)
  x_top <- cbind(cos(t_top), sin(t_top))
  t_bot <- seq(0, pi, length.out = n_per)
  x_bot <- cbind(1 - cos(t_bot), -sin(t_bot) - 0.5)
  X <- rbind(x_top, x_bot)
  X <- X + matrix(rnorm(length(X), sd = noise), ncol = 2)
  y <- c(rep(1L, n_per), rep(2L, n_per))
  list(X = X, y = y)
}

n_per <- 100
dat <- make_two_moons(n_per)
X <- dat$X
y_true <- dat$y
n <- nrow(X)

# Reveal only a few labeled seeds; everything else is the transductive test set.
seeds_per_class <- 2
seed_idx <- c(
  sample(which(y_true == 1L), seeds_per_class),
  sample(which(y_true == 2L), seeds_per_class)
)
is_labeled <- rep(FALSE, n)
is_labeled[seed_idx] <- TRUE

cat("Total points:", n,
    "| labeled seeds:", sum(is_labeled),
    "| unlabeled (transductive test):", sum(!is_labeled), "\n")
#> Total points: 200 | labeled seeds: 4 | unlabeled (transductive test): 196

The second chunk builds the Gaussian-kernel graph, runs propagate-and-clamp to convergence, and reads off the transductive labels.

Show code

# Gaussian (RBF) similarity graph over ALL points, labeled and unlabeled.
sq_dist <- as.matrix(dist(X))^2
sigma <- 0.30
W <- exp(-sq_dist / (2 * sigma^2))
diag(W) <- 0

# Row-normalized random-walk transition matrix P = D^{-1} W.
D <- rowSums(W)
P <- W / D

# Label matrix F: one row per point, one column per class (C = 2).
C <- 2L
F_mat <- matrix(0, nrow = n, ncol = C)
for (i in which(is_labeled)) F_mat[i, y_true[i]] <- 1

# Save the clamped seed rows so we can reset them each sweep.
Y_seed <- F_mat[is_labeled, , drop = FALSE]

# Iterate: propagate (F <- P F) then clamp the labeled rows.
for (iter in 1:1000) {
  F_new <- P %*% F_mat
  F_new[is_labeled, ] <- Y_seed         # clamp seeds back to known labels
  if (max(abs(F_new - F_mat)) < 1e-9) {
    F_mat <- F_new
    break
  }
  F_mat <- F_new
}

# Transductive prediction: argmax over classes for every point.
y_prop <- max.col(F_mat, ties.method = "first")

# Accuracy on the unlabeled (test) points only.
test_idx <- which(!is_labeled)
acc_prop <- mean(y_prop[test_idx] == y_true[test_idx])
cat("Converged in", iter, "iterations.",
    "Transductive accuracy on unlabeled points:",
    round(acc_prop, 3), "\n")
#> Converged in 1000 iterations. Transductive accuracy on unlabeled points: 1

We can confirm numerically that the iterative propagate-and-clamp result equals the harmonic closed form Equation 61.3 derived above, $F_u = -L_{uu}^{-1} L_{u\ell} Y_\ell$ with $L = D - W$.

Show code

# Verify: iterative fixed point == harmonic closed form via the Laplacian.
L_mat <- diag(D) - W                      # unnormalized graph Laplacian
u_idx <- which(!is_labeled)
l_idx <- which(is_labeled)
F_u_closed <- -solve(L_mat[u_idx, u_idx]) %*% L_mat[u_idx, l_idx] %*% Y_seed
cat("max abs difference (iteration vs harmonic closed form):",
    format(max(abs(F_mat[u_idx, ] - F_u_closed)), digits = 3), "\n")
#> max abs difference (iteration vs harmonic closed form): 1.93e-06

Note

The bandwidth $\sigma$ is the one knob that matters most here. Too small and the graph fragments into disconnected pieces that the seeds cannot reach (those points keep their initial all-zero rows and get labeled arbitrarily); too large and the two moons fuse into one blob and labels bleed across the gap. We use $\sigma = 0.30$, which connects each crescent internally while keeping the crescents weakly linked.

61.6.1 Baseline comparison

To show what the unlabeled structure buys us, we compare against a purely supervised baseline that sees only the four seeds: a 1-nearest-neighbor classifier (Chapter 17) trained on the seeds and applied to the same test points. With so few labels and a curved boundary, the geometry-blind baseline should struggle where propagation succeeds.

Show code

# 1-NN trained ONLY on the labeled seeds, predicting the unlabeled test points.
nn1 <- function(train_X, train_y, test_X) {
  apply(test_X, 1, function(q) {
    d <- colSums((t(train_X) - q)^2)
    train_y[which.min(d)]
  })
}

y_knn <- nn1(X[is_labeled, , drop = FALSE], y_true[is_labeled], X[test_idx, , drop = FALSE])
acc_knn <- mean(y_knn == y_true[test_idx])

results <- data.frame(
  Method = c("1-NN on seeds only (inductive baseline)",
             "Label propagation (transductive)"),
  `Uses unlabeled points` = c("No", "Yes"),
  `Test accuracy` = c(round(acc_knn, 3), round(acc_prop, 3)),
  check.names = FALSE
)
results
#>                                    Method Uses unlabeled points Test accuracy
#> 1 1-NN on seeds only (inductive baseline)                    No         0.939
#> 2        Label propagation (transductive)                   Yes         1.000

Table 61.1 reports the accuracy of both methods on the unlabeled test points.

Show code

knitr::kable(
  results,
  caption = "Test-set accuracy on the unlabeled two-moons points: a supervised 1-nearest-neighbor classifier using only the four labeled seeds, versus transductive label propagation that also uses the geometry of the unlabeled points."
)

Table 61.1: Test-set accuracy on the unlabeled two-moons points: a supervised 1-nearest-neighbor classifier using only the four labeled seeds, versus transductive label propagation that also uses the geometry of the unlabeled points.

Method	Uses unlabeled points	Test accuracy
1-NN on seeds only (inductive baseline)	No	0.939
Label propagation (transductive)	Yes	1.000

Figure 61.1 shows the outcome. The left panel is the input the algorithm sees: a sea of grey unlabeled points with four colored seeds. The right panel is the transductive result: every point colored by its propagated label, with the seeds ringed in black. Propagation follows each crescent around its curve, which a seed-only classifier cannot do.

Show code

op <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

cols <- c("#D55E00", "#0072B2")

# Left: what the algorithm starts with.
plot(X, col = "grey75", pch = 16, cex = 0.7,
     xlab = expression(x[1]), ylab = expression(x[2]),
     main = "Input: 4 seeds among unlabeled points")
points(X[is_labeled, , drop = FALSE], col = cols[y_true[is_labeled]],
       pch = 16, cex = 1.8)
points(X[is_labeled, , drop = FALSE], col = "black", pch = 1, cex = 2.0, lwd = 1.5)

# Right: the propagated transductive labeling.
plot(X, col = cols[y_prop], pch = 16, cex = 0.8,
     xlab = expression(x[1]), ylab = expression(x[2]),
     main = "Output: propagated transductive labels")
points(X[is_labeled, , drop = FALSE], col = "black", pch = 1, cex = 2.0, lwd = 1.5)

par(op)

Figure 61.1: Label propagation on two moons. Left: the transductive input, four labeled seeds (colored, ringed) among many unlabeled points (grey). Right: propagated labels for every point, colored by predicted class with seeds ringed in black. Labels flow along each crescent, recovering the curved structure from only four seeds.

The lesson the figure makes concrete: with four labels and no geometry, you cannot recover two interleaving crescents, but with four labels plus the similarity graph over the test points themselves, you can. That is transduction earning its keep.

61.7 Practical Guidance and Pitfalls

A few things decide whether transduction helps or quietly hurts.

When to use this

Reach for a transductive method when (1) the points you must label are fixed and known at training time, (2) labels are scarce while unlabeled inputs are plentiful, and (3) you have reason to believe the cluster or manifold assumption holds, that is, the classes form coherent groups or lie along smooth low-dimensional structure. The two-moons demo satisfies all three.

The single biggest failure mode is the assumption itself. If the classes overlap heavily, or the manifold structure does not align with the labels, the unlabeled points pull predictions toward structure in $p(x)$ that has nothing to do with $y$. Then label propagation smears labels across the true boundary and a TSVM parks its hyperplane in the wrong corridor. Always sanity-check against a supervised baseline trained on the same labeled points; if the transductive method does not beat it, the assumption is probably violated.

Graph construction is the part that most repays care. The bandwidth $\sigma$ (or the neighbor count $k$ in a $k$-NN graph) sets the scale of similarity, and the result can swing from excellent to useless across a factor of two in $\sigma$. A common heuristic sets $\sigma$ to the median pairwise distance among nearby points, or tunes it by cross-validation on whatever labels you can spare. Disconnected components are a silent trap: if a chunk of unlabeled points has no path to any seed, its labels are undetermined and the code will hand back whatever the argmax of an all-zero row happens to be. Check connectivity, and make sure every class has at least one seed in each region it occupies.

Class balance matters too. Label propagation has no notion of prior class frequencies, so if the seeds are wildly unrepresentative the propagated proportions can drift. Some implementations add a class-mass normalization step (Zhu, Ghahramani, and Lafferty, 2003) that rescales the columns of $F$ to match a target class distribution; if you know roughly how the batch splits across classes, use it.

Warning

The defining limitation is structural, not a tuning issue. A purely transductive method gives you labels for the batch in hand and no model for genuinely new points. If tomorrow’s data arrives as a fresh batch, you must rebuild the graph (or rerun the TSVM) on the new points, or you must add an explicit out-of-sample extension or retrain a supervised model on the filled-in labels. Do not promise stakeholders a deployable scoring service from a transductive fit alone.

Finally, watch the cost. The harmonic closed form $F_u = (I - P_{uu})^{-1} P_{u\ell} Y_\ell$ inverts a $u \times u$ matrix, which is $O(u^3)$ time and $O(u^2)$ memory, infeasible for large $u$. Building the dense graph is itself $O(n^2 p)$ for the pairwise distances. The iterative propagate-and-clamp loop is far cheaper per sweep: on a sparse $k$-NN graph with $m = O(kn)$ edges each sweep is a sparse matvec costing $O(m C)$, and from Section 61.4.4 the number of sweeps to tolerance $\varepsilon$ is $O(\log(1/\varepsilon) / (1 - \rho(P_{uu})))$, so total cost is roughly $O\!\big(k n C \log(1/\varepsilon) / (1-\rho(P_{uu}))\big)$. This is what scales to large unlabeled batches. For label spreading the same analysis applies with iteration matrix $\alpha S$ and contraction factor $\alpha$, giving a guaranteed $O(\log(1/\varepsilon)/\log(1/\alpha))$ sweeps regardless of graph connectivity.

61.7.1 Class-mass normalization

Because the harmonic solution ignores class priors, the raw argmax of $F_u$ can be badly miscalibrated when seeds are unrepresentative of the true class proportions. Class-mass normalization (Zhu, Ghahramani, and Lafferty 2003) corrects this. Let $q_c$ be the target mass for class $c$ (for example the labeled-set frequency, or a known batch proportion), and let the current column mass be $\sum_{i \in U} F_{ic}$. Rescale each column,

\[ \tilde F_{ic} = F_{ic}\cdot \frac{q_c}{\sum_{j \in U} F_{jc}} , \]

and predict $\hat y_i = \arg\max_c \tilde F_{ic}$. This reweights the soft scores so the predicted class frequencies match $q$, which typically improves accuracy when classes are imbalanced or seeds are skewed, at the cost of requiring an estimate of $q$.

61.7.2 Bias-variance and the role of the assumptions

Graph-based transduction trades variance for bias in a way controlled by the bandwidth. A small $\sigma$ (or small $k$) yields a near-diagonal graph: predictions depend on very local neighborhoods, so they are low-bias but high-variance, sensitive to individual noisy points and prone to disconnection. A large $\sigma$ over-smooths: labels are averaged over wide regions, lowering variance but introducing bias that bleeds labels across the true boundary. The optimum sits where the graph respects the manifold, connecting within classes while staying sparse across the low-density gap. Crucially, the bias floor is set by the manifold and cluster assumptions themselves: if labels do not vary smoothly along $p(x)$, no choice of $\sigma$ removes the bias, and the method cannot beat a supervised baseline. This is why the diagnostic of comparing against a same-seeds supervised model is not optional, it is the direct test of whether the modeling assumption that justifies transduction actually holds on your data.

61.8 Further Reading

Vapnik, V. (1998). Statistical Learning Theory. The original statement of the transductive setting and the principle of not solving a harder problem than required.
Vapnik, V. (2006). Estimation of Dependences Based on Empirical Data (2nd ed.). Extended treatment of transduction and its generalization bounds.
Joachims, T. (1999). “Transductive Inference for Text Classification using Support Vector Machines.” The practical TSVM training algorithm.
Chapelle, O., Sindhwani, V., and Keerthi, S. S. (2008). “Optimization Techniques for Semi-Supervised Support Vector Machines.” Survey and comparison of TSVM optimizers.
Zhu, X., and Ghahramani, Z. (2002). “Learning from Labeled and Unlabeled Data with Label Propagation.” The propagate-and-clamp algorithm used in the demo.
Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). “Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions.” The harmonic-function view and class-mass normalization.
Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). “Learning with Local and Global Consistency.” The label-spreading algorithm with symmetric normalization.
Chapelle, O., Schölkopf, B., and Zien, A. (eds.) (2006). Semi-Supervised Learning. Standard reference connecting transduction to the broader SSL landscape.

# Transductive Learning {#sec-transductive-learning} ```{r} #| include: false source("_common.R") ``` ## Intuition Most supervised learning quietly asks for more than the task at hand needs. You fit a model on training data, and the model is a *rule*: hand it any future input and it returns a prediction. That rule is reusable, which sounds like a virtue. But building a rule that works everywhere is harder than answering the specific question you were actually asked. Often the question is narrow: here is a fixed batch of records, please label these and only these. Consider a few concrete shapes of that narrow question. A genomics lab has sequenced a particular cohort of patients and wants disease-risk calls for those patients, not a general clinical tool. A news desk has scraped the articles published today and wants each one tagged by topic, not a classifier for all articles ever written. A fraud team has a snapshot of this morning's pending transactions and wants a verdict on each before the batch clears. In every case the set of points needing a prediction is known, in full, at training time, including their feature vectors. The labels are missing; the inputs are not. Transductive learning is the discipline of exploiting exactly that situation. Instead of learning a general function $f$ and then evaluating it at the test points, it predicts the labels of the specific test points directly, using their features as part of the fitting process. The unlabeled test inputs are not held back, they are placed on the table alongside the labeled examples and allowed to shape the answer. ::: {.callout-important title="Key idea"} Inductive learning answers "what is the rule for any future point?" Transductive learning answers "what are the labels of these particular points I already hold?" The second is a smaller question, and you should not solve a harder problem than the one you have. ::: That last sentence is Vapnik's principle, and it is the philosophical engine of this chapter. Vapnik (1998) put it sharply: when solving a problem of interest, do not solve a more general problem as an intermediate step. Estimating a function over the whole input space, only to read off its values at a handful of points, is a detour. If you know the points in advance, aim straight at them. This chapter develops that idea in three settings. First the formal distinction between induction and transduction, with the notation pinned down. Then transductive support vector machines, which push a margin through low-density regions defined partly by the test points. Then graph-based label propagation and label spreading, where labels diffuse from a few labeled seeds across a similarity graph built from labeled and unlabeled points together. We connect all of this to semi-supervised learning (@sec-semi-supervised-learning), since transduction is one of its two faces, and we are honest about the limits: a purely transductive method gives you labels for the batch and nothing for genuinely new arrivals. We close with a runnable label-propagation demo on a two-moons dataset, with a figure of the propagated labels and an accuracy table. ## Induction versus Transduction We observe a labeled set $L = \{(x_1, y_1), \dots, (x_l, y_l)\}$ and an unlabeled set $U = \{x_{l+1}, \dots, x_{l+u}\}$, with inputs $x_i \in \mathbb{R}^p$ and labels $y_i \in \{1, \dots, C\}$. Write $n = l + u$ for the total. In the transductive setting the unlabeled points $U$ are exactly the test points: there is no separate, unseen test set, and the deliverable is the labeling $(\hat y_{l+1}, \dots, \hat y_{l+u})$ of $U$. The two learning modes differ in what they produce. ::: {.callout-note} Inductive learning produces a function $\hat f : \mathbb{R}^p \to \{1, \dots, C\}$ defined on the whole input space, then sets $\hat y_i = \hat f(x_i)$ for any point of interest, seen or unseen. Transductive learning produces only the vector of labels on $U$. There is no $\hat f$ to carry forward. ::: The gap matters statistically. Inductive learning controls the risk of $\hat f$ over the entire distribution $p(x)$, a uniform guarantee across all possible inputs. Transduction needs only that the labeling be good on the finite set $U$, which is a weaker requirement and, by Vapnik's argument, can be met with tighter sample-dependent bounds. Vapnik (1998) and Vapnik (2006) derive transductive generalization bounds that depend on the actual configuration of the $n$ points rather than on the capacity of a function class over all of $\mathbb{R}^p$. A small but important subtlety: a transductive method can always be turned into an inductive one after the fact. Once $U$ has been labeled, the now fully-labeled set $L \cup U$ can train an ordinary supervised model, or you can attach an out-of-sample extension that maps a new point to a prediction by its similarity to the labeled batch. The reverse is cheap too: any inductive $\hat f$ yields transductive labels by evaluation at $U$. The distinction is therefore about what the method optimizes and what guarantees it carries, not an impassable wall. ### Why the transductive bound can be tighter {#sec-transductive-learning-bound} The statistical advantage of transduction is concrete, not merely philosophical, and it is worth making precise. Fix the full pool of $n = l + u$ points and their inputs. In the transductive protocol the labeled set $L$ is a random subset of size $l$ drawn *without replacement* from this finite pool, and the test set is the complementary $u$ points. Because the pool is fixed, the relevant complexity is not the capacity of a function class over all of $\mathbb{R}^p$ but the number of distinct labelings the hypothesis class can realize *on these $n$ points*, a quantity bounded by the empirical growth function $\mathcal{N}(\mathcal{H}, n)$. The driving probabilistic fact is a concentration inequality for sampling without replacement. Let $a_1, \dots, a_n$ be fixed numbers (here, the per-point losses of a fixed labeling) with mean $\bar a = \tfrac1n \sum_i a_i$, and let $\bar a_L = \tfrac1l \sum_{i \in L} a_i$ be the average over a random size-$l$ subset. Hoeffding's inequality for sampling without replacement gives, for $a_i \in [0,1]$, $$ \Pr\!\left( \bar a_L - \bar a \ge \epsilon \right) \le \exp\!\left( -\frac{2\, l\, n}{u}\, \epsilon^2 \right), $$ which is *sharper* than the with-replacement bound $\exp(-2 l \epsilon^2)$ by the finite-population correction factor $1/(1 - l/n) = n/u > 1$, giving the larger exponent $2 l n / u$ and hence a tighter bound (reflecting the reduced variance of sampling without replacement). The gap between the labeled (empirical) error and the unlabeled (test) error of any fixed labeling is therefore controlled by $n/(l u) = (1/l)/(1 - l/n)$ rather than $1/l$. Taking a union bound over the $\mathcal{N}(\mathcal{H}, n)$ distinct labelings yields a bound of the schematic form $$ R_u(h) \;\le\; R_l(h) \;+\; O\!\left( \sqrt{ \frac{ \log \mathcal{N}(\mathcal{H}, n) }{ l\,(1 - l/n) } } \right) \;=\; R_l(h) \;+\; O\!\left( \sqrt{ \frac{ n\,\log \mathcal{N}(\mathcal{H}, n) }{ l\, u } } \right), \qquad $$ {#eq-transductive-learning-tbound} where $R_l$ and $R_u$ are the error rates on the labeled and unlabeled portions (Vapnik 1998; Vapnik 2006). Two features distinguish @eq-transductive-learning-tbound from a standard inductive bound. First, $\log \mathcal{N}(\mathcal{H}, n)$ is data-dependent: it counts only the labelings achievable on the actual configuration of points, which can be far smaller than a worst-case VC term when the points cluster. Second, the finite-population factor $1/(1 - l/n)$ is close to $1$ in the typical transductive regime $l \ll n$ (so $u \approx n$), where the bound is at its tightest; it grows as $l \to n$ because then only a tiny unlabeled set $u \to 0$ remains to be predicted, which is intrinsically high-variance. This is the formal content of "do not solve a harder problem than the one you have": the inductive bound must hold uniformly over $p(x)$, while @eq-transductive-learning-tbound only governs the $n$ points on the table. ::: {.callout-tip title="Intuition"} Picture a connect-the-dots puzzle. Induction is drawing the full underlying curve so you could place a dot anywhere. Transduction is filling in only the numbered dots you were given. If all you need are those dots, drawing the entire curve first is wasted effort, and worse, you might draw the curve badly in regions you were never asked about and let that error leak back into the dots you cared about. ::: ## Transductive Support Vector Machines The first concrete transductive method extends the support vector machine. Recall the standard (inductive) soft-margin SVM for a linear decision function $f(x) = w^\top x + b$ with labels $y_i \in \{-1, +1\}$. It minimizes $$ \min_{w, b, \xi}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0 , $$ where $\xi_i$ are slack variables and $C$ trades margin width against training error. This uses only the labeled points $L$. (See the support vector machines chapter, @sec-svm, for the full development.) The transductive SVM (TSVM) of Vapnik (1998), made practical by Joachims (1999), brings the unlabeled test points into the objective. The idea rests on the *cluster assumption*: the decision boundary should fall in a low-density region of $p(x)$, so it should not slice through a dense clump of unlabeled points. We therefore treat the unknown labels $y^*_{l+1}, \dots, y^*_{l+u}$ of $U$ as additional optimization variables and ask the margin to be large for the unlabeled points too: $$ \min_{w, b, \xi, \xi^*, \, y^*_{j}}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{l} \xi_i + C^* \sum_{j=l+1}^{l+u} \xi^*_j $$ subject to, for the labeled points, $$ y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \qquad i = 1, \dots, l , $$ and for the unlabeled points with their assigned labels $y^*_j \in \{-1, +1\}$, $$ y^*_j (w^\top x_j + b) \ge 1 - \xi^*_j, \quad \xi^*_j \ge 0, \qquad j = l+1, \dots, l+u . $$ The new term $C^* \sum_j \xi^*_j$ penalizes unlabeled points that land inside the margin, so the optimizer prefers a hyperplane that keeps the whole unlabeled cloud clear of the boundary. The penalty weight $C^*$ governs how much the test points are allowed to bend the solution. ::: {.callout-tip title="Intuition"} A plain SVM positions a fence using only the flagged animals on each side. A TSVM also sees the unflagged herd milling around and refuses to run the fence straight through the middle of a tight group, because a good boundary should sit in the empty corridor between herds, not split one in two. ::: ### The effective loss and low-density separation {#sec-transductive-learning-tsvm-loss} The discrete labels can be eliminated to expose what the TSVM really penalizes on the unlabeled points. For a single unlabeled point $x_j$ the inner optimization over its label is $$ \min_{y^*_j \in \{-1,+1\}}\; \min_{\xi^*_j \ge 0}\; \xi^*_j \;\;\text{s.t.}\;\; y^*_j (w^\top x_j + b) \ge 1 - \xi^*_j . $$ For a fixed sign of $y^*_j$ the optimal slack is the hinge $\xi^*_j = \max(0,\, 1 - y^*_j f(x_j))$ with $f(x) = w^\top x + b$. Minimizing over the two label choices gives the closed form $$ \min_{y^*_j} \max\!\big(0,\, 1 - y^*_j f(x_j)\big) \;=\; \max\!\big(0,\, 1 - |f(x_j)|\big) , \qquad $$ {#eq-transductive-learning-hatloss} since the better of the two labels is $y^*_j = \operatorname{sign} f(x_j)$. The unlabeled term in the TSVM objective is therefore $C^* \sum_j \max(0, 1 - |f(x_j)|)$. This function, often called the *hat loss* or symmetric hinge, is plotted as a function of the signed output $f(x_j)$ as an inverted "$\wedge$": it is zero when $|f(x_j)| \ge 1$ (the point sits outside the margin, either side) and grows as the point approaches the boundary $f = 0$. ::: {.callout-note} @eq-transductive-learning-hatloss makes the cluster assumption operational. The hat loss is large exactly for unlabeled points near the decision boundary, so minimizing it pushes the boundary *away* from where unlabeled points are dense, that is, into a low-density region of $p(x)$. This is the precise sense in which a TSVM implements low-density separation. The non-convexity of the whole problem is now visible too: the hat loss is non-convex in $f(x_j)$ (it decreases then increases), so summing it over unlabeled points creates the many local minima that plague TSVM training. ::: Two further structural points. First, without a constraint the optimizer can make the unlabeled term vanish trivially by sending all unlabeled points to one side (giving them all the same label and a large margin), so practical TSVMs impose a *balancing constraint* $$ \frac{1}{u} \sum_{j=l+1}^{l+u} \max\!\big(0,\, \operatorname{sign} f(x_j)\big) \;=\; r , $$ forcing the predicted positive fraction on $U$ to match an assumed ratio $r$ (often the labeled-set positive fraction). Second, @eq-transductive-learning-hatloss connects the TSVM to *entropy regularization*: penalizing $\max(0, 1 - |f|)$ discourages confident-margin-violating outputs in the same spirit that an entropy penalty $-\sum_c p_c \log p_c$ on a probabilistic model pushes posteriors toward $0/1$ in low-density regions (Grandvalet and Bengio 2005). Both are surrogates for the same prior: the boundary belongs where data is sparse. The catch is that the labels $y^*_j$ are discrete unknowns, so the problem is no longer a convex quadratic program; it is a mixed-integer, non-convex optimization, NP-hard in general. Joachims (1999) gives a label-switching heuristic ($\mathrm{SVM}^{light}$): start from the labels an inductive SVM would assign, anneal $C^*$ upward from a small value, and iteratively flip pairs of unlabeled labels whenever doing so lowers the objective. Other approaches relax the integer labels continuously (Chapelle, Sindhwani, and Keerthi, 2008). All of them inherit sensitivity to initialization and to the assumed class balance on $U$. ::: {.callout-warning} TSVMs are notoriously fiddly. The non-convex objective has many local minima, and if the cluster assumption fails (classes overlap heavily, or the test cloud is not cleanly separable) the unlabeled term can drag the boundary into a worse place than the labeled-only SVM. Treat $C^*$ and the assumed positive fraction on $U$ as hyperparameters to validate, not constants to trust. ::: We do not run a TSVM in the demo below because the off-the-shelf R tooling for it is thin and the optimization is finicky; instead we use graph-based propagation, which is convex, transparent, and easy to implement in base R. The TSVM remains worth knowing as the canonical low-density-separation transductive method. ## Graph-Based Label Propagation Graph-based methods take a different route to the same goal. They lean on the *manifold assumption*: the data lie near a low-dimensional manifold, and labels vary smoothly along it, so two points close on the manifold should share a label even if other measures of distance disagree. We encode "closeness" in a graph and let labels diffuse along its edges. ### Building the graph Put all $n = l + u$ points (labeled and unlabeled together) as nodes of a weighted graph. The edge weight between points $i$ and $j$ measures similarity, commonly a Gaussian (radial basis function) kernel on Euclidean distance: $$ W_{ij} = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2 \sigma^2} \right), \qquad W_{ii} = 0 , $$ where the bandwidth $\sigma$ sets the scale of "near." Large $\sigma$ connects almost everything; small $\sigma$ keeps only the closest neighbors linked. In practice one often sparsifies $W$ to a $k$-nearest-neighbor graph, but the dense kernel is enough to see the mechanism. Define the diagonal degree matrix $D$ with $D_{ii} = \sum_j W_{ij}$. ### The label-propagation algorithm Zhu and Ghahramani (2002) propose the following. Encode labels as an $n \times C$ matrix $F$, where row $i$ is a distribution over the $C$ classes; for labeled points fix that row to the one-hot indicator of the true class. Form the row-normalized transition matrix $$ P = D^{-1} W, \qquad P_{ij} = \frac{W_{ij}}{\sum_k W_{ik}} , $$ so $P_{ij}$ is the probability of stepping from $i$ to $j$ in one move of a random walk on the graph. Then iterate two steps to convergence: 1. Propagate: $F \leftarrow P F$. Each node averages its neighbors' current label distributions, weighted by similarity. 2. Clamp: reset the rows of $F$ for labeled points back to their known one-hot labels. Clamping is what makes the labeled seeds act as permanent sources: every sweep lets their influence flow one more hop into the unlabeled mass, and resetting them prevents that influence from being diluted away. Iterating drives $F$ to a fixed point. The final label of an unlabeled point is $\hat y_i = \arg\max_c F_{ic}$. There is a clean closed form. Partition $W$, $P$, and $F$ into labeled ($\ell$) and unlabeled ($u$) blocks, $$ P = \begin{pmatrix} P_{\ell\ell} & P_{\ell u} \\ P_{u\ell} & P_{uu} \end{pmatrix}, \qquad F = \begin{pmatrix} F_\ell \\ F_u \end{pmatrix}, $$ with $F_\ell$ held fixed at the known labels $Y_\ell$. At the fixed point the unlabeled rows satisfy $F_u = P_{u\ell} Y_\ell + P_{uu} F_u$, which solves to $$ F_u = (I - P_{uu})^{-1} P_{u\ell}\, Y_\ell . $$ This is the *harmonic* solution of Zhu, Ghahramani, and Lafferty (2003): the propagated label scores are a harmonic function on the graph, minimizing the energy $\tfrac{1}{2} \sum_{i,j} W_{ij} (F_i - F_j)^2$ subject to matching the labeled values. Smoothness over the graph and fixed seeds together pin down a unique answer. ### Deriving the harmonic solution {#sec-transductive-learning-harmonic} It is worth deriving the closed form from the energy directly, because the same machinery (the graph Laplacian) recurs throughout graph-based learning. Work with one label column $f \in \mathbb{R}^n$ at a time (the columns of $F$ separate). The smoothness energy is $$ E(f) \;=\; \frac{1}{2} \sum_{i,j} W_{ij}\,(f_i - f_j)^2 . $$ Expanding the square and using $D_{ii} = \sum_j W_{ij}$, $$ E(f) = \frac{1}{2}\sum_{i,j} W_{ij}\,(f_i^2 - 2 f_i f_j + f_j^2) = \sum_i D_{ii} f_i^2 - \sum_{i,j} W_{ij} f_i f_j = f^\top (D - W)\, f = f^\top L f , $$ where $L = D - W$ is the (unnormalized) *graph Laplacian*. The energy is a quadratic form in the Laplacian, and since $E(f) \ge 0$ for all $f$, $L$ is positive semidefinite. Minimizing $E$ subject to clamping the labeled entries $f_\ell = y_\ell$ is a constrained quadratic problem. Partition $L$ in the labeled/unlabeled blocks, $$ L = \begin{pmatrix} L_{\ell\ell} & L_{\ell u} \\ L_{u\ell} & L_{uu} \end{pmatrix}, \qquad f = \begin{pmatrix} y_\ell \\ f_u \end{pmatrix} . $$ Only $f_u$ is free, so we set the gradient of $E$ with respect to $f_u$ to zero. Writing $E = f^\top L f$ and differentiating the block form, $$ \frac{\partial E}{\partial f_u} = 2\,\big(L_{u\ell}\, y_\ell + L_{uu}\, f_u\big) = 0 \quad\Longrightarrow\quad f_u = -\,L_{uu}^{-1} L_{u\ell}\, y_\ell . \qquad $$ {#eq-transductive-learning-harm} The minimizer exists and is unique whenever $L_{uu}$ is invertible, which holds exactly when every connected component of the graph contains at least one labeled point (otherwise an unlabeled component is free to take any constant value at zero energy, and the solution is undetermined). This is the algebraic reason disconnected, seed-free components are a silent failure mode. To see that @eq-transductive-learning-harm is the same fixed point as propagate-and-clamp, recall $P = D^{-1} W$ so that $I - P_{uu} = D_u^{-1}(D_u - W_{uu}) = D_u^{-1} L_{uu}$ and $P_{u\ell} = D_u^{-1} W_{u\ell} = -D_u^{-1} L_{u\ell}$ (using $L_{u\ell} = -W_{u\ell}$ since the off-diagonal labeled/unlabeled block of $D$ is zero). Hence $$ (I - P_{uu})^{-1} P_{u\ell} = (D_u^{-1} L_{uu})^{-1}(-D_u^{-1} L_{u\ell}) = -L_{uu}^{-1} L_{u\ell} , $$ so $(I - P_{uu})^{-1} P_{u\ell} y_\ell = -L_{uu}^{-1} L_{u\ell} y_\ell = f_u$. The iterative algorithm and the energy minimizer coincide. The name "harmonic" is literal. At the optimum each unlabeled coordinate satisfies the discrete mean-value property: from $L_{uu} f_u + L_{u\ell} y_\ell = 0$, the $i$-th unlabeled equation reads $D_{ii} f_i = \sum_j W_{ij} f_j$, that is, $$ f_i = \frac{1}{D_{ii}} \sum_{j} W_{ij}\, f_j = \sum_j P_{ij}\, f_j , $$ so every unlabeled value is the weighted average of its neighbors. This is the discrete analogue of $\nabla^2 f = 0$, and it also gives a probabilistic reading: $f_i$ equals the probability that a random walk started at $i$, with transition matrix $P$, reaches a positively-labeled seed before a negatively-labeled one (the *absorbing* random walk interpretation of Zhu, Ghahramani, and Lafferty 2003). ### Convergence of propagate-and-clamp {#sec-transductive-learning-prop-conv} The iteration $f_u \leftarrow P_{uu} f_u + P_{u\ell} y_\ell$ converges geometrically. Its iteration matrix is $P_{uu}$, a submatrix of the stochastic matrix $P$ with some row sums strictly less than one (any unlabeled node adjacent to a seed leaks probability mass into the clamped block). For a connected graph with at least one seed, the spectral radius satisfies $\rho(P_{uu}) < 1$, so the affine map is a contraction and $$ \lVert f_u^{(t)} - f_u^\star \rVert \le \rho(P_{uu})^{\,t}\, \lVert f_u^{(0)} - f_u^\star \rVert . $$ The error contracts by a factor $\rho(P_{uu})$ per sweep, so reaching tolerance $\varepsilon$ takes $O\!\big(\log(1/\varepsilon) / \log(1/\rho(P_{uu}))\big)$ sweeps. The gap $1 - \rho(P_{uu})$ shrinks when seeds are sparse or poorly connected, which is why few, scattered labels both slow convergence and weaken the solution. ::: {.callout-tip title="Intuition"} Drop ink (the labeled colors) into a few cups of a tray of connected water cells; thicker tubes (larger $W_{ij}$) between cells let color flow faster. Keep re-inking the source cups every round so they never fade. After things settle, each cell's tint is a weighted blend of the sources reachable through the tubing, and the dominant tint is its predicted label. ::: ### Label spreading Label propagation clamps the labeled points hard, which assumes the given labels are exactly right. Zhou et al. (2004) propose label spreading, a softer variant that tolerates label noise and uses a symmetric normalization. Define the symmetrically normalized weight matrix $$ S = D^{-1/2} W D^{-1/2} , $$ and iterate $$ F \leftarrow \alpha S F + (1 - \alpha) Y , $$ where $Y$ is the initial label matrix (one-hot on labeled rows, zero on unlabeled rows) and $\alpha \in (0, 1)$ balances trusting the graph ($\alpha$ near $1$) against trusting the original labels ($\alpha$ near $0$). This iteration is a contraction and converges to the closed form $$ F^* = (1 - \alpha)\,(I - \alpha S)^{-1} Y . $$ ### Deriving label spreading from a regularized objective {#sec-transductive-learning-spread} Both the closed form and the convergence proof fall out of a single regularization functional. Zhou et al. (2004) define, for each label column $f$, $$ Q(f) = \underbrace{\frac{1}{2} \sum_{i,j} W_{ij} \left( \frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}} \right)^2}_{\text{smoothness}} \;+\; \underbrace{\frac{\mu}{2} \sum_i (f_i - y_i)^2}_{\text{fitting}} , \qquad $$ {#eq-transductive-learning-spreadobj} where $y$ is the initial label vector and $\mu > 0$ trades smoothness against fidelity to the given labels. The smoothness term is the quadratic form of the *normalized* Laplacian: expanding it gives $f^\top (I - S) f$ with $S = D^{-1/2} W D^{-1/2}$, because the degree normalization $f_i / \sqrt{D_{ii}}$ converts $D - W$ into $I - S$. So $$ Q(f) = f^\top (I - S) f + \frac{\mu}{2}\,\lVert f - y \rVert^2 . $$ Setting $\nabla Q(f) = 0$, $$ 2(I - S) f + \mu (f - y) = 0 \;\Longrightarrow\; f^\star = \big(I - S + \tfrac{\mu}{2} I\big)^{-1} \tfrac{\mu}{2}\, y . $$ Define $\alpha = \dfrac{1}{1 + \mu/2}$, equivalently $\mu/2 = (1-\alpha)/\alpha$. Multiplying numerator and denominator by $\alpha$, $$ f^\star = \alpha\big(\alpha(I - S) + (1-\alpha) I\big)^{-1} \tfrac{1-\alpha}{\alpha}\, y = (1 - \alpha)\,(I - \alpha S)^{-1} y , $$ which is exactly the stated closed form. The parameter $\alpha \in (0,1)$ is thus a reparameterization of the smoothness/fidelity weight $\mu$: $\alpha \to 1$ means $\mu \to 0$ (trust the graph) and $\alpha \to 0$ means $\mu \to \infty$ (trust the labels, recovering hard clamping in the limit). The contraction claim is now immediate. The iteration $F \leftarrow \alpha S F + (1-\alpha) Y$ has iteration matrix $\alpha S$. The symmetric matrix $S = D^{-1/2} W D^{-1/2}$ is similar to the stochastic matrix $P = D^{-1} W$ (since $S = D^{1/2} P D^{-1/2}$), so they share eigenvalues, all lying in $[-1, 1]$; hence $\rho(S) \le 1$ and $\rho(\alpha S) = \alpha\, \rho(S) \le \alpha < 1$. The map is a contraction with factor at most $\alpha$, so the Neumann series converges, $$ F^{(t)} = (\alpha S)^t F^{(0)} + (1-\alpha) \sum_{k=0}^{t-1} (\alpha S)^k Y \;\xrightarrow[t\to\infty]{}\; (1-\alpha)\sum_{k=0}^{\infty} (\alpha S)^k Y = (1-\alpha)(I - \alpha S)^{-1} Y , $$ independently of the initialization $F^{(0)}$. The series form also exposes the mechanism: the contribution of $Y$ propagated over $k$ hops is geometrically discounted by $\alpha^k$, so label spreading is a soft, distance-decayed diffusion in contrast to label propagation's hard absorbing walk. ::: {.callout-note} The two methods differ in two knobs. Propagation uses the random-walk normalization $P = D^{-1}W$ and clamps seeds exactly; spreading uses the symmetric normalization $S = D^{-1/2} W D^{-1/2}$ and only pulls toward seeds with strength $1 - \alpha$. When labels are clean and few, propagation is the natural default; when some seed labels may be wrong, spreading's soft clamping is safer. ::: Both methods are inherently transductive: they return labels for the $n$ nodes that were in the graph and nothing else. A point that arrives after the graph is built has no row in $F$. To label it you must either rebuild the graph with the new point included and re-solve (back to transduction on a larger batch) or add an out-of-sample extension, for example a Nadaraya-Watson average of neighbor labels (@sec-kernel) using the same kernel. ## Connection to Semi-Supervised Learning Transduction and semi-supervised learning (SSL) overlap heavily but are not identical, and keeping them straight avoids confusion. SSL (@sec-semi-supervised-learning) is defined by its *data*: a little labeled, a lot unlabeled. Transduction versus induction is a distinction about the *goal*: label this fixed batch, or learn a reusable rule. Every transductive method is doing semi-supervised learning, since it uses unlabeled points during fitting. But not every SSL method is transductive: self-training and co-training (covered in the SSL chapter) produce a reusable inductive classifier, while label propagation and TSVMs produce a batch labeling. ::: {.callout-important title="Key idea"} SSL describes the data regime (few labels, many unlabeled). Transduction describes the target (a fixed set of points to label, no model for the future). They co-occur because the cheapest way to exploit unlabeled data is often to label it directly, which is exactly transduction. The two assumptions that make SSL work, the cluster assumption and the manifold assumption, are the same two assumptions that justify TSVMs and graph propagation respectively. ::: The practical upshot: if your unlabeled data *is* the test set and you will never see those exact points again as a population worth modeling, reach for a transductive method and skip the detour of fitting a general $\hat f$. If you need to score a stream of future points, you want induction, and SSL's inductive methods (or a transductive method followed by retraining a supervised model on the filled-in labels) are the right tools. ## Runnable Demo: Label Propagation on Two Moons We now implement label propagation from scratch in base R on a two-moons dataset, the standard stress test for the manifold assumption: two interleaving crescents that no linear boundary separates but that are obvious to a human eye. We reveal only a couple of labeled seeds per moon and let the algorithm propagate labels to every remaining point, treating those points as the known transductive test set. The first chunk generates the data and reveals four seeds (two per class). ```{r transductive-learning-setup} set.seed(1301) # Generate a two-moons dataset: two interleaving half-circles. make_two_moons <- function(n_per, noise = 0.12) { t_top <- seq(0, pi, length.out = n_per) x_top <- cbind(cos(t_top), sin(t_top)) t_bot <- seq(0, pi, length.out = n_per) x_bot <- cbind(1 - cos(t_bot), -sin(t_bot) - 0.5) X <- rbind(x_top, x_bot) X <- X + matrix(rnorm(length(X), sd = noise), ncol = 2) y <- c(rep(1L, n_per), rep(2L, n_per)) list(X = X, y = y) } n_per <- 100 dat <- make_two_moons(n_per) X <- dat$X y_true <- dat$y n <- nrow(X) # Reveal only a few labeled seeds; everything else is the transductive test set. seeds_per_class <- 2 seed_idx <- c( sample(which(y_true == 1L), seeds_per_class), sample(which(y_true == 2L), seeds_per_class) ) is_labeled <- rep(FALSE, n) is_labeled[seed_idx] <- TRUE cat("Total points:", n, "| labeled seeds:", sum(is_labeled), "| unlabeled (transductive test):", sum(!is_labeled), "\n") ``` The second chunk builds the Gaussian-kernel graph, runs propagate-and-clamp to convergence, and reads off the transductive labels. ```{r transductive-learning-propagate} # Gaussian (RBF) similarity graph over ALL points, labeled and unlabeled. sq_dist <- as.matrix(dist(X))^2 sigma <- 0.30 W <- exp(-sq_dist / (2 * sigma^2)) diag(W) <- 0 # Row-normalized random-walk transition matrix P = D^{-1} W. D <- rowSums(W) P <- W / D # Label matrix F: one row per point, one column per class (C = 2). C <- 2L F_mat <- matrix(0, nrow = n, ncol = C) for (i in which(is_labeled)) F_mat[i, y_true[i]] <- 1 # Save the clamped seed rows so we can reset them each sweep. Y_seed <- F_mat[is_labeled, , drop = FALSE] # Iterate: propagate (F <- P F) then clamp the labeled rows. for (iter in 1:1000) { F_new <- P %*% F_mat F_new[is_labeled, ] <- Y_seed # clamp seeds back to known labels if (max(abs(F_new - F_mat)) < 1e-9) { F_mat <- F_new break } F_mat <- F_new } # Transductive prediction: argmax over classes for every point. y_prop <- max.col(F_mat, ties.method = "first") # Accuracy on the unlabeled (test) points only. test_idx <- which(!is_labeled) acc_prop <- mean(y_prop[test_idx] == y_true[test_idx]) cat("Converged in", iter, "iterations.", "Transductive accuracy on unlabeled points:", round(acc_prop, 3), "\n") ``` We can confirm numerically that the iterative propagate-and-clamp result equals the harmonic closed form @eq-transductive-learning-harm derived above, $F_u = -L_{uu}^{-1} L_{u\ell} Y_\ell$ with $L = D - W$. ```{r transductive-learning-verify} # Verify: iterative fixed point == harmonic closed form via the Laplacian. L_mat <- diag(D) - W # unnormalized graph Laplacian u_idx <- which(!is_labeled) l_idx <- which(is_labeled) F_u_closed <- -solve(L_mat[u_idx, u_idx]) %*% L_mat[u_idx, l_idx] %*% Y_seed cat("max abs difference (iteration vs harmonic closed form):", format(max(abs(F_mat[u_idx, ] - F_u_closed)), digits = 3), "\n") ``` ::: {.callout-note} The bandwidth $\sigma$ is the one knob that matters most here. Too small and the graph fragments into disconnected pieces that the seeds cannot reach (those points keep their initial all-zero rows and get labeled arbitrarily); too large and the two moons fuse into one blob and labels bleed across the gap. We use $\sigma = 0.30$, which connects each crescent internally while keeping the crescents weakly linked. ::: ### Baseline comparison To show what the unlabeled structure buys us, we compare against a purely supervised baseline that sees only the four seeds: a 1-nearest-neighbor classifier (@sec-knn) trained on the seeds and applied to the same test points. With so few labels and a curved boundary, the geometry-blind baseline should struggle where propagation succeeds. ```{r transductive-learning-baseline} # 1-NN trained ONLY on the labeled seeds, predicting the unlabeled test points. nn1 <- function(train_X, train_y, test_X) { apply(test_X, 1, function(q) { d <- colSums((t(train_X) - q)^2) train_y[which.min(d)] }) } y_knn <- nn1(X[is_labeled, , drop = FALSE], y_true[is_labeled], X[test_idx, , drop = FALSE]) acc_knn <- mean(y_knn == y_true[test_idx]) results <- data.frame( Method = c("1-NN on seeds only (inductive baseline)", "Label propagation (transductive)"), `Uses unlabeled points` = c("No", "Yes"), `Test accuracy` = c(round(acc_knn, 3), round(acc_prop, 3)), check.names = FALSE ) results ``` @tbl-transductive-learning-accuracy reports the accuracy of both methods on the unlabeled test points. ```{r tbl-transductive-learning-accuracy} knitr::kable( results, caption = "Test-set accuracy on the unlabeled two-moons points: a supervised 1-nearest-neighbor classifier using only the four labeled seeds, versus transductive label propagation that also uses the geometry of the unlabeled points." ) ``` @fig-transductive-learning-moons shows the outcome. The left panel is the input the algorithm sees: a sea of grey unlabeled points with four colored seeds. The right panel is the transductive result: every point colored by its propagated label, with the seeds ringed in black. Propagation follows each crescent around its curve, which a seed-only classifier cannot do. ```{r fig-transductive-learning-moons, fig.cap="Label propagation on two moons. Left: the transductive input, four labeled seeds (colored, ringed) among many unlabeled points (grey). Right: propagated labels for every point, colored by predicted class with seeds ringed in black. Labels flow along each crescent, recovering the curved structure from only four seeds.", fig.width=9, fig.height=4.5} op <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) cols <- c("#D55E00", "#0072B2") # Left: what the algorithm starts with. plot(X, col = "grey75", pch = 16, cex = 0.7, xlab = expression(x[1]), ylab = expression(x[2]), main = "Input: 4 seeds among unlabeled points") points(X[is_labeled, , drop = FALSE], col = cols[y_true[is_labeled]], pch = 16, cex = 1.8) points(X[is_labeled, , drop = FALSE], col = "black", pch = 1, cex = 2.0, lwd = 1.5) # Right: the propagated transductive labeling. plot(X, col = cols[y_prop], pch = 16, cex = 0.8, xlab = expression(x[1]), ylab = expression(x[2]), main = "Output: propagated transductive labels") points(X[is_labeled, , drop = FALSE], col = "black", pch = 1, cex = 2.0, lwd = 1.5) par(op) ``` The lesson the figure makes concrete: with four labels and no geometry, you cannot recover two interleaving crescents, but with four labels plus the similarity graph over the test points themselves, you can. That is transduction earning its keep. ## Practical Guidance and Pitfalls A few things decide whether transduction helps or quietly hurts. ::: {.callout-tip title="When to use this"} Reach for a transductive method when (1) the points you must label are fixed and known at training time, (2) labels are scarce while unlabeled inputs are plentiful, and (3) you have reason to believe the cluster or manifold assumption holds, that is, the classes form coherent groups or lie along smooth low-dimensional structure. The two-moons demo satisfies all three. ::: The single biggest failure mode is the assumption itself. If the classes overlap heavily, or the manifold structure does not align with the labels, the unlabeled points pull predictions toward structure in $p(x)$ that has nothing to do with $y$. Then label propagation smears labels across the true boundary and a TSVM parks its hyperplane in the wrong corridor. Always sanity-check against a supervised baseline trained on the same labeled points; if the transductive method does not beat it, the assumption is probably violated. Graph construction is the part that most repays care. The bandwidth $\sigma$ (or the neighbor count $k$ in a $k$-NN graph) sets the scale of similarity, and the result can swing from excellent to useless across a factor of two in $\sigma$. A common heuristic sets $\sigma$ to the median pairwise distance among nearby points, or tunes it by cross-validation on whatever labels you can spare. Disconnected components are a silent trap: if a chunk of unlabeled points has no path to any seed, its labels are undetermined and the code will hand back whatever the argmax of an all-zero row happens to be. Check connectivity, and make sure every class has at least one seed in each region it occupies. Class balance matters too. Label propagation has no notion of prior class frequencies, so if the seeds are wildly unrepresentative the propagated proportions can drift. Some implementations add a class-mass normalization step (Zhu, Ghahramani, and Lafferty, 2003) that rescales the columns of $F$ to match a target class distribution; if you know roughly how the batch splits across classes, use it. ::: {.callout-warning} The defining limitation is structural, not a tuning issue. A purely transductive method gives you labels for the batch in hand and no model for genuinely new points. If tomorrow's data arrives as a fresh batch, you must rebuild the graph (or rerun the TSVM) on the new points, or you must add an explicit out-of-sample extension or retrain a supervised model on the filled-in labels. Do not promise stakeholders a deployable scoring service from a transductive fit alone. ::: Finally, watch the cost. The harmonic closed form $F_u = (I - P_{uu})^{-1} P_{u\ell} Y_\ell$ inverts a $u \times u$ matrix, which is $O(u^3)$ time and $O(u^2)$ memory, infeasible for large $u$. Building the dense graph is itself $O(n^2 p)$ for the pairwise distances. The iterative propagate-and-clamp loop is far cheaper per sweep: on a sparse $k$-NN graph with $m = O(kn)$ edges each sweep is a sparse matvec costing $O(m C)$, and from @sec-transductive-learning-prop-conv the number of sweeps to tolerance $\varepsilon$ is $O(\log(1/\varepsilon) / (1 - \rho(P_{uu})))$, so total cost is roughly $O\!\big(k n C \log(1/\varepsilon) / (1-\rho(P_{uu}))\big)$. This is what scales to large unlabeled batches. For label spreading the same analysis applies with iteration matrix $\alpha S$ and contraction factor $\alpha$, giving a guaranteed $O(\log(1/\varepsilon)/\log(1/\alpha))$ sweeps regardless of graph connectivity. ### Class-mass normalization {#sec-transductive-learning-cmn} Because the harmonic solution ignores class priors, the raw argmax of $F_u$ can be badly miscalibrated when seeds are unrepresentative of the true class proportions. Class-mass normalization (Zhu, Ghahramani, and Lafferty 2003) corrects this. Let $q_c$ be the target mass for class $c$ (for example the labeled-set frequency, or a known batch proportion), and let the current column mass be $\sum_{i \in U} F_{ic}$. Rescale each column, $$ \tilde F_{ic} = F_{ic}\cdot \frac{q_c}{\sum_{j \in U} F_{jc}} , $$ and predict $\hat y_i = \arg\max_c \tilde F_{ic}$. This reweights the soft scores so the predicted class frequencies match $q$, which typically improves accuracy when classes are imbalanced or seeds are skewed, at the cost of requiring an estimate of $q$. ### Bias-variance and the role of the assumptions {#sec-transductive-learning-bv} Graph-based transduction trades variance for bias in a way controlled by the bandwidth. A small $\sigma$ (or small $k$) yields a near-diagonal graph: predictions depend on very local neighborhoods, so they are low-bias but high-variance, sensitive to individual noisy points and prone to disconnection. A large $\sigma$ over-smooths: labels are averaged over wide regions, lowering variance but introducing bias that bleeds labels across the true boundary. The optimum sits where the graph respects the manifold, connecting within classes while staying sparse across the low-density gap. Crucially, the *bias floor* is set by the manifold and cluster assumptions themselves: if labels do not vary smoothly along $p(x)$, no choice of $\sigma$ removes the bias, and the method cannot beat a supervised baseline. This is why the diagnostic of comparing against a same-seeds supervised model is not optional, it is the direct test of whether the modeling assumption that justifies transduction actually holds on your data. ## Further Reading - Vapnik, V. (1998). *Statistical Learning Theory*. The original statement of the transductive setting and the principle of not solving a harder problem than required. - Vapnik, V. (2006). *Estimation of Dependences Based on Empirical Data* (2nd ed.). Extended treatment of transduction and its generalization bounds. - Joachims, T. (1999). "Transductive Inference for Text Classification using Support Vector Machines." The practical TSVM training algorithm. - Chapelle, O., Sindhwani, V., and Keerthi, S. S. (2008). "Optimization Techniques for Semi-Supervised Support Vector Machines." Survey and comparison of TSVM optimizers. - Zhu, X., and Ghahramani, Z. (2002). "Learning from Labeled and Unlabeled Data with Label Propagation." The propagate-and-clamp algorithm used in the demo. - Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). "Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions." The harmonic-function view and class-mass normalization. - Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). "Learning with Local and Global Consistency." The label-spreading algorithm with symmetric normalization. - Chapelle, O., Schölkopf, B., and Zien, A. (eds.) (2006). *Semi-Supervised Learning*. Standard reference connecting transduction to the broader SSL landscape.