Almost every interesting question in science, business, and policy comes down to a prediction or an explanation. Will this patient respond to a treatment? Which customers are about to leave? What does this satellite image show? How does class size affect test scores? In each case we have data, some inputs we can observe, and some outcome we care about, and we would like a machine to learn the relationship between them rather than have a person write down the rules by hand. That activity, learning patterns from data so that we can predict or understand outcomes, is what this book is about. It goes by two names that we will treat as interchangeable: machine learning and statistical learning.1

The reason we hand this job to a machine is that the relationships are usually too complicated, too noisy, or too high-dimensional for a human to write down explicitly. A doctor cannot articulate the exact rule that separates a benign from a malignant tumor on an image, but a model trained on thousands of labeled images can approximate one. The central tension running through every method we study is the same: a model that is flexible enough to capture the true pattern is also flexible enough to memorize the noise, and a model simple enough to avoid the noise may miss the pattern. Learning to navigate that tension, the bias-variance tradeoff, is the single most important skill this book tries to teach.

This chapter lays the conceptual foundation for everything that follows. We begin with the vocabulary and the big picture: the kinds of learning problems, the kinds of data, and what “learning” even means. We then set up the general statistical framework (\(Y = f(X) + \epsilon\)) that unifies the later chapters, and we use it to derive how prediction error decomposes into reducible and irreducible parts. From there we turn to the practical questions every analyst faces: how do we measure whether a model is any good (model assessment), how do we estimate that quality honestly using cross-validation and the bootstrap, and how do we improve a model when we have too many predictors (subset selection, shrinkage, and dimension reduction). Linear regression appears throughout as the running example, both because it is the simplest model and because almost every advanced method can be understood as a relaxation of it.

Tip

This is a reference-style chapter. You do not need to read it front to back. If you are new to the field, read the first three sections (the big picture, the general framework, and model accuracy) carefully, then skim the rest and return as needed. If you are experienced, use the headings to jump to the topic you want; the callouts flag the ideas most worth remembering.

For readers who like a visual map of how the methods relate, Figure 1.1 lays out the field at a glance, and the overview chapter that follows (Chapter 2) expands each branch. Two textbooks are the standard references and are cited throughout this book: James et al. (2013) is the most accessible introduction, while Hastie, Friedman, and Tibshirani (2001) is the most mathematically complete.

Show code
graph LR
  ML([Machine Learning])
  ML --> SUP[Supervised learning]
  ML --> UNS[Unsupervised learning]
  ML --> RL[Reinforcement learning]
  SUP --> REG["Regression and classification:<br/>splines, GAMs, trees, ensembles,<br/>SVM, neural networks"]
  UNS --> CLU["Clustering and structure:<br/>k-means, hierarchical,<br/>PCA, t-SNE, UMAP"]
  RL --> POL["Sequential decisions:<br/>bandits, value and policy methods"]
  REG --> DL["Deep learning:<br/>CNNs, RNNs, Transformers, LLMs"]
graph LR
  ML([Machine Learning])
  ML --> SUP[Supervised learning]
  ML --> UNS[Unsupervised learning]
  ML --> RL[Reinforcement learning]
  SUP --> REG["Regression and classification:<br/>splines, GAMs, trees, ensembles,<br/>SVM, neural networks"]
  UNS --> CLU["Clustering and structure:<br/>k-means, hierarchical,<br/>PCA, t-SNE, UMAP"]
  RL --> POL["Sequential decisions:<br/>bandits, value and policy methods"]
  REG --> DL["Deep learning:<br/>CNNs, RNNs, Transformers, LLMs"]
Figure 1.1: A concept map of the field. The branches are the major learning types, and the leaves list representative methods developed in this book.

1.1 The Big Picture

1.1.1 Supervised versus unsupervised learning

The first fork in the road is whether or not your data come with an outcome to predict. In supervised learning we have both inputs and a labeled output, and the goal is to predict or estimate the output from the inputs. In unsupervised learning we have only inputs and no labeled output; the goal is to understand the structure and relationships among the inputs themselves.

Key idea

Supervised learning asks “given these inputs, what is the output?” Unsupervised learning asks “what structure is hidden in these inputs?” Much of this book is about supervised learning, but later parts treat unsupervised learning, reinforcement learning, and many further learning paradigms in their own right. The two basic modes are also complementary: unsupervised methods are often used to reduce the number of inputs before a supervised model is fit.

Two quantities recur constantly and deserve names up front. We write \(n\) for the sample size (the number of observations) and \(p\) for the number of features (the number of input variables). As a rough rule, more observations are almost always better, which is why deep learning practitioners go to such lengths to manufacture extra training examples through data augmentation, and fewer features are usually better, which is one of the main reasons unsupervised dimension-reduction methods exist (they shrink \(p\)).

Table 1.1 organizes the major method families by two questions: is the output (or the structure of interest) continuous or categorical, and is the problem supervised or unsupervised. Use it as a map; each entry is a chapter or section later in the book.

Table 1.1: Major method families, organized by output type (continuous or categorical) and by whether the problem is supervised or unsupervised.
Output type Unsupervised Supervised
Continuous Clustering and dimension reduction (SVD, PCA, k-means) Regression (linear, polynomial); tree-based methods
Categorical Association analysis (Apriori, FP-growth); hidden Markov models Classification (kNN, trees, logistic, naive Bayes, SVM)

1.1.2 What kind of data are you working with?

Before choosing a method it helps to recognize what form your data take, because that often dictates the preprocessing you will need. Data come in three broad forms. Structured data live in neat tables of rows and columns, the kind you can load straight into a data frame. Semi-structured data carry some organizing markup but are not a flat table; XML, JSON, and HTML are typical examples. Unstructured data have no built-in tabular form and must be turned into features before most models can use them; these include text, images (photos and videos), and networks such as social graphs.

1.1.3 What does it mean for a machine to “learn”?

Machine learning is concerned with the development, the analysis, and the application of algorithms that allow computers to learn. The operational definition we will use is simple: a computer learns if it improves its performance at some task as it gains experience, where “experience” means data. More concretely, learning means extracting a model of a system purely from observing it (or from simulating it), where by a model we mean some set of relationships among the variables that describe the system.

Note

A “model” in this book is not a single equation handed down from theory. It is whatever object (a formula, a tree, a network of weights) encodes the relationships we estimated from data. Different chapters use very different objects, but they all play this same role.

We learn for one of two goals, and it is worth being clear which one you are pursuing because it changes how you should build and judge a model. Prediction asks only for accurate outputs on new inputs; the model can be an opaque black box as long as it predicts well. Inference asks how the inputs relate to the output, which variables matter, in what direction, and by how much; here interpretability is the whole point.

Learning is especially worthwhile in situations where writing rules by hand is hopeless. There are four classic cases:

  • Human expertise does not exist, for example navigating a rover on Mars.
  • Humans have the expertise but cannot articulate it, as in speech recognition.
  • The right answer changes over time, as in routing traffic on a computer network.
  • The solution must be tailored to each individual case, as in adapting to a user’s biometrics.

These conditions show up across essentially every applied field. To make that concrete, here are representative tasks by domain:

  • Retail: market basket analysis, customer relationship management (CRM).
  • Finance: credit scoring, fraud detection.
  • Manufacturing: process optimization, troubleshooting.
  • Medicine: medical diagnosis.
  • Telecommunications: quality-of-service optimization, routing.
  • Bioinformatics: sequence motifs, alignment, web mining, search engines.

1.1.4 The workflow and what shapes performance

A learning project moves through four stages, and skipping or rushing any one of them is a common source of disappointing results. First comes data generation, where you decide what data types to collect and how large a sample you need. Next is preprocessing: normalization, handling missing values, and feature selection or extraction. Then comes the machine learning step itself, where you state a hypothesis and choose a learning paradigm and algorithm. Finally there is hypothesis validation, where you use cross-validation to check the model honestly and, if it passes, deploy it.

How well the final model performs is not determined by the algorithm alone. Four factors interact: the type of training data provided, the form and extent of any prior knowledge you can build in, the type of feedback the learner receives, and the learning algorithm itself. It helps to keep two of these factors separate in your mind throughout the book, because nearly every method is a particular choice on each axis: modeling (what family of relationships are we willing to consider?) and optimization (how do we search that family for the best fit?).

Intuition

An algorithm is the engine that searches through possible models and builds up the knowledge structure that represents what was learned. A good learning algorithm extracts the useful, generalizable information from the training examples while ignoring the accidental noise. The whole game is generalization, performing well on data you have not seen, not memorization of the data you have.

This brings us to a foundational caveat. Training is the process of making the machine learn from a training set, and we then judge it on a separate test set. But this only works under assumptions, summarized by the no free lunch principle: there is no single algorithm that is best for every possible problem.

Warning

Every method in this book carries an implicit assumption that the training set and the test set are drawn from the same distribution. When that assumption breaks (the world shifts after you train, or your sample is not representative), even a model that validated beautifully can fail in deployment. No amount of cleverness removes the need for assumptions.

1.1.5 A taxonomy of learning algorithms

We can now place the algorithms on a finer map. Supervised learning works with labeled pairs \(\{ X_n \in R^d, y_n \in R\}^N_{n=1}\) and is used for prediction, splitting into classification when the output is a discrete label and regression when it is a real value. Unsupervised learning works with inputs alone, \(\{ X_n \in R^d\}^N_{n=1}\), and covers clustering, estimating a probability distribution, finding associations among features, and dimension reduction. Reinforcement learning (Chapter 64) is a third basic mode, in which an agent learns to make a sequence of decisions (as in a robot or a chess engine) from feedback rather than from labeled examples. Beyond these, the book devotes an entire part to many further paradigms, among them semi-supervised learning (Chapter 48), which mixes a small amount of labeled data with a large amount of unlabeled data, self-supervised, active, weakly-supervised, few-shot and zero-shot, meta-learning, transfer and multi-task, online, federated, continual, and adversarial learning.


1.2 General Framework

With the vocabulary in place, we can write down the framework that unifies the supervised methods in this book. Let \(Y\) denote the output we want to predict and let \(X = (X_1,...,X_p)\) denote the \(p\) inputs. The starting assumption is an additive error model: the output is some fixed but unknown function of the inputs, plus random noise,

\[ Y = f(x) + \epsilon \]

where

  • \(f(.)\) = unknown fixed function

  • \(\epsilon\) = random error (\(E(\epsilon) = 0; \epsilon \perp X\))

The error term \(\epsilon\) captures everything that affects \(Y\) but is not in \(X\): unmeasured variables, measurement error, and inherent randomness. We assume it averages to zero and is independent of the inputs.

For prediction we do not need to understand \(f\) at all; we just need a good estimate of it, \(\hat{f}\), which we treat as a black box that turns inputs into a prediction,

\[ \hat{Y} = \hat{f}(X) \]

Why can we never predict perfectly? Because two distinct sources of error are in play, and only one of them is under our control:

  • Reducible Error: \(\hat{f}(.)\) can be improved (in principle)

  • Irreducible Error: \(\epsilon\) because it’s not a function of the inputs, which might contain unknown variables or uncontrolled variation

The following decomposition makes this precise. The expected squared prediction error splits cleanly into a piece we can shrink by building a better model and a piece we are stuck with no matter what:

\[ \begin{aligned} E(Y - \hat{Y})^2 &= E[f(X) + \epsilon - \hat{f}(X)]^2 \\ &= [f(X) - \hat{f}(X)]^2 + var(\epsilon) \\ &= \text{reducible} + \text{irreducible} \end{aligned} \]

Key idea

No model, however sophisticated, can do better than the irreducible error \(var(\epsilon)\). This sets a hard floor on accuracy and is a useful sanity check: if your goal sits below that floor, the answer is to collect better inputs, not to fit a fancier model.

For inference, the object of interest is the conditional mean \(f(X) = E(Y|X)\), but the conditional distribution of \(Y|X\) is unknown and so must be estimated. Either way, the practical goal is to find an \(\hat{f}\) such that \(Y \approx \hat{f}(X) \forall (X,Y)\).

There are two broad strategies for estimating \(f\), and the choice has consequences that run through the entire book. The parametric approach commits to a functional form in advance and then estimates a finite set of parameters:

  • Parametric

    1. Make an assumption about the function form of \(f(X)\)

    2. Try to fit data to that specified form (i.e., estimate the parameters) using OLS in regression or MLE in general.

  • Nonparametric

The nonparametric approach makes no such commitment and lets the data determine the shape of \(f\) directly, which is more flexible but requires far more data to pin down.

This choice exposes a tradeoff between prediction accuracy and model interpretability, summarized in Table 1.2. A restrictive model is easier to interpret and better suited to inference, while a flexible model usually predicts better but is harder to explain.

Table 1.2: Restrictive versus flexible models and the tradeoff between interpretability for inference and accuracy for prediction.
Restrictive Model Flexible Model

More inference

More interpretable

More prediction
When to use this

If your stakeholders need to understand and trust the “why” (a regulator, a clinician, a policymaker), lean restrictive and interpretable. If all that matters is the accuracy of the output and the model can stay a black box, you are free to lean flexible.

1.3 Assessing Model Accuracy - Model fit

We cannot improve what we cannot measure, so before studying any specific method we need ways to score how well a fitted model matches reality. The right score depends on whether the output is continuous or categorical.

1.3.1 Continuous output: Mean squared error (MSE)

For a continuous output the natural score is the mean squared error, the average squared gap between the observed values and the model’s predictions:

\[ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2 \]

The number we actually care about is not the MSE on the data we trained on but the generalization error, a measure of how accurately the model predicts outcomes for a test set it never saw.

Here is the phenomenon that motivates almost everything later in the book. As you make a method more flexible, the training MSE always goes down, because a more flexible model can hug the training points ever more tightly. The test MSE, however, follows a U-shape: it falls at first as the model captures real structure, then rises again once the model starts fitting noise. This U-shape is the visible face of the bias-variance tradeoff, which the next decomposition makes exact:

\[ \begin{aligned} E(MSE) &= E(y_0 - \hat{f}(x_0))^2\\ &= var(\hat{f}(x_0)) + [bias(\hat{f}(x_0))]^2 + var(\epsilon) \\ &= \text{variance} + \text{bias} + \text{irreducible noise} \end{aligned} \]

where \(0\) subscript means training data set. Reading this equation term by term tells the whole story:

  • \(E(MSE) \ge var(\epsilon)\) because \(\epsilon\) is irreducible model error and serves as the lower bound on the quality of potential model fit

  • More flexible means more variance, but less bias.

  • Less flexible (parametric) means less variance, but more bias.

  • Bias-variance tradeoff = Approximation-overfit trade-off

    • As the model gets more complex, it can fit (approximate) the true function better, but it also fits noise better (i.e., overfit)
Intuition

Think of bias as systematic error from a model that is too rigid to capture the truth, and variance as instability, how much your fitted model would jump around if you redrew the training sample. Flexible models bend to fit the data (low bias) but wobble from sample to sample (high variance); rigid models are stable (low variance) but miss the shape (high bias). The art is finding the flexibility that minimizes their sum.

1.3.2 Categorical output: training error rate

When the output is a class label rather than a number, squared error no longer makes sense, so we count mistakes instead. The training error rate is the proportion of misclassifications when \(\hat{f}\) is applied to the training set:

\[ \frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat{y}_i) \]

and test error rate is

\[ \frac{1}{n} \sum_{i=1}^n I(y_0 \neq \hat{y}_0) \]

To know how good a classifier can possibly be, it helps to imagine the best one we could ever build. The Bayes classifier is exactly that ideal: it assigns each observation to the most likely class given its predictor values, that is, the class with the largest conditional probability \(P(Y = j | x_0)\). In words, it assigns \(x_i\) to the class \(j\) for which \(P(Y=j | X=x_0)\) is largest.

Because it always bets on the most probable class, the Bayes classifier achieves the lowest possible test error rate, called the Bayes error rate:

\[ 1 - E_x(max_j [P(Y = j |X)]) \]

This quantity plays the same role for classification that the irreducible error played for regression:

  • where it is greater than 0 if the classes overlap

  • analogous to the irreducible error

Key idea

The Bayes classifier is the gold standard, the lowest error any classifier can reach. We can never build it in practice because we never know \(P(Y|X)\) exactly, so every real classifier is an attempt to approximate that conditional probability.

A concrete example of such an approximation is K-nearest neighbors (KNN), treated in full in Chapter 17, which estimates the conditional probability by simply looking at the labels of the \(K\) closest training points:

\[ P(Y =j |X = x_0 ) \approx \frac{1}{K}\sum_{i \in N_0} I(y_i = j) \]

1.3.3 Curse of dimensionality

Methods like KNN rely on the idea of a “local neighborhood” of nearby points. That idea quietly breaks down as the number of features grows, a problem so important it has its own name: the curse of dimensionality. The arithmetic is striking. Consider a \(p\)-dimensional hypercube of unit volume. To capture a fraction \(r\) of the observations inside a small subcube, the side length of that subcube must average \(e_p(r) = r^{1/p}\). Working out a few values shows how fast the neighborhood explodes:

\[ e_1(.01) = .01 ; e_1(.1) = .1\\ e_{10}(.01) = .63; e_{10} (.1) = .8 \]

In one dimension, capturing 1% of the data needs a window covering 1% of the range. In ten dimensions, capturing that same 1% requires a “subcube” that spans 63% of each axis, which is no longer local in any meaningful sense.

Warning

In high dimensions there is no such thing as a local neighborhood: every point is far from every other point. This is why methods built on local averaging (KNN, kernel smoothers; see Chapter 4) degrade as \(p\) grows, and why dimension reduction and regularization, covered later in this chapter, become necessary rather than optional.

1.3.4 Reminder

Before moving to the harder topics, it is worth pinning down linear regression carefully, because almost every advanced method in this book can be read as a modification of it. Linear regression is the special case of the general framework

\[ Y = f(X) + \epsilon \]

in which \(f\) is linear in the parameters, \(f(X) = \beta_0 + \sum_{j=1}^p X_j \beta_j\). The “linear” refers to the coefficients, not the inputs: each \(X_J\) can itself be many things, which is why the model is far more flexible than it first appears:

  • quantitative inputs

  • transformations of quantitative inputs

  • polynomial terms

  • interactions

  • Discretize (i.e., dummy) variable of quantitative inputs.

We assume additive independent errors.

Whether the model fits well depends heavily on how the sample size \(n\) compares to the number of predictors \(p\). The three regimes in Table 1.3 recur throughout the book and explain why “too many predictors” is a real danger:

Table 1.3: How ordinary least squares behaves in the three regimes of the sample size \(n\) relative to the number of predictors \(p\).
Relationship between n and p Result
\(n >>p\) low bias, low predictive variance, fit well with test data
\(n \approx p\) substantial variability, overfit, fit poorly with test data
\(n <p\) no unique solution, variance is infinite

We choose the coefficients \(\beta\) to minimize the residual sum of squares (RSS), the total squared gap between observed and fitted values, which can be written compactly using matrix notation:

\[ RSS(\beta) = \sum_{i=1}^n(y_i - f(x_i))^2 = \sum_{i=1}^n (y_i -\beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 \\ = (\mathbf{y} - \mathbf{X\beta})^T (\mathbf{y} - \mathbf{X\beta}) \equiv ||\mathbf{y} - \mathbf{X\beta}||^2 \]

which is the square of the \(L^2\) norm. Setting the derivative with respect to \(\beta\) to zero,

\[ \mathbf{X}^T (\mathbf{y} - \mathbf{X\beta}) = 0 \]

gives the familiar closed-form solution

\[ \hat{\beta} = \mathbf{X}^T\mathbf{X}^{-1}\mathbf{X}^T \mathbf{y} \]

and the variance of those estimates

\[ var(\hat{\beta}) = (\mathbf{X}^T \mathbf{X})^{-1} \sigma^2 \\ = (\mathbf{X}^T \mathbf{X})^{-1} var(\epsilon) \]

Prediction

The fitted values follow by plugging \(\hat{\beta}\) back in,

\[ \hat{\mathbf{y}} = \mathbf{X} \hat{\beta} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]

where the matrix sandwiched between \(\mathbf{y}\) and nothing,

\[ \mathbf{H} \equiv \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T \]

is known as the hat matrix, which projects orthogonally the \(\mathbf{y}\) vector onto the subspace of \(\mathbf{X}\). (The hat matrix returns in the section on cross-validation, where its diagonal entries give us leave-one-out errors almost for free.)

The Gauss-Markov theorem tells us that OLS estimates are unbiased and have the smallest variance among all linear unbiased estimates. That sounds like the end of the story, but it is not, and the reason is the bias-variance tradeoff again. For any estimator \(\tilde{\beta}\), the mean squared error decomposes as

\[ MSE(\tilde{\beta}) = E(\tilde{\beta} - \beta)^2 = var(\tilde{\beta} + [E(\tilde{\beta}) - \beta]^2 \\ = variance + bias ^2 \]

Since OLS is an unbiased estimator, \(\tilde{\beta} = \hat{\beta}\).

Key idea

Gauss-Markov only guarantees that OLS is best among unbiased estimators. We could often achieve a smaller MSE by deliberately accepting a little bias in exchange for a large drop in variance. That single observation is the entire justification for the shrinkage methods (ridge, lasso) introduced later in this chapter.

Inference. If we additionally assume normal errors, we can attach uncertainty to the estimates. The error variance is estimated by

\[ \hat{\sigma}^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

and assuming that \(\epsilon \sim N(0, \sigma^2)\) then

\[ \hat{\beta} \sim N(\beta, (\mathbf{X}^T \mathbf{X})^{-1}\sigma^2) \]

\[ (n-p-1) \hat{\sigma}^2 \sim \sigma^2 \chi^2_{n-p-1} \]

Warning

If \(p\) is large (many covariates), you will likely find at least one variable significant purely by chance, so an individual partial \(t\)-statistic can mislead you. The \(F\)-test does not suffer from this problem because it accounts for the number of predictors jointly. Treat per-coefficient significance with suspicion in wide datasets.

Real data rarely satisfy the linear model’s assumptions exactly, and it is worth knowing the standard failure modes so you can diagnose them: nonlinearity of the response-predictor relationship, correlated errors, non-constant error variance, high-leverage points, and collinearity.

Tip

Parametric methods tend to outperform nonparametric ones when the number of observations per predictor is small. When data are scarce, the structure imposed by a parametric model is an asset, not a limitation.


1.3.5 Model Assessment

We now turn from “what is the model?” to “how good is it, really?” The generalization performance of a learning method is how well it predicts independent test data, and it is the objective basis for choosing between models. It serves two related jobs that are worth keeping distinct: model selection, choosing among candidate models, and model assessment, estimating the chosen model’s true error.

To formalize this, suppose we have an output variable \(Y\), a vector of inputs \(X\), and a prediction model \(\hat{f}(X)\) estimated from a training set \(T\). We measure the gap between \(Y\) and \(\hat{f}(X)\) with a loss function. The most common choice for continuous outputs is squared-error loss,

\[ L(Y, \hat{f}(X)) = (Y- \hat{f}(X))^2 \]

with absolute error loss as a more robust alternative,

\[ L(Y, \hat{f}(X)) = |Y- \hat{f}(X)| \]

1.3.5.1 Types of Error

There are three distinct error quantities that are easy to confuse but mean genuinely different things, and getting them straight is essential for reading the cross-validation results that follow.

The first is the test error (or generalization error), the prediction error over an independent test sample for this particular fitted model:

\[ Err_T = E_{X^\circ, Y^\circ}[L(Y^\circ, \hat{f}(X^\circ))|T] \]

where \(X\) and \(Y\) are drawn randomly from their joint population distribution and the training set \(T\) is held fixed. So \(Err_T\) is the error when \(f(.)\) is estimated from this one specific training set.


The second is the expected prediction error (expected test error), which goes one step further and averages over all the training sets we might have drawn:

\[ Err = E_T E_{X^\circ, Y^\circ}[L(Y^\circ, \hat{f}(X^\circ))|T] = E_T [Err_T] \]

The third is the training error, the average loss on the very data used to fit the model:

\[ \bar{err} = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}(x_i)) \]

Here is the practical catch. What we usually want is \(Err_T\), the error of our actual fitted model, but estimating it well is nearly impossible without a separate test set. Within a single data set we approximate the validation step either by adjusting the training error (as AIC-style criteria do) or by reusing the sample efficiently (as cross-validation and the bootstrap do). The honest admission is that these tricks give us good estimates of \(Err\), the average over training sets, not of \(Err_T\), the error for our specific one. The bias-variance tradeoff, viewed through a loss function, is really a statement about how well the training error stands in for the test error.

1.3.5.2 Model Error/ Model Selection

The fundamental obstacle is that the training error is optimistic: it is systematically lower than the true test error, because the model has already seen, and partly memorized, the very points it is being scored on. For a deeper treatment of how to quantify this optimism, see (Hastie, Friedman, and Tibshirani 2001) Ch. 7.4. The size of the gap has an intuitive driver:

the amount that \(\bar{err}\) underestimates the true test error depends on how strongly \(y_i\) affects its own prediction. In other words, the larger \(cov(\hat{y}_i, y_i)\), the more optimistic the training error is. And the greater the effective number of parameters (the more flexible the model), the larger this covariance.

This single insight organizes the two families of remedies. (Hastie, Friedman, and Tibshirani 2001) Ch. 7.5 shows that most model-selection criteria (AIC, BIC, \(C_p\)) work by estimating that optimism and adding it back to the training error, producing an unbiased estimate of prediction error. The alternative family, cross-validation and the bootstrap, instead estimates the expected prediction error \(Err\) directly, without modeling the optimism at all.

Note

A model with a larger effective number of parameters is more flexible. The word “effective” matters because for many modern methods the count of tunable knobs is not an integer; the definition below makes it precise.

To make “effective number of parameters” concrete, restrict attention to linear fitting methods. Writing the outputs as

\[ \mathbf{y} = (y_1, \dots, y_n)' \]

a linear fitting method can be expressed as \(\mathbf{\hat{y}} = \mathbf{Sy}\), where \(\mathbf{S}\) is an \(n \times n\) matrix that depends on the inputs \(x_i\) but not on the responses \(y_i\). Ordinary linear regression is the canonical case, with \(\mathbf{S = X(X'X)^{-1}X'}\) (the hat matrix from before).

The effective number of parameters, or degrees of freedom, is then just the trace of that matrix:

\[ df(\hat{\mathbf{y}}) = df (\mathbf{S}) = trace(\mathbf{S}) \]

which is the sum of the diagonal elements of \(\mathbf{S}\). And tying back to the optimism story, if \(\hat{\mathbf{y}}\) arises from an additive-error model \(Y = f(X) + \epsilon\) with \(var(\epsilon) = \sigma^2_\epsilon\), then the degrees of freedom equal the total covariance between fits and observations, scaled by the noise:

\[ df(\hat{\mathbf{y}}) = \frac{\sum_{i=1}^ncov(\hat{y}_i, y_i)}{\sigma^2_\epsilon} \]


1.3.5.3 Cross-validation

The most direct way to estimate test error would be to set aside a chunk of data the model never sees. If your data set is large, you can randomly split it into a training set to fit the model and a validation set (also called a holdout or test set) to evaluate the error under your loss of choice. This validation-set approach is simple but has two real weaknesses:

  • It can be highly variable

  • With fewer observations to fit the model, the statistical models perform worse, and validation set error rate may overestimate the test error rate

The code below illustrates the approach on the Auto data, comparing a linear, quadratic, and cubic fit of mpg on horsepower. We fit on a random subset and report the MSE computed only on the held-out observations.

Show code
library(tidyverse)
library(ISLR2)
train <- sample(392, 120) # select random subset of 120 out of 392
Auto <- Auto %>% drop_na()
lm.fit <- lm(mpg ~ horsepower , data = Auto , subset = train)

attach(Auto)
mean((mpg - predict(lm.fit, Auto))[-train] ^ 2) # MSE in the validation set (not the training set)
#> [1] 24.53542

# Quadratic regression
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mean((mpg - predict(lm.fit2, Auto))[-train] ^ 2) # MSE
#> [1] 20.75886

# Cubic regression
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
mean((mpg - predict(lm.fit3, Auto))[-train] ^ 2) # MSE
#> [1] 21.17195

The quadratic fit improves substantially on the linear one, while the cubic term buys little extra. The lesson is the same U-shape we saw earlier, now visible in code: there is a sweet spot in flexibility, and going past it does not help. The cross-validation methods that follow are designed to make this single split less wasteful and less variable.


1.3.5.3.1 Leave-One-Out CV (LOOCV)

The most extreme way to make the holdout small is to hold out exactly one observation at a time. Leave-one-out cross-validation (LOOCV) is the special case of K-fold CV with \(K = n\): fit the model on all but one point, predict that point, and average the resulting errors,

\[ CV_{(n)} = \frac{1}{n} \sum_{i=1}^n MSE_i \]

where \(MSE_i = (y_i - \hat{f}^{-i}(x_i))^2\) (an alternative loss function could also be used). LOOCV trades one weakness for another:

  • LOOCV has low bias as an estimate of the expected test error rate, but very high variance

  • Individual MSE’s in LOOCV are highly correlated because they pretty much have the same data set expect one observation.

  • LOOCV is expensive to implement

Intuition

Because every leave-one-out training set differs by just a single point, the \(n\) fitted models are almost identical, so their errors are highly correlated. Averaging highly correlated quantities does little to reduce variance, which is why LOOCV, despite its low bias, can be a noisy estimate of test error.

In R, fitting with glm() (rather than lm()) lets us pass the model to cv.glm() from the boot package, which computes the LOOCV error for us. Note that glm() without a link family gives exactly the same coefficients as lm().

Show code
Auto <- Auto %>% drop_na()
library(tidyverse)
glm.fit <- glm (mpg ~ horsepower , data = Auto) # allows to be passed to cv.glm()
coef(glm.fit)
#> (Intercept)  horsepower 
#>  39.9358610  -0.1578447

# exactly the same as glm without binomial link
lm.fit <- lm(mpg ~ horsepower , data = Auto)
coef(lm.fit)
#> (Intercept)  horsepower 
#>  39.9358610  -0.1578447

library(boot)
glm.fit <- glm (mpg ~ horsepower , data = Auto)
cv.err <- cv.glm (Auto , glm.fit)
cv.err$delta # cross-validation estimate for the test error
#> [1] 24.23151 24.23114
# the first delta is the standard CV estimate, the second delta is a bias-corrected version


# automate to different polynomial
cv.error <- rep (0, 10)
for (i in 1:10) {
    glm.fit <- glm(mpg ~ poly(horsepower , i), data = Auto)
    cv.error[i] <- cv.glm(Auto , glm.fit)$delta[1] # cross-validation estimate for the test error
}
cv.error # a drop from lienar to quadratic fit
#>  [1] 24.23151 19.24821 19.33498 19.42443 19.03321 18.97864 18.83305 18.96115
#>  [9] 19.06863 19.49093

The loop reports the LOOCV error for polynomials of degree 1 through 10. The big improvement comes in going from the linear to the quadratic fit; after that the error is essentially flat, confirming that a quadratic is the right level of flexibility here.


1.3.5.3.2 K-Fold Cross-Validation

The standard compromise between a single split and the full LOOCV is K-fold cross-validation. We randomly split the data into \(K\) equal-sized parts (folds). Each fold is held out in turn while the model is trained on the remaining \(K-1\) folds, and the loss is evaluated on the held-out fold.

Formally, for folds \(k = 1, \dots, K\), let \(\hat{f}^{-k}(x)\) denote the function fitted with the \(k\)-th fold removed, and compute the loss on the samples in that fold. With squared-error loss we get \(MSE_k\) for each fold, and the K-fold estimate is their average:

\[ CV_{(K)} = \frac{1}{K} \sum_{k=1}^K MSE_k \]

Tip

In practice \(K = 5\) or \(K = 10\) is the standard choice. These values give a good balance: each training set is large enough that the models are not badly degraded, yet the held-out folds are different enough that the errors are not as correlated as in LOOCV. K-fold with \(K=10\) is the default you should reach for unless you have a specific reason not to.

The code is identical to the LOOCV loop except for the K = 10 argument:

Show code
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
    glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
    cv.error.10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1] # K-fold cross validation
}
cv.error.10
#>  [1] 24.33123 19.24309 19.43197 19.68838 19.13228 18.93331 19.11709 19.02121
#>  [9] 19.12358 19.25690

The results tell the same story as LOOCV, a sharp drop from degree 1 to 2 and little change afterward, but were far cheaper to compute, which is the whole point of K-fold.


1.3.5.3.3 Generalized Cross-Validation

For linear fitting methods we can sometimes skip the refitting entirely. Recall the PRESS statistic, which says that the leave-one-out error can be obtained from a single fit by dividing each residual by \(1 - S_{ii}\):

\[ \frac{1}{n} \sum_{i=1}^n [y_i - \hat{f}^{-i}(x_i)]^2 = \frac{1}{n} \sum_{i=1}^n [\frac{y_i - \hat{f}(x_i)}{1- S_{ii}}]^2 \]

where \(S_{ii}\) is the \(i\)-th diagonal element of \(\mathbf{S}\) (the \(i\)-th hat-matrix diagonal in linear regression). So we only have to fit the model once.

Generalized cross-validation (GCV) is a closely related approximation for when the trace of \(\mathbf{S}\) is easier to compute than the individual diagonal entries \(S_{ii}\); it simply replaces each \(S_{ii}\) with the average \(trace(\mathbf{S})/n\):

\[ GCV(\hat{f})= \frac{1}{n} \sum_{i=1}^n [\frac{y_i - \hat{f}(x_i)}{1- trace(\mathbf{S})/n}]^2 \]


1.3.5.3.4 CV and Classification

Everything above used squared-error loss, but cross-validation works for classification too once we swap in an appropriate loss. The most common is 0-1 loss, which simply counts a misclassification as one error:

\[ L(G, \hat{G}(X)) = I(G \neq \hat{G}(X)) \]

where \(\hat{G}(X)\) is the class with the highest estimated conditional probability given the inputs.

A more refined alternative is the deviance, which uses the estimated probabilities directly rather than only the hard class assignment. If \(p_k(X) = P(G = k|X)\) and \(\hat{G}(X)\) is the class maximizing \(\hat{p}_k(X)\), then

\[ L(G, \hat{p}(X)) = -2 \sum_{k=1}^K I(G = k) \log \hat{p}_k (X) \equiv -2 \log \hat{p}_G (X) \]

which is \(-2\) times the log-likelihood (the deviance). It rewards a classifier for being confident and correct, and penalizes confident mistakes harshly.

Warning

Cross-validation must wrap the entire sequence of modeling steps. Any step that uses the outputs (variable selection, feature filtering by correlation with \(y\)) must happen inside the CV loop, after the fold is removed. Selecting features on the full data and then cross-validating only the final fit is a common and serious mistake that produces wildly optimistic error estimates. The one exception is a purely unsupervised preprocessing step on the inputs; since it never touches the outputs, it can be done once outside the loop.

1.3.5.4 Bootstrapping

Cross-validation estimates prediction error; the bootstrap is the natural tool when you instead want to quantify the uncertainty (the standard error or confidence interval) of an estimator. The idea is to use the data set as a stand-in for the population and resample from it. The procedure is:

  • To quantify the uncertainty associated with a given estimator

  • Get new data sets by sampling from the observed data set many times with replacement

  • Sample \(B\) different sets and calculate the quantity of interest about which you want to do inference \(\hat{\alpha}^{*1},\dots, \hat{\alpha}^{*B}\)

  • Then, you look at the distribution of samples and apply Monte Carlo procedures to do inference

  • For example, you want to get the standard error

\[ SE_B (\hat{\alpha}) = \sqrt{\frac{1}{B-1} \sum_{r=1}^B (\hat{\alpha}^{*r} - \frac{1}{B} \sum_{r'=1}^B \hat{\alpha}^{*r'})^2} \]

A few practical points are worth absorbing before the code. The bootstrap can in principle estimate model error, but it tends to be optimistic because the resampled training and test sets share many of the same observations; “leave-one-out” bootstrap variants help, though a training-set bias problem remains, and (Hastie, Friedman, and Tibshirani 2001) gives an estimator that reduces it further.

When to use this

Reach for cross-validation when you want to evaluate predictive performance (how well does the model predict new data?). Reach for the bootstrap when you want to evaluate the properties of an estimator (what is the standard error of this coefficient or statistic?). They answer different questions.

For a genuine big-data prediction problem you can combine the two approaches in a clean three-step recipe:

  • Use K-fold CV on the training data to choose the parameters that control flexibility of the model

  • Use the entire training data to fit the model selected by CV

  • Use a large hold-out sample to evaluate the test error from this fitted model (compare to the CV error).

The first example estimates the standard error of a portfolio statistic. We write a function returning the statistic of interest, confirm it works on a fixed index and on a random resample, then hand it to boot() to automate the resampling.

Show code
summary(Portfolio)
#>        X                  Y           
#>  Min.   :-2.43276   Min.   :-2.72528  
#>  1st Qu.:-0.88847   1st Qu.:-0.88572  
#>  Median :-0.26889   Median :-0.22871  
#>  Mean   :-0.07713   Mean   :-0.09694  
#>  3rd Qu.: 0.55809   3rd Qu.: 0.80671  
#>  Max.   : 2.46034   Max.   : 2.56599
alpha.fn <- function(data, index) {
    X <- data$X[index]
    Y <- data$Y[index]
    (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)) # statistic of interest
}

alpha.fn(Portfolio, 1:100)
#> [1] 0.5758321
alpha.fn(Portfolio, sample(100, 100, replace = T)) # randomly select 100 obs from 1:100 with replacement
#> [1] 0.6151444


# Another faster way
boot(Portfolio, alpha.fn, R = 100) # R = number of replicates (samples)
#> 
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#> 
#> 
#> Call:
#> boot(data = Portfolio, statistic = alpha.fn, R = 100)
#> 
#> 
#> Bootstrap Statistics :
#>      original    bias    std. error
#> t1* 0.5758321 0.0133305   0.0935539

The second example bootstraps the coefficients of a linear model and compares the bootstrap standard errors to the ones the lm() summary reports.

Show code
boot.fn <- function(data,index){
    coef(lm(mpg ~ horsepower, data = data, subset = index))
}
boot.fn(Auto, 1:392)
#> (Intercept)  horsepower 
#>  39.9358610  -0.1578447
boot.fn(Auto, sample(392,392,replace =T))
#> (Intercept)  horsepower 
#>  39.4620939  -0.1522598

boot(Auto, boot.fn,1000)
#> 
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#> 
#> 
#> Call:
#> boot(data = Auto, statistic = boot.fn, R = 1000)
#> 
#> 
#> Bootstrap Statistics :
#>       original        bias    std. error
#> t1* 39.9358610  0.0389358935  0.87374460
#> t2* -0.1578447 -0.0004985091  0.00753102

# there is a difference between the 2 SE estimates
summary(lm(mpg ~ horsepower, data = Auto))$coef
#>               Estimate  Std. Error   t value      Pr(>|t|)
#> (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
#> horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81




# Quadratic example
# boot.fn <- function(data,index){
#     coef(
#         lm(mpg ~ horsepower + I(horsepower^2),
#            data = data, subset = index)
#     )
# }
# boot(Auto, boot.fn, 1000)
# summary(
#     lm(mpg ~ horsepower + I(horsepower^2), data = Auto)$coef
# )

The bootstrap and the lm() standard errors disagree, and the disagreement is informative rather than a bug. There are two reasons:

  1. LM estimates \(\sigma^2\) suing the RSS, which relies heavily on the linear model being correct. In other words, for example, if we have non-linearity in our data, the residuals (\(e\)) will be inflated, and leads to \(\hat{\sigma}^2\) being inflated as well.
  2. LM assumes \(x_i\) fixed and variability comes from \(var(\epsilon_i)\)

In short, the lm() formula trusts the model’s assumptions, while the bootstrap does not; when the assumptions are even slightly off, the bootstrap standard errors are often the more honest ones.


1.4 Improving Prediction Accuracy and Interpretability

The rest of this chapter addresses a problem that arises constantly in modern data: too many predictors. Concretely, suppose you face one or more of the following:

  1. large number of predictors
  2. highly dependent predictors
  3. low interpretability

Ordinary least squares handles none of these gracefully. There are three broad families of remedies, and the rest of the chapter takes them one at a time:

  1. Subset Selection

  2. Shrinkage Methods (also known as “regularization”): force (shrink) p predictors’ parameters towards zero to reduce variance. It can also be used for variable selection

  3. Dimension Reduction: Project the p predictors in an M-dimensional subpsace where \(M <p\) (i.e., compute M different linear combination or projection of the p predictors and use them as new predictors). (e.g., principal component analysis)

Key idea

All three families fight the same enemy, variance from having too many parameters, but by different means. Subset selection drops predictors entirely; shrinkage keeps them all but pulls their coefficients toward zero; dimension reduction replaces the predictors with a few informative combinations. Each accepts a little bias to win a larger reduction in variance.

1.4.1 Subset Selection

The most direct approach is to discard the predictors you do not need. This is also known as Variable Selection (follow the hyperlink for more detail).

1.4.1.1 Best Subset Selection

The brute-force version, best subset selection, considers every possible combination of predictors and keeps the best. Concretely, fit all \(p\) one-predictor models, then all \(p(p-1)/2\) two-predictor models, and so on, for \(2^p\) models in total, then pick the winner using one of the standard criteria:

  • Fit all \(p\) one by one, then all \(p(p-1)/2\) two predicotrs models, etc. to have \(2^p\) possible models. Then you find the best one using

    • \(C_p\)

    • AIC

    • BIC

    • Adjusted \(R^2\)

    • cross-validation error measure

A practical speedup is the leaps-and-bounds algorithm of Furnival and Wilson (1974), which is feasible up to about 40 predictors. The usual workflow is to run best-subset selection first and then confirm the choice with cross-validation.

The code below loads the Hitters data, drops the rows with a missing salary, and runs regsubsets() from the leaps package to perform best-subset selection.

Show code
library(tidyverse)
library(ISLR2)
head(Hitters) # check dataset
#>                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
#> -Andy Allanson      293   66     1   30  29    14     1    293    66      1
#> -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
#> -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
#> -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
#> -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
#> -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
#>                   CRuns CRBI CWalks League Division PutOuts Assists Errors
#> -Andy Allanson       30   29     14      A        E     446      33     20
#> -Alan Ashby         321  414    375      N        W     632      43     10
#> -Alvin Davis        224  266    263      A        W     880      82     14
#> -Andre Dawson       828  838    354      N        E     200      11      3
#> -Andres Galarraga    48   46     33      N        E     805      40      4
#> -Alfredo Griffin    501  336    194      A        W     282     421     25
#>                   Salary NewLeague
#> -Andy Allanson        NA         A
#> -Alan Ashby        475.0         N
#> -Alvin Davis       480.0         A
#> -Andre Dawson      500.0         N
#> -Andres Galarraga   91.5         N
#> -Alfredo Griffin   750.0         A

# check missing values
sum(is.na(Hitters$Salary)) # 59 missing data poitns
#> [1] 59

# eliminate missing values
Hitters <- Hitters %>%
    drop_na(Salary)

# Best subset seleciton (by RSS)
library(leaps)
regfit.full <- regsubsets(Salary ~ ., Hitters)
# summary(regfit.full)


1.4.1.2 Stepwise Selection

Because best subset selection examines \(2^p\) models, it becomes infeasible for large \(p\). Stepwise selection is the greedy alternative: it adds or removes one predictor at a time, examining far fewer models. There are three common variants, each with its own constraints:

  • With very large numbers of p, stepwise selection can be less computational intensive

  • Forward stepwise selection

    • can be used when \(p >n\), but only models up to order \(n-1\)
  • Backward stepwise selection

    • can’t be applied when \(p>n\)
  • Hybrid stepwise

When you use cross-validation to pick the model size, do not simply grab the size with the lowest CV error, because that estimate is itself noisy. Instead apply the one-standard-error rule:

  • Evaluates cross-validation criteria (MSE) for each model size

  • Select the smallest model where the estimated test error is within one SE of the lowest point on the curve (i.e.more parsimonious model)

Tip

The one-standard-error rule deliberately prefers a simpler model over a marginally-better-scoring complex one. Since the CV curve is estimated with noise, models within one standard error of the minimum are statistically indistinguishable, so you may as well take the simplest among them.

1.4.2 Shrinkage Methods

Rather than dropping predictors outright, shrinkage methods keep all of them but constrain the least-squares solution so that the coefficients are pulled (regularized) toward zero. This introduces a little bias but reduces variance, often a very good trade in the high-\(p\) setting.

1.4.2.1 Ridge regression

Ridge regression is the first shrinkage method and has a useful side benefit: it can handle multicollinearity. Mechanically, it adds a constant to the diagonal of the \(\mathbf{X}^T \mathbf{X}\) matrix, which produces well-behaved, low-variance estimates even when \(\mathbf{X}^T \mathbf{X}\) is nearly singular (the situation that makes ordinary OLS blow up).

To see it, recall that ordinary least squares minimizes the residual sum of squares,

\[ RSS = \sum_{i=1}^n(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij})^2 \]

Ridge regression adds a penalty on the size of the coefficients, estimating the \(\hat{\beta}^R\) that minimize

\[ \sum_{i=1}^n (y_i- \beta - \sum_{j=1}^p \beta_j x_{ij})^2 + \lambda \sum_{j=1}^p \beta_j^2 \]

where \(\lambda\) is a tuning parameter. The added term \(\lambda \sum_{j=1}^p \beta_j^2\) is an \(l_2\)-norm shrinkage penalty that is small only when the coefficients are near zero, so minimizing the whole objective forces \(\hat{\beta}^R\) to be closer to zero than the OLS estimates. The tuning parameter controls how hard we push:

  • When \(\lambda = 0\), we get the LS estimates

  • When \(\lambda \to \infty\) , the coefficients go to 0.

Because the penalty deliberately excludes the intercept (we do not want to shrink the overall level of \(Y\)), we first center all covariates by removing each column’s mean, which makes \(\hat{\beta}_0= \bar{y}\). The ridge solution then has a closed form much like OLS but with \(\lambda \mathbf{I}\) added to the cross-product matrix:

\[ \hat{\beta}^R = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T \mathbf{y} \]

The tuning parameter \(\lambda\) is chosen by cross-validation. Several properties are worth keeping in mind:

Warning

The ridge estimator is not scale-invariant: rescale a predictor and its coefficient changes substantially. You must therefore standardize (center and divide each column by its standard deviation) before fitting, or the penalty will fall unequally on predictors merely because of their units.

  • \(\lambda\) allows for a continuous range of bias-variance trade-offs.

    • As \(\lambda\) increases, the variance decreases but the bias increases. Hence, it works well when LS estimates have high variance.
  • Possible even when \(p>n\)

There are two illuminating ways to view ridge regression. From a Bayesian standpoint, if \(y_i \sim N(\beta_0 + x^T_i\beta, \sigma^2)\) is the likelihood and the coefficients have independent priors \(\beta_j \sim N(0, \tau^2)\) with known \(\tau^2, \sigma^2\), then \(\lambda = \sigma^2 / \tau^2\) (the ratio of noise variance to prior variance) and the ridge estimate is the posterior mode, which here equals the posterior mean since the posterior is Gaussian. From the geometry of principal components, ridge shrinks the low-variance principal-component directions the most, a connection made fully explicit in the Principal Component Regression section.


1.4.2.2 Lasso regression

Ridge has one frustrating limitation: short of \(\lambda = \infty\) it never sets a coefficient exactly to zero, so it cannot produce a sparse, easily-read model. To fix this, Tibshirani (1996) proposed the lasso, which minimizes

\[ \sum_{i=1}^n (y_i - \beta_0- \sum_{j=1}^p \beta_j x_{ij})^2 + \lambda\sum_{j=1}^p|\beta_j| \]

The only change from ridge is the penalty term:

  • Lasso considers \(l_1\) norm penalty (given by \(||\beta||_1 = \sum_j |\beta_j|\))

  • Ridge considers \(l_2\) norm penalty (given by \(||\beta||_2 = \sqrt{\sum_j \beta_j^2}\)) similar to Euclidean distance

That seemingly small change has a large consequence: once \(\lambda\) is large enough, the lasso sets some coefficients exactly to zero, dropping those predictors from the model and performing variable selection automatically. There is no closed form for \(\hat{\beta}^L\), and \(\lambda\) is again chosen by cross-validation.

Intuition

Why does the \(l_1\) penalty zero out coefficients while the \(l_2\) penalty does not? The \(l_1\) constraint region is a diamond with sharp corners that sit on the axes, and the solution tends to land on a corner, where some coordinates are exactly zero. The \(l_2\) region is a smooth ball with no corners, so it shrinks coefficients toward zero but rarely lands exactly on it.

A few comparisons help decide between the two:

  • Lasso outperforms ridge when there are redundant predictors, but ridge outperforms lasso when all included predictors are needed

  • Lasso shrinks each parameter by approximately the same amount, while ridge shrinks proportionally

  • Lasso equivalent Bayesian interpretation is that: if the prior distribution on the parameters \(\beta_j\) are independent double exponential (Laplace) distribution with mean 0 and scale parameter \(\lambda\), the posterior mode for \(\beta\) is the lasso solution (but no the posterior mean); the Laplace prior puts more “mass” at 0, giving more a priori belief that the coefficients are 0.

1.4.2.3 Comparison

Best subset selection, lasso, and ridge can all be written as the same minimization of RSS subject to a constraint on the coefficients, differing only in the shape of the constraint:

\[ \displaystyle{\min_{\beta}} \sum_{i=1}^n (y_i - \beta_0- \sum_{j=1}^p \beta_j x_{ij})^2 \text{ subject to } \begin{cases} \sum_{j=1}^p |\beta_j| \le s \text{ (lasso)} \\ \sum_{j=1}^p \beta_j^2 \le s \text{ (ridge)} \\ \sum_{j=1}^p I(\beta_j \neq 0) \le s \text{ (best subset)} \end{cases} \]

All three seek to minimize the RSS subject to a budget on the parameters. The best-subset constraint caps the number of nonzero coefficients at \(s\), which is exactly what makes it combinatorially hard and infeasible for large \(p\). Lasso and ridge replace that discrete cap with a continuous one, which is why they remain tractable.

1.4.2.4 Cross-validation Fitting

Whichever penalized method you use, the tuning parameter is chosen the same way. The recipe is:

  1. Choose a “grid” of \(\lambda\) values
  2. Compute the cross-validation error for each vlaue of \(\lambda\)
  3. Select the \(\lambda\) for which the cross-validation error is smallest (one-standard error rule)
  4. Refit the model using all available observations for the selected tuning parameter.


1.4.2.5 Generalization

Ridge and lasso are two points on a continuum. The general penalized objective is

\[ \sum_{i=1}^n (y_i - \beta_0- \sum_{j=1}^p \beta_j x_{ij})^2 + \lambda \sum_{j=1}^p |\beta_j|^q \]

for \(q \ge 0\), where \(q = 0, 1, 2\) correspond to subset selection, lasso, and ridge respectively. In principle we could estimate \(q\) from the data, but it is generally not worth the effort. A more useful middle ground is the elastic net, which combines the ridge and lasso penalties to get both grouped shrinkage and sparsity.


There is also a fully Bayesian route to variable selection, building on the observation that the penalty term is the prior on \(\beta\) (a normal prior gives ridge, a Laplace prior gives lasso). Treating model selection itself probabilistically leads to stochastic search variable selection (SSVS) George and McCulloch (1993).

SSVS places a hierarchical mixture-of-normals prior on each coefficient:

\[ \beta_j |\gamma_j \sim \gamma_j N(0, c_j \tau^2_j) + (1- \gamma_j)N(0, \tau^2_j) \]

\[ \gamma_j \sim^{idd}Bern(\pi_j) \]

where \(\gamma_j=1\) means the \(j\)-th variable is in the model, with probability \(\pi_j\). We want \(\tau_j\) small so that when \(\gamma_j = 0\) the coefficient \(\beta_j\) is pulled close to zero, and we want \(c_j\) large so that when \(\gamma_j = 1\) the coefficient is free to be substantial (analogous to ridge). A close relative, the spike and slab prior, replaces the second mixture component with a delta spike exactly at zero (analogous to lasso).

1.4.3 Dimension Reduction

The third strategy neither drops predictors nor shrinks their coefficients in place; instead it replaces the \(p\) original predictors with a smaller set of \(M < p\) combinations of them. Let \(Z_1, \dots, Z_M\) be \(M\) linear combinations of the original predictors:

\[ Z_m = \sum_{j=1}^p \phi_{jm} X_j, m = 1,\dots, M \]

for constants \(\phi_{1m}, \dots, \phi_{pm}\). We then run an ordinary least-squares regression of \(Y\) on these new combinations rather than on the originals:

\[ y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i, i = 1, \dots,n \]

The whole game is choosing the weights \(\phi_{jm}\) well, so that this smaller regression does better than OLS on the raw inputs. A short calculation shows exactly what kind of model this is:

\[ \begin{aligned} \sum_{m=1}^M \theta_m z_{im} &= \sum_{m=1}^M \theta_m \sum_{j=1}^p \phi_{jm} x_{ij} \\ &= \sum_{j=1}^p \sum_{m=1}^M \theta_m \phi_{jm} x_{ij} \\ &= \sum_{j=1}^p \beta_j x_{ij} \\ \end{aligned} \]

so that the implied original-scale coefficients are

\[ \beta_j = \sum_{m=1}^M \theta_m \phi_{jm} \]

Key idea

Dimension-reduction regression is just OLS with the coefficients \(\beta_j\) constrained to lie in a lower-dimensional space. When \(M = p\) and the \(Z_m\) are linearly independent, the constraint disappears and you recover plain OLS (with the bonus that the new predictors are orthogonal). The benefit appears when \(M \ll p\): the constraint biases the fit slightly but cuts its variance sharply.

As with the Shrinkage Methods, this constrained solution is biased but typically has much lower variance in the fitted coefficients, especially when \(M \ll p\).

1.4.3.1 Principal Component Regression

The most common way to choose the combinations ignores \(Y\) entirely and simply finds the directions of greatest variation in the inputs. The procedure is:

  1. Find the linear combination of the inputs that accounted for the largest fraction of variance of the inputs
  2. Find another linear combination of the inputs that accounted for the next highest amount of variation (subject to being orthogonal to the first)
  3. Continue the process

These directions are computed efficiently by the singular value decomposition (SVD) of the centered data matrix. Let \(\mathbf{X}\) (\(n \times p\)) be the centered data matrix; then

\[ \mathbf{X} = \mathbf{UDV}^T \]

where \(\mathbf{U}\) and \(\mathbf{V}\) are \(n \times p\) and \(p \times p\) orthogonal matrices, and \(\mathbf{D}\) (the singular values) is a \(p \times p\) diagonal matrix with ordered diagonal entries \(d_1 \ge d_2 \ge \dots \ge d_p \ge 0\). It follows that

\[ \mathbf{X}^T \mathbf{X} = \mathbf{VD}^2 \mathbf{V}^T \]

which is the eigen-decomposition of \(\mathbf{X}^T \mathbf{X}\), and (up to the factor \(n\)) of the sample covariance matrix \(\mathbf{S} = \frac{\mathbf{X}^T \mathbf{X}}{n}\). The columns of \(\mathbf{V}\) are the eigenvectors and serve as the principal component weights, so the new variables are

\[ \mathbf{z}_1 = \mathbf{Xv}_1 , \dots, \mathbf{z}_M = \mathbf{Xv}_M \]

where \(\mathbf{v}_m\) is the \(m\)-th column of \(\mathbf{V}\). These \(\mathbf{z}\)’s are the principal components, and the elements of \(\mathbf{v}_m\) are exactly the \(\phi\)’s from the dimension-reduction setup above. The variance captured by each component is governed by its singular value,

\[ var(\mathbf{z}_m) = \frac{d^2_m}{n} \]

so large singular values mark directions of large variance and small ones mark directions of little variance. Principal component regression (PCR) then constructs the first \(M\) components \(Z_1, \dots, Z_M\) and uses them as the predictors in a least-squares regression.

Warning

PCR rests on the assumption that the directions in which \(X_1, \dots, X_p\) vary the most are also the directions most associated with \(Y\). This is usually, but not always, true: the inputs are chosen entirely without looking at \(Y\), so PCR can in principle discard a low-variance direction that happens to be the one that predicts \(Y\). And capturing variance never establishes causation.

A few further notes situate PCR among its neighbors:

  • While capturing most of the variance from inputs, since \(M <<p\) we don’t run into the problem of overfitting

  • Another benefit is that the new predictors are orthogonal (\(\mathbf{z}^T_j \mathbf{z}_k =0, j \neq k\)), which in the LS estimation we typically run into the problem of collinearity

  • Even though PCR uses a low-dimensional set of predictors, no model selection was performed because we only translate all \(p\) variables into a linear combination that makes the new PC inputs. Hence, it’s closely related to Ridge regression than Lasso regression or Best Subset Selection. Seeing Singular Value Decomposition

    • \(\mathbf{X}\hat{\beta}^R = \sum_{j=1}^p \mathbf{u}_j \frac{d_j^2}{d_j^2+ \lambda}\mathbf{u}_j^T \mathbf{y}\)

    • where \(\mathbf{u}_j\) corresponds to the j-th column of \(\mathbf{U}\) from the SVD

    • Since LS estimate gives \(\mathbf{X}\hat{\beta} = \mathbf{UU}^T \mathbf{y}\), ridge regression goes to the LS solution when \(\lambda = 0\)

    • The singular vectors that have smaller variation (i.e., smaller \(d_j^2\)) are shrunk more than those with larger variation

Tip

Standardize the predictors before running the SVD. If you do not, predictors measured on large numerical scales will dominate the variance and the principal components purely because of their units, regardless of their actual importance for \(Y\).

1.4.3.2 Partial Least Squares

PCR’s blind spot is that it chooses directions without ever consulting \(Y\). Partial least squares (PLS) fixes that by finding directions that are high-variance and correlated with the response. Like PCR it finds linear-combination directions in \(\mathbf{X}\) that maximize variance and are orthogonal, but unlike PCR (which is unsupervised, and so may choose directions that predict \(Y\) poorly), PLS is supervised: it selects directions to account for the covariation with \(Y\).

The directions \(Z_m = \sum_{j=1}^p\phi_{jm}X_j\) are built up one at a time:

  1. Standardize the \(p\) predictors

  2. For \(Z_1\), set each \(\phi_{j1}\) = the coefficient from the simple linear regression of Y on \(X_j\) (this is proportional to the correlation between Y and \(X_j\))

  3. Calculate \(Z_2\), by calculating the residuals of a regression of each variable (separately) on \(Z_1\) (call these \(e_j\)), which corresponds to the info remaining that has not been accounted for by \(Z_1\)

  4. Set \(\phi_{j2}\) = the coefficient from the simple linear regression of \(Y\) on each \(e_j\) to get \(Z_2\)

  5. Repeat \(M\) to get \(Z_1, \dots, Z_2\)

The difference between the two methods is sharpest in the optimization problem each one solves, from Hastie, Friedman, and Tibshirani (2001) (chapter 3.5). The \(m\)-th PCR direction \(\mathbf{v}_m\) is the \(\alpha\) that solves

\[ \max_\alpha[var(\mathbf{X\alpha})] \]

subject to \(||\alpha||=1, \alpha^T \mathbf{Sv}_l = 0, l = 1,\dots, m-1\) (the orthogonality condition ensures each \(\mathbf{z}_m = \mathbf{Xv}_m\) is uncorrelated with the previous \(\mathbf{z}_l\)), where \(\mathbf{S}\) is the sample covariance matrix of \(\mathbf{X}\). The \(m\)-th PLS direction \(\hat{\phi}_m\) instead maximizes a product of correlation with \(Y\) and input variance:

\[ \max_\alpha [corr^2(\mathbf{y}, \mathbf{X \alpha}) var(\mathbf{X \alpha})] \]

subject to \(||\alpha|| =1, \alpha^T \mathbf{S}\hat{\phi}_l =0,l = 1, \dots, m-1\). That extra \(corr^2(\mathbf{y}, \mathbf{X\alpha})\) factor is exactly the “supervision” that PCR lacks.

Note

In practice the variance term tends to dominate the correlation term, so PLS often behaves very similarly to ridge regression and PCR. The supervision can reduce bias, but it adds variance, and that extra variance can erode the advantage PLS is supposed to have over plain PCR. Do not expect dramatic gains.

A few closing points on PLS:

  • Both X and Y should be standardized before performing PLS

  • The choice of \(M\) is made based on Cross-validation

  • PLS is closely related to canonical correlation analysis (CCA) in multivariate statistic, which seeks to find a reduced set of linear combinations of inputs and linear combination of outputs that maximize the correlation (e.lg., finding the coefficient \(\alpha, \beta\) s.t

\[ \max_{\alpha, \beta} [corr(\mathbf{Y \beta}, \mathbf{X \alpha})] \]

where a set of \(M\) with orthogonality constraints

We close the chapter by stepping back to the high-dimensional setting that motivated all three families of methods. When \(p\) approaches or exceeds \(n\), models overfit easily: they are so flexible that they can look far better on the training data than they really are. The remedy is to use less flexible, regularized or constrained models, namely Best Subset Selection, Ridge regression, Lasso regression, Principal Component Regression, and Partial Least Squares. But even these come with caveats you must keep in mind:

  • Large \(p\) always has collinearity; Best Subset Selection or Lasso regression can’t guarantee the chosen model is a scientific one

  • p-values are not considered under high-dimensional regressions

  • Tuning parameter selection is utmost important for good predictive performance

  • The test error tends to increase as the dimensionality of the problem increases (Curse of dimensionality)

Key idea

The unifying message of this chapter is that flexibility is double-edged. The whole craft of statistical learning is choosing, and honestly validating, just enough flexibility to capture the signal without chasing the noise. Every method that follows in this book is one more tool for striking that balance.


  1. The two communities, computer science and statistics, grew up using different vocabulary for the same ideas. “Features” and “inputs” mean the same thing, as do “weights” and “coefficients,” and “supervised learning” and “regression/classification.” We will move freely between both vocabularies.↩︎