Advanced Data Analysis

Nguyen, Mike

81 Predictor Reduction

Modern datasets often arrive with far more predictors than we actually need. Some columns carry almost no information about the outcome, some duplicate what other columns already tell us, and some add noise that makes a model harder to fit and harder to trust. Predictor reduction is the practice of trimming this set down to a smaller, more useful collection of variables before (or while) we build a model.

It helps to keep the payoff in mind. A smaller predictor set usually trains faster, is cheaper to collect and maintain in production, and is far easier to explain to a colleague or a regulator. It can also improve accuracy, because removing irrelevant predictors reduces the chance that the model latches onto patterns that happen to appear in the training sample but do not hold in general.

Key idea

Predictor reduction is not just housekeeping. Fewer, better predictors can mean a model that is faster, cheaper, more interpretable, and sometimes more accurate.

In this chapter you will learn the two broad families of strategies for choosing which predictors to keep, how they differ in cost and in what they optimize, and the one statistical trap that lurks behind both of them. This sits alongside the other dimension-handling ideas in the book: where dimension reduction methods (Chapter 27) build new combined features, predictor reduction keeps the original predictors and simply decides which ones to drop.

Note

Predictor reduction (also called feature selection) keeps a subset of your original variables. This is different from dimension reduction methods (Chapter 27) such as principal components, which replace the originals with new constructed combinations. Selection preserves interpretability; dimension reduction trades it for compactness.

81.1 Two ways to reduce predictors

There are two main families of approaches, and they differ in one essential way: whether the predictive model itself is part of the selection loop. The first family puts the model inside the loop; the second keeps it outside.

81.1.1 Wrapper methods

Wrapper methods evaluate many candidate models, adding and removing predictors in search of the combination that maximizes model performance (Chapter 82). The name comes from the fact that the selection procedure “wraps around” an actual model: you fit a model, score it, change the predictor set, fit again, and repeat.

Intuition

A wrapper method treats your predictors as dials. It turns dials on and off, measures how well the resulting model does, and keeps searching for the setting that scores best. The model’s own performance is the compass.

The simplest wrappers are the classic stepwise search procedures: forward selection, which starts with no predictors and adds the most helpful one at each step; backward selection, which starts with all predictors and removes the least helpful one at each step; and stepwise selection, which combines the two so that a variable added earlier can later be dropped if it stops earning its place.
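The three stepwise procedures can be sketched with base R's `step()`. This is a minimal illustration rather than a recommended workflow: it assumes the built-in `mtcars` data as a stand-in, uses a linear model, and lets `step()` score candidates by in-sample AIC instead of a resampled performance measure.

```r
# Stepwise search with base R's step(); mtcars is only a stand-in dataset.
null_fit <- lm(mpg ~ 1, data = mtcars)   # model with no predictors
full_fit <- lm(mpg ~ ., data = mtcars)   # model with all predictors
search_scope <- list(lower = formula(null_fit), upper = formula(full_fit))

# Forward: start from the empty model and add one predictor at a time.
fwd <- step(null_fit, scope = search_scope, direction = "forward", trace = 0)

# Backward: start from the full model and drop one predictor at a time.
bwd <- step(full_fit, direction = "backward", trace = 0)

# Stepwise: start empty, but allow both additions and removals at each step.
both <- step(null_fit, scope = search_scope, direction = "both", trace = 0)

attr(terms(fwd), "term.labels")    # predictors kept by forward selection
attr(terms(bwd), "term.labels")    # predictors kept by backward selection
attr(terms(both), "term.labels")   # predictors kept by stepwise selection
```

The three searches need not agree, since each one follows a different greedy path through the space of subsets.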

We can generalize this idea with more powerful search algorithms when the space of possible predictor subsets is too large to search exhaustively and greedy stepwise moves are likely to get stuck. Techniques such as simulated annealing, genetic algorithms, and particle swarm optimization explore that space more broadly and can escape the local optima that trap simple stepwise search.¹ The trade-off is cost: they can find better subsets but typically require many more model fits to do so.
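To make the idea concrete, here is a rough sketch of simulated annealing over predictor subsets in base R. Everything in it is an assumption made for illustration: the simulated data, the use of AIC as the score, the starting temperature, the cooling rate, and the number of iterations. A real application would typically score subsets by resampled performance and tune the schedule.

```r
set.seed(42)
n <- 150; p <- 12
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 1.5 * X[, 1] - 2 * X[, 4] + X[, 7] + rnorm(n)   # only x1, x4, x7 matter
dat <- data.frame(y = y, X)

# Score a subset (logical mask over predictors) by AIC; lower is better.
score <- function(mask) {
  rhs <- if (any(mask)) paste(colnames(X)[mask], collapse = " + ") else "1"
  AIC(lm(as.formula(paste("y ~", rhs)), data = dat))
}

mask <- rep(FALSE, p)            # start with no predictors
cur <- score(mask)
best_mask <- mask; best <- cur
temp <- 10                       # illustrative starting temperature

for (iter in 1:300) {
  cand <- mask
  j <- sample(p, 1)
  cand[j] <- !cand[j]            # flip one predictor in or out
  s <- score(cand)
  # Always accept improvements; accept worse subsets with a probability
  # that shrinks as the temperature falls.
  if (s < cur || runif(1) < exp((cur - s) / temp)) {
    mask <- cand; cur <- s
    if (cur < best) { best <- cur; best_mask <- mask }
  }
  temp <- temp * 0.98            # geometric cooling
}

colnames(X)[best_mask]           # best subset visited during the search
```

Each iteration refits a model, so 300 iterations cost 300 fits. That is the expense the paragraph above refers to, and it grows quickly with slower models or longer schedules.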

Warning

Because a wrapper refits a model for every candidate subset it considers, it can be very expensive. With a slow model or a large number of predictors, even a basic stepwise search can run for a long time, and a genetic algorithm far longer.

81.1.2 Filter methods

Filter methods take the opposite stance. They judge the relevance of each predictor outside of the predictive model, using a quick statistical screen, and then build a model using only the predictors that pass some threshold. The model never participates in the selection itself.

A common version screens one variable at a time, scoring each predictor by how strongly it relates to the outcome on its own, for example through a correlation, a univariate $R^2$, or a simple test statistic. Predictors that clear the chosen cutoff are kept; the rest are filtered out before any modeling begins.
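A one-variable-at-a-time screen might look like the sketch below. The simulated data, the choice of absolute correlation as the score, and the cutoff of 0.3 are all assumptions chosen for illustration, not recommended defaults.

```r
set.seed(123)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)   # only x1 and x2 carry signal

# Score each predictor on its own: absolute correlation with the outcome.
scores <- abs(cor(X, y))[, 1]

cutoff <- 0.3                               # illustrative threshold
keep <- names(scores)[scores > cutoff]

# Only now does a model enter the picture, fit on the surviving predictors.
fit <- lm(y ~ ., data = data.frame(y = y, X[, keep, drop = FALSE]))

sort(scores, decreasing = TRUE)[1:5]        # strongest individual scores
keep                                        # predictors that passed the screen
```

Note that the screen itself never fits the final model; the only modeling cost is the single fit at the end.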

Intuition

A filter is like a bouncer at the door who checks each predictor individually before the modeling party starts. It is fast because it never has to fit the full model to make a decision.

This speed is the main attraction, and filters are usually much cheaper than wrappers. But the savings come with two caveats worth stating plainly. First, the screening criterion is not directly tied to how well the final model performs, so a predictor that looks weak on its own might have been valuable in combination with others, and vice versa. Second, the choice of criterion and cutoff is somewhat subjective, and different reasonable choices can lead to different kept sets.
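The first caveat is easy to reproduce. In the sketch below, which uses simulated data built for this purpose, the outcome depends only on the product of two predictors, so each predictor looks nearly useless to a univariate screen even though the pair explains almost all of the variance.

```r
set.seed(2024)
n <- 2000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- x1 * x2 + rnorm(n, sd = 0.1)   # signal lives entirely in the interaction

# Univariate screen: each predictor's absolute correlation with the outcome.
marginal <- c(x1 = abs(cor(x1, y)), x2 = abs(cor(x2, y)))

# Joint model that is allowed to use the interaction.
joint_r2 <- summary(lm(y ~ x1 * x2))$r.squared

round(marginal, 3)   # both close to zero
round(joint_r2, 3)   # close to one
```

A correlation filter with any sensible cutoff would discard both `x1` and `x2` here, which is exactly the kind of loss a model-aware wrapper can avoid.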

When to use this

Reach for a filter when you have a very large number of predictors and need a fast first pass, or when fitting the model is too expensive to put inside a search loop. Reach for a wrapper when accuracy matters more than compute and you want the selection to be guided by the model you actually intend to use.

81.2 The trap behind both approaches: selection bias

Whichever family you choose, the same danger applies, and it is worth ending on because it is so easy to fall into. If you use the same data both to select predictors and to estimate how well the resulting model performs, your performance estimate will be optimistically biased. The selection step has already peeked at the outcome, so it can pick predictors that look good on this particular sample by luck, and any later evaluation on that same sample inherits the flattery.

Warning

Predictor selection is part of model fitting. It must happen inside your resampling loop (cross-validation or a held-out validation set), not before it. Selecting features on the full dataset and only then cross-validating the model gives results that can look excellent and fail badly on new data.

The fix is to treat selection as just another modeling step that must never see the data used for evaluation: perform the screening or the search separately within each training fold, and reserve untouched data to judge the final result.²

Tip

A simple mental check: ask whether any step of your pipeline that looked at the outcome was repeated inside cross-validation. If predictor selection was done once, up front, on all the data, your reported accuracy is probably too good.

81.2.1 The trap, demonstrated

This trap is easy to dismiss as a technicality until you watch it manufacture accuracy from nothing. We build a dataset with no signal whatsoever — 100 samples, 2000 standard-normal predictors, and class labels assigned by a coin flip — so the only honest accuracy is 50%.

library(MASS)
set.seed(1)
n <- 100; p <- 2000
X <- matrix(rnorm(n * p), n, p)
y <- factor(sample(c("A", "B"), n, replace = TRUE))   # labels independent of X

Now compare two cross-validation protocols. Both keep only the ten predictors most correlated with the label; they differ only in whether that selection happens before the folds are cut or inside each fold:

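# Indices of the m predictors most correlated (in absolute value) with the label,
# computed using only the given rows.
pick <- function(rows, m = 10) {
  r <- abs(cor(X[rows, ], as.numeric(y[rows])))
  order(r, decreasing = TRUE)[1:m]
}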
cv_acc <- function(select_inside) {
  folds  <- sample(rep(1:5, length.out = n))
  global <- pick(1:n)                                  # selection on ALL data (peeks at test labels)
  mean(sapply(1:5, function(k) {
    tr <- which(folds != k); ts <- which(folds == k)
    f  <- if (select_inside) pick(tr) else global      # honest: re-select using only training rows
    fit <- lda(X[tr, f], y[tr])
    mean(predict(fit, X[ts, f, drop = FALSE])$class == y[ts])
  }))
}
set.seed(1)
c(select_outside_CV = mean(replicate(20, cv_acc(FALSE))),
  select_inside_CV  = mean(replicate(20, cv_acc(TRUE)))) |> round(3)
#> select_outside_CV  select_inside_CV 
#>             0.846             0.514

Selecting outside the loop reports ~85% accuracy on data that contains no signal at all; selecting inside the loop reports the truthful ~50%. The inflation is entirely an artifact of letting the held-out labels influence which predictors were chosen — with 2000 candidates, ten of them will correlate with the labels by chance, and once chosen they keep “working” on every fold. The fix is mechanical and non-negotiable: every outcome-dependent step, selection included, must live inside the resampling loop, which in practice means wrapping it in a pipeline (for example recipes + workflows in tidymodels, or caret’s rfe with selection nested in its folds) rather than running it once by hand.

To summarize, wrapper methods search predictor subsets using model performance as the guide and tend to be more powerful but more expensive, while filter methods screen predictors with a cheaper external criterion that is faster but less directly aligned with model quality. Both can sharpen a model, and both will mislead you about that model’s real performance unless selection is folded honestly into your validation procedure.

¹ These are general-purpose optimization heuristics borrowed from operations research and evolutionary computation. The shared idea is to treat “which predictors are in the model” as a configuration to optimize, and to search it without checking every possibility.

² This is sometimes called the “selection bias” or “feature-selection leakage” problem. A classic illustration: screen thousands of pure-noise predictors for the ones most correlated with a random outcome, and the surviving handful will look predictive on the training data even though nothing real is there.
