80  Class Imbalance

Imagine you are building a classifier to flag fraudulent credit card transactions. In a realistic dataset, perhaps 1 in 1000 transactions is fraudulent. A lazy model that simply predicts “not fraud” for every single transaction would be correct 99.9% of the time. Its accuracy looks spectacular, yet it is completely useless: it never catches a single fraud, which is the entire point of the exercise. This is the trap of class imbalance, the situation in classification problems where the relative frequencies of the classes are dramatically out of balance.

Imbalance shows up almost everywhere that the interesting event is rare: fraud detection, disease screening, equipment failure, customer churn, spam filtering, and detection of rare defects on a production line. In all of these, the class we care most about (the minority class) is the one we have the fewest examples of, and the one a naive model is most tempted to ignore.

Intuition

Most learning algorithms are built to minimize overall error. When 99.9% of the data belongs to one class, the algorithm can drive overall error very low by paying almost all of its attention to the majority class. The minority class, despite being the reason we built the model, gets drowned out.

The good news is that imbalance is a well-studied problem with a toolkit of remedies. This chapter walks through that toolkit. The strategies fall into a few natural families: changing how we tune and evaluate the model, changing how we decide a class from a predicted probability, changing the prior beliefs we feed the model, changing the weight each observation carries, changing the data we train on through resampling, and changing the loss function so that mistakes on the rare class are explicitly more expensive. By the end you will know what each remedy does, when to reach for it, and what it costs you.

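The remedies we will cover are:

  • model tuning (Section 80.1),

  • alternative cutoffs (Section 80.2),

  • adjusting prior probabilities (Section 80.3),

  • unequal case weights (Section 80.4),

  • sampling methods (Section 80.5), and

  • cost-sensitive training (Section 80.6).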

Note

Before applying any of these techniques, stop using plain accuracy as your scorecard. With heavy imbalance, accuracy is dominated by the majority class and hides poor minority performance. Track sensitivity (the proportion of true positive cases you correctly catch), specificity (the proportion of true negatives you correctly leave alone), precision, and summary measures such as the area under the ROC curve (see Chapter 82).1
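To make the note concrete, the sketch below computes these metrics from the four confusion-matrix counts and applies them to the "lazy" always-negative model from the opening example. The helper name `class_metrics` is hypothetical, introduced here only for illustration.

```r
# Hypothetical helper: metrics from predicted and true labels ("pos"/"neg").
class_metrics <- function(pred, truth, positive = "pos") {
  tp <- sum(pred == positive & truth == positive)
  fp <- sum(pred == positive & truth != positive)
  tn <- sum(pred != positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  c(accuracy    = (tp + tn) / (tp + fp + tn + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    # Precision is undefined when nothing is predicted positive.
    precision   = if (tp + fp > 0) tp / (tp + fp) else NA_real_)
}

# 1 fraud in 1000, as in the opening example; the lazy model always says "neg".
truth <- c(rep("neg", 999), "pos")
lazy  <- rep("neg", 1000)
m <- class_metrics(lazy, truth)
print(m)
```

Accuracy comes out at 0.999 while sensitivity is 0 and precision is undefined, which is exactly the gap that plain accuracy hides.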

80.1 Model Tuning

The first and least invasive idea is to keep the data and the model family fixed, and simply tune the model’s hyperparameters with the right objective in mind. Instead of tuning to maximize overall accuracy, we tune to maximize the accuracy of the minority class, that is, to maximize sensitivity (the proportion of true minority cases that are correctly identified) while doing as little damage as possible to specificity (the proportion of true majority cases that are correctly identified).

In practice this means that when you search over hyperparameters by cross-validation, you select the configuration that gives the best minority-class performance rather than the best overall accuracy. The model itself is unchanged in form; only the settings you land on differ.
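As a rough sketch of this idea, the code below tunes the number of neighbors in a k-nearest-neighbors classifier by 5-fold cross-validation and reports which value would be chosen under each criterion. The simulated data, the candidate grid, and the choice of kNN (via `knn()` from the `class` package) are assumptions made for illustration, not part of the chapter's own example.

```r
library(class)  # recommended package shipped with R; provides knn()

set.seed(42)
n <- 600
x <- matrix(rnorm(n * 2), n, 2)
# Simulated labels with a rare positive class (an assumption for illustration).
y <- factor(ifelse(runif(n) < plogis(-2.5 + 1.5 * x[, 1] + x[, 2]), "pos", "neg"),
            levels = c("neg", "pos"))

# Stratified fold assignment so every fold contains some positives.
folds <- integer(n)
for (lv in levels(y)) {
  idx <- which(y == lv)
  folds[idx] <- sample(rep(1:5, length.out = length(idx)))
}

# Cross-validated accuracy, sensitivity, and specificity for one value of k.
cv_scores <- function(k) {
  pred <- factor(rep(NA, n), levels = levels(y))
  for (f in 1:5) {
    test <- folds == f
    pred[test] <- knn(x[!test, ], x[test, ], y[!test], k = k)
  }
  c(k = k,
    accuracy    = mean(pred == y),
    sensitivity = mean(pred[y == "pos"] == "pos"),
    specificity = mean(pred[y == "neg"] == "neg"))
}

grid <- t(sapply(c(1, 3, 5, 9, 15), cv_scores))
best_by_accuracy    <- grid[which.max(grid[, "accuracy"]), "k"]
best_by_sensitivity <- grid[which.max(grid[, "sensitivity"]), "k"]
print(round(grid, 3))
```

The two criteria need not agree. When they differ, the table makes the price visible: the specificity column shows what is given up for the extra sensitivity.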

Warning

This is an ad hoc procedure. There is no single principled criterion that tells you the “right” trade-off between sensitivity and specificity for tuning, so the choice involves judgment about how much you value catching the rare class versus avoiding false alarms. Treat it as a useful first pass, not a final answer.

80.2 Alternative Cutoffs

Most classifiers do not actually output a class; they output a probability, and a class is assigned by comparing that probability to a cutoff. By default the cutoff is 0.5, but nothing forces that choice. When the positive class is rare, the model’s predicted probabilities for that class tend to stay below 0.5 even for genuine cases, so the default cutoff systematically misses them. The fix, in a two-category problem, is to lower the cutoff so that more cases get assigned to the minority class, raising sensitivity at the expense of specificity.

Key idea

Changing the cutoff does not change the fitted model at all. The probabilities are exactly the same. You are only changing the rule that turns a probability into a decision, which means you are choosing where to sit on the trade-off between different kinds of errors.

How do you pick the cutoff? The ROC (Receiver Operating Characteristic) curve is the natural tool, because it plots sensitivity against 1 − specificity (the false positive rate) across the entire range of possible cutoffs.2 Each point on the curve corresponds to one cutoff, so you can read off the threshold that gives the balance you want. A common heuristic is to choose the point closest to the “optimal model,” the top-left corner where sensitivity and specificity are both perfect.

There is a subtlety about where you measure this. The relationship between cutoff and performance is itself estimated from data, and with small samples the cutoff chosen on the training set is typically biased (it looks better in training than it will perform in practice).

Tip

Whenever possible, choose your cutoff using an independent dataset, or via cross-validation, rather than on the same data the model was fit to. Otherwise the apparent gain in minority-class accuracy may not survive on new data.
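One way to follow this advice is sketched below: fit on a training split, pick the cutoff closest to the top-left corner using a separate validation split, and report performance on a third split. The simulated data, the split sizes, and the grid of candidate cutoffs are assumptions for illustration.

```r
set.seed(7)
n  <- 4000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- as.integer(runif(n) < plogis(-3 + 1.5 * x1 + x2))   # rare positive class
d  <- data.frame(x1, x2, y)

# Three-way split: fit on train, pick the cutoff on valid, report on test.
part  <- sample(rep(c("train", "valid", "test"), times = c(2000, 1000, 1000)))
fit   <- glm(y ~ x1 + x2, family = binomial, data = d[part == "train", ])
valid <- d[part == "valid", ]
p_val <- predict(fit, valid, type = "response")

# Sensitivity and specificity on the validation set at each candidate cutoff.
cutoffs <- seq(0.01, 0.99, by = 0.01)
sens <- sapply(cutoffs, function(cut) mean(p_val[valid$y == 1] > cut))
spec <- sapply(cutoffs, function(cut) mean(p_val[valid$y == 0] <= cut))

# Distance from each ROC point to the ideal corner (sensitivity = specificity = 1).
dist_to_corner <- sqrt((1 - sens)^2 + (1 - spec)^2)
best_cut <- cutoffs[which.min(dist_to_corner)]

# Honest estimate of what the chosen cutoff delivers on untouched data.
test   <- d[part == "test", ]
p_test <- predict(fit, test, type = "response")
c(cutoff = best_cut,
  test_sensitivity = mean(p_test[test$y == 1] > best_cut),
  test_specificity = mean(p_test[test$y == 0] <= best_cut))
```

The fitted model is never touched after the first `glm()` call; only the decision rule changes.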

80.3 Adjusting Prior Probabilities

Several classifiers, naive Bayes (Chapter 18) and discriminant analysis (Chapter 20) among them, combine the evidence in the predictors with a prior probability for each class, the baseline belief about how common each class is before seeing the predictors. By default these priors are set to the class frequencies observed in the training data. Under heavy imbalance, that default prior is tiny for the minority class, which biases the model toward almost never predicting it.

The remedy is direct: override the priors by hand. Increase the prior probability assigned to the minority class and decrease it for the majority class, nudging the model to take the rare class more seriously.

Note

Like changing the cutoff, adjusting priors does not modify the underlying classification model; it only changes the prior probabilities that feed into the class decision. It is another lever on the same trade-off, expressed in the language of probability rather than thresholds.
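A minimal sketch with linear discriminant analysis, assuming simulated data and an equal-prior override chosen purely for illustration: `lda()` from the MASS package estimates priors from the class proportions by default, and `predict()` accepts a `prior` argument that replaces them at decision time without refitting.

```r
library(MASS)  # recommended package shipped with R; provides lda()

set.seed(11)
n  <- 2000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- factor(ifelse(runif(n) < plogis(-3 + 1.5 * x1 + x2), "pos", "neg"),
             levels = c("neg", "pos"))
d  <- data.frame(x1, x2, y)

# One fit; by default the priors are the observed class proportions.
fit <- lda(y ~ x1 + x2, data = d)

# Same fitted model, two sets of priors (given in factor-level order: neg, pos).
pred_default <- predict(fit, d)$class
pred_equal   <- predict(fit, d, prior = c(0.5, 0.5))$class

# In-sample sensitivity, used here only to show the direction of the effect.
sens <- function(pred) mean(pred[d$y == "pos"] == "pos")
c(default_prior = sens(pred_default), equal_prior = sens(pred_equal))
```

Raising the minority prior moves more cases into the "pos" class, so sensitivity rises and specificity falls, just as with a lowered cutoff.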

80.4 Unequal Case Weights

A case weight lets an individual observation carry more or less emphasis during the model training phase, so that the algorithm tries harder to get the heavily weighted observations right. Many modeling procedures accept case weights directly, including boosting (Chapter 11). The natural move for imbalance is to give the minority-class observations larger weights, so that errors on them count for more when the model is fit.

Intuition

Up-weighting a rare observation is conceptually like cloning it. In several models, an observation with a positive integer weight \(w\) behaves exactly as if the dataset contained \(w\) identical copies of it, same predictors, same label. Logistic regression is a clean example: duplicating a data point and weighting it by 2 give the same fitted coefficients.
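The equivalence is easy to check numerically. The sketch below, on simulated data assumed for illustration, fits a logistic regression once with weight 2 on every minority row and once with every minority row physically duplicated.

```r
set.seed(3)
n <- 500
x <- rnorm(n)
y <- as.integer(runif(n) < plogis(-2 + 1.2 * x))   # minority class is y == 1
d <- data.frame(x, y)

# Tight convergence settings so the two fits can be compared closely.
ctrl <- glm.control(epsilon = 1e-12, maxit = 100)

# (a) Give every minority row weight 2, every majority row weight 1.
w <- ifelse(d$y == 1, 2, 1)
fit_weighted <- glm(y ~ x, family = binomial, data = d, weights = w, control = ctrl)

# (b) Append one extra copy of every minority row instead.
d_dup <- rbind(d, d[d$y == 1, ])
fit_duplicated <- glm(y ~ x, family = binomial, data = d_dup, control = ctrl)

rbind(weighted = coef(fit_weighted), duplicated = coef(fit_duplicated))
```

The two rows of coefficients agree up to numerical tolerance, which is the sense in which weighting and duplication are interchangeable here.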

This connection between weighting and duplication is worth holding onto, because it is the bridge to the next family of methods. If up-weighting the minority class is like duplicating its observations, then perhaps we should just resample the data so that the classes are more balanced to begin with.

80.5 Sampling Methods

If we know in advance that the classes are imbalanced, we can blunt the effect of the imbalance on training by constructing a training set whose class frequencies are roughly equal. In the ideal case we would arrange this when we collect the data, deliberately gathering comparable numbers of each class. More often the data are already in hand and we have to rebalance after the fact, a post hoc approach.

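The three standard resampling strategies are:

  • down-sampling, which shrinks the majority class (Section 80.5.1),

  • up-sampling, which grows the minority class (Section 80.5.2), and

  • SMOTE, which synthesizes new minority cases and can be combined with down-sampling of the majority class (Section 80.5.3).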

When to use this

Resampling is attractive when you want a balanced training set without touching the model’s machinery, and when discarding or duplicating data is acceptable. It pairs naturally with algorithms that do not expose case weights or adjustable priors.

80.5.1 Down-sampling

Down-sampling throws away data from the majority class until it is roughly the same size as the minority class. The simplest version randomly samples the majority-class observations down to the target count. A more careful version draws bootstrap samples of the majority class, which has the bonus of letting you gauge how sensitive your results are to the particular sample drawn.3

Warning

Down-sampling discards potentially useful information from the majority class, which can hurt when data are scarce. Its advantage is speed and a smaller, balanced training set; its cost is the thrown-away examples.
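The bootstrap variant can be sketched as follows: keep every minority row, draw an equally sized bootstrap sample of majority rows, refit, and repeat to see how much the result moves from draw to draw. The simulated data, the 50 repetitions, and the use of test-set recall as the summary are assumptions for illustration.

```r
set.seed(5)
n  <- 3000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- as.integer(runif(n) < plogis(-3 + 1.5 * x1 + x2))
d  <- data.frame(x1, x2, y)
train <- d[1:2000, ]
test  <- d[2001:3000, ]

pos <- which(train$y == 1)
neg <- which(train$y == 0)

# One down-sampled fit: all positives plus a bootstrap sample of negatives
# of the same size, scored by recall on the untouched test set.
one_fit_recall <- function() {
  keep <- c(pos, sample(neg, length(pos), replace = TRUE))
  fit  <- glm(y ~ x1 + x2, family = binomial, data = train[keep, ])
  p    <- predict(fit, test, type = "response")
  mean(p[test$y == 1] > 0.5)
}

recalls <- replicate(50, one_fit_recall())
c(mean = mean(recalls), sd = sd(recalls), min = min(recalls), max = max(recalls))
```

The standard deviation across draws is a rough gauge of how much of the result is due to the particular majority rows that happened to be kept.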

80.5.2 Up-sampling

Up-sampling takes the opposite tack: rather than shrinking the majority class, it grows the minority class. The traditional approach samples the minority-class observations with replacement until each class has roughly the same number of cases. As noted above, this is closely related to the Unequal Case Weights approach, since repeated copies of an observation act like a larger weight on it.
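A minimal sketch of the traditional approach, on a toy data frame assumed for illustration: every original row is kept, and extra minority rows are drawn with replacement until the two classes are the same size.

```r
set.seed(9)
d <- data.frame(
  x = rnorm(1000),
  y = factor(sample(c("neg", "pos"), 1000, replace = TRUE, prob = c(0.95, 0.05)),
             levels = c("neg", "pos"))
)

pos <- which(d$y == "pos")
neg <- which(d$y == "neg")

# Draw minority rows with replacement to make up the difference in class size.
extra <- sample(pos, length(neg) - length(pos), replace = TRUE)
d_up  <- d[c(neg, pos, extra), ]

table(original = d$y)
table(upsampled = d_up$y)
```

Only the training set should be rebalanced this way; copies of the same row must not end up on both sides of a train/test split.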

Plain duplication is not the only option. We can instead generate new minority observations, for example by imputation methods that build new cases from “nearby” data points rather than copying existing ones. More recently, the data augmentation procedures common in deep learning create fresh minority samples by transforming existing ones; with images, for instance, we can rotate, translate, flip, or recolor a minority example to manufacture plausible new ones.

Tip

Synthetic or augmented examples avoid the overfitting risk of exact duplicates, because the model sees variety rather than the same point many times. The catch is that the synthetic points must be realistic; nonsensical augmentations teach the model nonsense.

80.5.3 SMOTE

SMOTE, the Synthetic Minority Over-Sampling Technique, is the best known way to generate new minority cases rather than merely copying them. According to Chawla et al. (2002), SMOTE is a data sampling procedure that combines Up-sampling of the minority class (using K-nearest neighbors to synthesize new points) with Down-sampling of the majority class (by random sampling), applying each depending on the class.

Intuition

To invent a new minority observation, SMOTE picks an existing minority point, finds one of its nearest minority neighbors, and places a synthetic point somewhere on the line segment between them. The new point is not a copy of any real observation but a plausible blend of two, which fills in the region the minority class occupies rather than just stacking duplicates.

SMOTE exposes three operational parameters that you set to control the rebalancing:

  • the amount of up-sampling (how many synthetic minority cases to create),

  • the amount of down-sampling (how much of the majority class to keep), and

  • the number of neighbors used to impute each new case.
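The synthesis step described above can be sketched in a few lines of base R. This covers only the up-sampling half of the procedure (the majority class would be down-sampled separately by random sampling), and `smote_sketch()` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper, for illustration only.
# X_min: numeric matrix of minority-class rows
# n_new: number of synthetic rows to create (the amount of up-sampling)
# k:     number of nearest minority neighbours to choose from
smote_sketch <- function(X_min, n_new, k = 5) {
  stopifnot(is.matrix(X_min), nrow(X_min) > k)
  D <- as.matrix(dist(X_min))   # pairwise Euclidean distances
  diag(D) <- Inf                # a point is not its own neighbour
  synth <- matrix(NA_real_, n_new, ncol(X_min),
                  dimnames = list(NULL, colnames(X_min)))
  for (i in seq_len(n_new)) {
    a   <- sample(nrow(X_min), 1)       # pick an existing minority point
    nn  <- order(D[a, ])[seq_len(k)]    # its k nearest minority neighbours
    b   <- nn[sample.int(k, 1)]         # pick one of those neighbours at random
    gap <- runif(1)                     # position along the segment from a to b
    synth[i, ] <- X_min[a, ] + gap * (X_min[b, ] - X_min[a, ])
  }
  synth
}

set.seed(2)
X_min <- cbind(x1 = rnorm(30, mean = 2), x2 = rnorm(30, mean = -1))
new_points <- smote_sketch(X_min, n_new = 60, k = 5)
dim(new_points)
```

Because each synthetic row is a convex combination of two real minority rows, it always falls inside the region the minority class already occupies. Euclidean distance assumes numeric predictors on comparable scales, so in practice the predictors would usually be standardized first.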

Warning

No resampling method is free of risk. Aggressive synthesis or down-sampling can inject bias and distort the class boundary, so the rebalanced data may no longer reflect reality. A conservative habit is to use these methods to tune the model rather than to train the final one, or at least to validate carefully on untouched, naturally distributed data.

80.6 Cost-Sensitive Training

All of the methods so far work around the learning algorithm’s objective. Cost-sensitive training confronts it head on. The cleanest and often best approach is to modify the loss function directly so that it accounts for the different costs of misclassifying different classes. If missing a fraud is far worse than a false alarm, the loss should say so explicitly, and the optimizer will then prefer models that protect the expensive class.

Key idea

Instead of tricking a cost-blind algorithm into respecting the rare class through resampling or reweighting, you tell the algorithm the truth: that some errors cost more than others. The model then optimizes the quantity you actually care about.

A closely related implementation is to add a misclassification cost term to the classification probabilities, which can be tuned to accommodate the different class frequencies and the different consequences of each error type. Either way, the costs become a transparent part of the model rather than a side adjustment.
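As a sketch of the first approach, the code below writes out a logistic-regression loss in which each error type carries its own cost, then minimizes it with `optim()`. The 10-to-1 cost ratio and the simulated data are assumptions chosen for illustration; in a real problem the costs come from the application.

```r
set.seed(21)
n <- 2000
x <- rnorm(n)
y <- as.integer(runif(n) < plogis(-3 + 1.5 * x))   # rare positive class

# Assumed costs: a missed positive is 10 times worse than a false alarm.
cost_fn <- 10
cost_fp <- 1

# Cost-weighted negative log-likelihood for logistic regression.
# Terms for true positives are scaled by c_fn, terms for true negatives by c_fp.
weighted_nll <- function(beta, c_fn, c_fp) {
  eta <- beta[1] + beta[2] * x
  -sum(c_fn * y * plogis(eta, log.p = TRUE) +
       c_fp * (1 - y) * plogis(-eta, log.p = TRUE))
}

fit_plain <- optim(c(0, 0), weighted_nll, c_fn = 1,       c_fp = 1,       method = "BFGS")
fit_cost  <- optim(c(0, 0), weighted_nll, c_fn = cost_fn, c_fp = cost_fp, method = "BFGS")

# In-sample recall at the default 0.5 cutoff, to show the direction of the effect.
recall_at_half <- function(beta) mean(plogis(beta[1] + beta[2] * x[y == 1]) > 0.5)
rbind(plain = c(fit_plain$par, recall = recall_at_half(fit_plain$par)),
      cost  = c(fit_cost$par,  recall = recall_at_half(fit_cost$par)))
```

With equal costs the loss reduces to ordinary logistic regression. With the larger cost on missed positives, the fitted intercept moves upward and more cases cross the 0.5 cutoff. For this particular loss the effect matches per-class case weights, which ties this section back to Section 80.4.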

80.7 Class Imbalance in R: A Worked Example

The chapter’s opening warning — that accuracy flatters a model on imbalanced data — is best felt as a number. Simulate a problem with about 11% positives:

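suppressPackageStartupMessages(library(pROC))  # provides auc()
set.seed(1)
n <- 3000
X <- matrix(rnorm(n*3), n, 3)                  # three standard-normal predictors
p <- plogis(-3.2 + 1.5*X[, 1] + X[, 2])        # true P(pos); the third predictor is pure noise
y <- factor(ifelse(runif(n) < p, "pos", "neg"), levels = c("neg", "pos"))
d <- data.frame(X, y)
i <- sample(n, 0.7*n)                          # 70/30 train/test split
tr <- d[i, ]
te <- d[-i, ]
prop.table(table(tr$y))                        # class proportions in the training set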
#> 
#>  neg  pos 
#> 0.89 0.11

A model that predicts “negative” for everyone already scores ~91% accuracy on the held-out test set while catching zero positives. Worse, a real logistic regression with genuine signal looks excellent on accuracy yet quietly misses most of the cases you actually care about:

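# Recall (sensitivity) on the test set: share of true positives that are flagged
recall <- function(pred) sum(pred == "pos" & te$y == "pos") / sum(te$y == "pos")
m  <- glm(y ~ ., binomial, tr)           # logistic regression on the imbalanced training set
ph <- predict(m, te, type = "response")  # predicted P(pos) for each test case
p05 <- factor(ifelse(ph > 0.5, "pos", "neg"), levels = c("neg", "pos"))  # default 0.5 cutoff
c(accuracy = mean(p05 == te$y), recall = recall(p05), AUC = as.numeric(auc(te$y, ph, quiet = TRUE)))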
#>  accuracy    recall       AUC 
#> 0.9211111 0.3058824 0.8906965

Accuracy near 0.92 hides a recall near 0.31 — two of every three positives are missed at the default 0.5 cutoff. The AUC (~0.89) is the honest summary because it is threshold-free and prevalence-insensitive. Now apply two of the chapter’s remedies and watch recall recover. First, down-sample the majority class so training sees a balanced problem:

pos <- which(tr$y == "pos")
neg <- sample(which(tr$y == "neg"), length(pos))  # keep as many negatives as there are positives
mb  <- glm(y ~ ., binomial, tr[c(pos, neg), ])    # refit on the balanced subset
pb  <- factor(ifelse(predict(mb, te, type = "response") > 0.5, "pos", "neg"), levels = c("neg", "pos"))
recall(pb)                                        # evaluated on the untouched test set
#> [1] 0.8117647

Second, leave the model alone and move the cutoff (this changes the decision, not the fit) — lowering the threshold trades precision for the recall the problem demands:

pt <- factor(ifelse(ph > 0.15, "pos", "neg"), levels = c("neg", "pos"))
c(recall = recall(pt), precision = sum(pt == "pos" & te$y == "pos") / sum(pt == "pos"))
#>    recall precision 
#> 0.7647059 0.3403141

Either intervention lifts recall from ~0.31 to ~0.8, at a deliberate cost in precision that you choose with the relative cost of the two error types in mind. The mechanics differ — one reshapes the data, the other reshapes the decision — but both encode the same refusal to let raw accuracy declare victory.

To close, it helps to see these remedies as different points of intervention along the same pipeline. You can change the data before training (sampling methods), change the emphasis during training (case weights, cost-sensitive loss), change the prior beliefs the model holds (adjusting priors), or change how you turn a fitted model into a decision (alternative cutoffs, with tuning watching over all of it). They are not mutually exclusive, and in difficult problems they are often combined. The constant across every one of them is the discipline introduced at the start of the chapter: judge the model on metrics that reward catching the rare class, never on raw accuracy alone.


  1. Sensitivity is also called recall or the true positive rate; specificity is the true negative rate. A confusion matrix breaks predictions into true positives, false positives, true negatives, and false negatives, and every metric above is just a ratio of those four counts.

  2. The ROC curve traces the true positive rate against the false positive rate as the cutoff sweeps from 0 to 1. A model that is no better than guessing lies on the diagonal; a perfect model hugs the top-left corner.

  3. Repeating the down-sampling with different bootstrap draws and refitting each time gives a spread of results, which is a rough measure of the uncertainty introduced by the sampling itself.