Advanced Data Analysis

Nguyen, Mike

16 Artificial Intelligence

Modern machine learning models, especially deep neural networks, are often remarkably accurate and almost completely opaque. A network with millions of weights can predict whether an image contains a tumor, yet give us no plain-language reason for its verdict. For low-stakes tasks that may be fine, but in medicine, lending, hiring, and criminal justice we usually need more: we want to know why a model made a decision, how confident it is, and which inputs drove the answer.

This chapter introduces the tools that turn black-box predictions into something a person can reason about. We focus on two questions that the rest of the chapter circles back to repeatedly. First, how sure is the model? That is the problem of uncertainty quantification. Second, why did the model produce this prediction for this input? That is the problem of explainability. Along the way we meet a few named methods you will see again and again in practice: mixture density networks, LIME, and Shapley values (including their popular implementation, SHAP).

Key idea

Accuracy is not the same as trustworthiness. A model you cannot question is hard to trust, debug, audit, or defend. The methods in this chapter exist to make powerful but opaque models accountable.

16.1 Explainable AI

The umbrella term for this effort is explainable AI, usually abbreviated XAI. It grew out of long-standing criticisms of deep learning. Two complaints stand out. The first is that standard deep networks give a single point prediction with no honest sense of how uncertain that prediction is. The second is that, unlike classical statistical models, they do not readily support inference: it is hard to say which inputs actually matter and by how much. XAI tries to address these gaps, and it folds in related goals such as fairness and transparency (the broader toolkit of model-agnostic interpretation is developed in Chapter 35).¹

The chapter is organized around the two themes those criticisms point to. We first sketch the main families of uncertainty quantification, then turn to explainability and study a few representative methods in depth.

16.1.1 Uncertainty Quantification

Uncertainty quantification (UQ) asks the model not only “what is your prediction?” but “how much should I trust it?” There are four broad strategies for getting a neural network to report uncertainty, each with a different flavor.

The first is variational, or Bayesian, inference, which treats the network weights as random variables and approximates a posterior distribution over them rather than settling on a single best set of weights. The second is Monte Carlo dropout, which reinterprets the familiar dropout trick (randomly switching off neurons during training) as a principled UQ procedure; it can be justified from the perspective of deep Gaussian processes (see the Gaussian process regression chapter, Chapter 7), so the same dropout you used for regularization, left switched on at prediction time and run over several forward passes, doubles as a way to sample predictions. The third is the mixture density network approach, in which the network outputs an entire predictive distribution (a mixture of Gaussians) instead of just a mean. The fourth is ensembling, where you train several models, or several realizations of the same model, and read off the spread of their predictions as a measure of uncertainty.

Intuition

All four methods replace a single guess with a spread of plausible answers. They differ mainly in where the randomness comes from: from the weights (Bayesian, dropout), from the output layer (mixture density), or from training many models (ensembles).
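To make the ensembling idea concrete, here is a minimal sketch under stated assumptions. It uses a straight-line regression refit on bootstrap resamples as a stand-in for "several realizations of the same model"; it is not a neural network, and the helper names `fit_line` and `ensemble_predict` are hypothetical. The point is only to show how the spread of the ensemble's predictions can be read as uncertainty.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def ensemble_predict(xs, ys, x_new, n_models=200, seed=0):
    """Fit n_models lines on bootstrap resamples; return mean and spread at x_new."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a + b * x_new)
    return statistics.fmean(preds), statistics.stdev(preds)

# Toy data (an assumption for illustration): y = 2x + noise, observed on [0, 1].
data_rng = random.Random(1)
xs = [i / 49 for i in range(50)]
ys = [2 * x + data_rng.gauss(0, 0.3) for x in xs]

mean_in, sd_in = ensemble_predict(xs, ys, x_new=0.5)    # inside the observed range
mean_out, sd_out = ensemble_predict(xs, ys, x_new=3.0)  # far outside it
print(f"x=0.5: {mean_in:.2f} +/- {sd_in:.2f}")
print(f"x=3.0: {mean_out:.2f} +/- {sd_out:.2f}")
```

The ensemble members agree closely where data are plentiful and disagree more when asked to extrapolate, which is the behavior one hopes an uncertainty estimate will show.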

We expand on the mixture density approach below because it is the most self-contained of the four and ties directly to material elsewhere in the book.

16.1.2 Explainability

Explainability asks the complementary question: given a prediction, which features were responsible, and how? Samek et al. (2021) group the main techniques into four families. We list them here and explain the two most widely used (LIME and Shapley values) in their own sections.

The four primary approaches are:

Interpretable local surrogates, which fit a simple, readable model that mimics the black box near one input. LIME is the canonical example.
Occlusion analysis, which measures how the prediction changes when you hide or perturb parts of the input. Shapley values, SHAP, Kernel SHAP, and meaningful perturbation all live here.
Integrated gradients, which attribute the prediction to inputs by accumulating gradients along a path from a baseline to the actual input; SmoothGrad is a refinement.
Layerwise relevance propagation, which traces a prediction backward through the network, redistributing “relevance” from the output to the input features.

Beyond these four, two design choices can build explainability into the model itself rather than bolting it on afterward. A self-explainable model produces explanations as part of its normal operation, for example through attention mechanisms (see Chapter 38) that reveal which inputs the model attended to. A specialized model encodes structure that is inherently interpretable, such as graph neural networks (Chapter 44) that respect a known relational structure in the data.

Warning

Explanations can be gamed. It is possible to adjust a model so that it makes the same predictions while producing explanations that better fit a story someone wants to tell. An explanation is evidence, not proof; treat a too-convenient explanation with suspicion, and prefer methods and audits you cannot quietly tune.

With the landscape in view, we now study three workhorse methods in detail.

16.1.3 Mixture Density Networks

Most neural networks for regression predict a single number: the conditional mean of the output given the inputs. That throws away information about how spread out or multi-modal the answer could be. A mixture density network keeps that information by predicting a whole distribution.

The starting point is a classical fact: any reasonable distribution can be approximated arbitrarily well by a mixture of Gaussians, given enough components. We can therefore write the conditional distribution of the output as

\[ p(y|\mathbf{x}) = \sum_{k=1}^K \alpha_k (\mathbf{x}) \, \phi(y; \mu_k(\mathbf{x}), \sigma_k (\mathbf{x})) \]

where $\alpha_k(\mathbf{x})$ are the mixture coefficients (nonnegative and summing to one), $\phi(y; \mu_k(\mathbf{x}), \sigma_k(\mathbf{x}))$ is the Gaussian density with mean $\mu_k(\mathbf{x})$ and standard deviation $\sigma_k(\mathbf{x})$, and all of these parameters depend on the inputs $\mathbf{x}$.

The trick that makes this a network is what the network is asked to output. Instead of training the neural network to estimate the conditional mean, we estimate all the mixture parameters $\{ \alpha_k(\mathbf{x}), \mu_k (\mathbf{x}), \sigma_k (\mathbf{x})\}$ (i.e., these parameters are the output of the NN). Given an input, the network emits the weights, means, and standard deviations of the mixture, and those define a full predictive distribution rather than a point estimate.
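A minimal sketch of what happens at the output layer, under assumptions: the network's raw outputs are mapped to valid mixture parameters with a softmax (for the weights) and an exponential (for the standard deviations), which is a common choice but not the only one. The raw numbers below are made up, and the function names are hypothetical. The last function is the per-observation negative log-likelihood, the usual training loss for this kind of model.

```python
import math

def mdn_params(raw_alpha, raw_mu, raw_sigma):
    """Map unconstrained network outputs to valid mixture parameters."""
    m = max(raw_alpha)                          # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in raw_alpha]
    total = sum(exps)
    alpha = [e / total for e in exps]           # softmax: positive and sums to 1
    sigma = [math.exp(s) for s in raw_sigma]    # exp: strictly positive
    return alpha, list(raw_mu), sigma

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(y, alpha, mu, sigma):
    """p(y | x) as a weighted sum of Gaussian densities."""
    return sum(a * normal_pdf(y, m, s) for a, m, s in zip(alpha, mu, sigma))

def negative_log_likelihood(y, alpha, mu, sigma):
    """Loss contribution of one observed y."""
    return -math.log(mixture_density(y, alpha, mu, sigma))

# Hypothetical raw outputs for a single input x, with K = 2 components.
alpha, mu, sigma = mdn_params(
    raw_alpha=[0.0, 0.0], raw_mu=[-2.0, 2.0], raw_sigma=[-0.7, -0.7]
)

for y in (-2.0, 0.0, 2.0):
    print(y, round(mixture_density(y, alpha, mu, sigma), 4))
```

With two well-separated components, the density is high near each mean and close to zero in between, so the conditional mean (here 0) would be a poor summary of where the output actually lies.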

This method can use mixtures of other distributions as well.²

When to use this

Reach for a mixture density network when the conditional distribution of the output is skewed, heavy-tailed, or multi-modal, so that a single mean (and even a single variance) would be misleading. The classic example is an inverse problem where one input can plausibly map to several distinct outputs.

16.1.4 Local Interpretable Model-Agnostic Explanations

Local Interpretable Model-Agnostic Explanations, denoted LIME (Ribeiro, Singh, and Guestrin 2016), explains one prediction at a time by fitting a simple model that is faithful to the black box near that one input. The name spells out the three commitments. Local means it explains individual predictions, building a small surrogate model that holds only in the neighborhood of the instance of interest. Interpretable means that surrogate is something a human can read, such as a short linear model. Model-agnostic means it works for any underlying model, because it never looks inside the black box; it only queries its predictions.

The motivation is practical. It is hard to read off, from a complicated model, which variable mattered for a particular output. LIME reframes the question as feature attribution: the credit assigned to each base feature (the inputs, not the model’s internal parameters) is interpreted as that feature’s importance to the prediction. By treating every model as a black box and asking only why it made each individual prediction, LIME lets a user decide whether to trust a given output, which is often exactly the decision that matters in deployment.

The engine behind this is the idea of a local surrogate model: a simple model that approximates the black box’s behavior in a small region. Rather than fit one global surrogate to the whole model (which would force the simple model to be unrealistically flexible), LIME fits many local surrogates, one per prediction it is asked to explain. Each surrogate need only be accurate in the neighborhood of its own instance, a requirement called local faithfulness.

The procedure for building one local surrogate is as follows:

Select a prediction from the black-box model that you want to explain.
Perturb the dataset and get the black-box predictions for these new points.
Weight the new samples based on their proximity to the point of interest (i.e., your chosen prediction).
Train a weighted, interpretable model (e.g., regression) on the perturbed dataset, using the black-box predictions as the response.
Explain the prediction by interpreting the local model.
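The steps above can be sketched in a few lines. This is an illustration under assumptions, not the lime package's implementation: the black box is a made-up function of two features, perturbations are Gaussian noise around the instance, proximity is an exponential kernel on Euclidean distance, and the surrogate is a weighted linear regression with no feature selection or complexity penalty. The names `black_box`, `solve`, and `lime_explain` are hypothetical.

```python
import math
import random

def black_box(x1, x2):
    """Stand-in for an opaque model; the explainer only ever calls it."""
    return math.sin(x1) + x2 ** 2

def solve(A, b):
    """Solve the small linear system A z = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lime_explain(f, x0, n_samples=2000, noise_sd=0.5, width=0.3, seed=0):
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Step 2: perturb around the instance and query the black box.
        z = [xi + rng.gauss(0, noise_sd) for xi in x0]
        targets.append(f(*z))
        # Step 3: weight each sample by its proximity to the instance.
        d2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x0))
        weights.append(math.exp(-d2 / width ** 2))
        # Design row: intercept plus features centered at the instance.
        rows.append([1.0] + [zi - xi for zi, xi in zip(z, x0)])
    # Step 4: weighted least squares via the normal equations (X'WX) beta = X'Wy.
    p = len(rows[0])
    A = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(p)]
         for i in range(p)]
    b = [sum(w * r[i] * y for r, y, w in zip(rows, targets, weights))
         for i in range(p)]
    return solve(A, b)

x0 = [0.0, 1.0]  # Step 1: the instance whose prediction we want to explain
intercept, slope_x1, slope_x2 = lime_explain(black_box, x0)
# Step 5: read the local slopes as feature attributions.
print(f"intercept={intercept:.2f}, x1: {slope_x1:.2f}, x2: {slope_x2:.2f}")
```

With a narrow kernel the fitted slopes should land close to the black box's local gradient at the instance, here roughly cos(0) = 1 for the first feature and 2 · 1 = 2 for the second.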

Intuition

To learn how a complicated surface behaves at one point, zoom in until it looks roughly flat, then fit a straight line there. LIME does exactly this: it jiggles the input, watches how the black box responds nearby, and fits a simple, readable model to that local response.

Formally, LIME produces an explanation for a given instance of $(\mathbf{y,x})$, where the goal is to find which of the $\mathbf{x}$ features are most responsible for a given prediction $\mathbf{\hat{y}}$. It does so by solving

\[ \text{explanation}(\mathbf{x}) = \underset{g \in G}{\operatorname{argmin}} \; L (f, g, \pi_x) + \Omega(g) \]

where the pieces are:

$f$: the original black-box model.
$g$: the explanation model for the instance.
$G$: the family of possible explanation models (e.g., linear regression models).
$\pi_x$: a proximity measure that determines how large the neighborhood is around $\mathbf{x}$.
$\Omega(g)$: model complexity, which we want to be small so the explanation uses few features.

In words, the objective trades off two things: $L(f, g, \pi_x)$ is a loss that is small when the surrogate $g$ matches the black box $f$ where it counts (close to $\mathbf{x}$, as judged by $\pi_x$), while $\Omega(g)$ penalizes complicated surrogates so the explanation stays short and readable. Minimizing their sum picks the simplest surrogate that is still locally faithful.
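A common concrete choice for the fidelity term, and to our understanding the one used in the original LIME paper (Ribeiro, Singh, and Guestrin 2016), is a locally weighted squared error over the perturbed samples. The set $\mathcal{Z}$ of perturbed points is notation introduced here for illustration, and the expression glosses over the paper's distinction between raw and interpretable feature representations:

\[ L(f, g, \pi_x) = \sum_{z \in \mathcal{Z}} \pi_x(z) \, \bigl( f(z) - g(z) \bigr)^2 \]

Smaller values mean the surrogate tracks the black box more closely where $\pi_x$ puts weight, that is, near $\mathbf{x}$.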

The neighborhood itself is set by the proximity measure. Here $\pi_x$ is an exponential smoothing kernel, and its width is the key dial. A small kernel width means an observation must sit very close to $\mathbf{x}$ before it influences the local model, giving a tightly local explanation. A large kernel width lets observations quite different from $\mathbf{x}$ pull on the local model, giving a broader, less local explanation. Kernel widths are unavoidably subjective, so in practice you experiment with several and keep the ones that yield meaningful explanations.
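A short sketch of how the width acts as a dial, assuming the kernel takes the form $\exp(-d^2 / \text{width}^2)$ for a distance $d$ from the instance (implementations differ in the exact form and in the distance they use; the function name `proximity` is hypothetical):

```python
import math

def proximity(distance, width):
    """Exponential kernel: weight 1 at distance 0, decaying as distance grows."""
    return math.exp(-(distance ** 2) / (width ** 2))

distances = [0.0, 0.5, 1.0, 2.0]
for width in (0.25, 1.0, 4.0):
    weights = [round(proximity(d, width), 3) for d in distances]
    print(f"width={width}: {weights}")
```

At a small width only points almost on top of the instance receive meaningful weight; at a large width even distant points count nearly as much as the instance itself, so the surrogate drifts toward a global fit.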

Warning

Because the kernel width is a free choice and the explanation can shift with it, two analysts can produce different “explanations” of the same prediction. Report the width you used and check that the explanation is stable across a reasonable range.

A few practical notes round this out. LIME is straightforward to use in R through the lime or iml package.³ The difficult part is not running it but evaluating how well the local surrogate actually estimates the complex model in that neighborhood, since a poor local fit quietly undermines the interpretation. A heat map is a convenient way to display how all the variables influence the predictions at once.

For further reading on LIME, see:

https://www.oreilly.com/content/introduction-to-local-interpretable-model-agnostic-explanations-lime/
https://homes.cs.washington.edu/~marcotcr/blog/lime/
Ribeiro, Singh, and Guestrin (2016)

16.1.5 Shapley Values

Where LIME fits a local model, Shapley values take a game-theoretic route to the same attribution problem. They give a principled way to explain the output of any machine learning model, and they migrated into statistics from economics. They also combine naturally with other approaches, including LIME, as we will see when we reach Kernel SHAP. The goal is the same as LIME's: distribute a model’s prediction score for a specific input among its base features.

The idea comes from cooperative game theory. Shapley values were introduced in 1951 by Lloyd Shapley as a way to value each player’s contribution in a cooperative game, work that contributed to his Nobel Prize in economics. The recipe, in its original setting, is to determine the value of players individually and in subsets, compute each player’s marginal value across all possible orderings of the players, and then average those marginal values to get the player’s Shapley value.

Intuition

Imagine team members joining a project one at a time, in every possible order. Each time someone joins, note how much the team’s output goes up; that bump is their marginal contribution for that ordering. A player’s Shapley value is the average of these bumps over all orderings. Swap “player” for “feature” and “team output” for “prediction” and you have feature attribution.

To make this precise, consider a game with $N$ players. The characteristic (value) function $\nu$ maps each subset (coalition) $S \subseteq \{1, \dots, N\}$ to a real value $\nu(S)$, the value to be gained from that coalition of players cooperating. The marginal contribution of player $i$ to a coalition $S$ that does not already contain $i$ is

\[ \Delta_\nu (i,S) = \nu (S \cup \{i\}) - \nu(S) \]

that is, how much the coalition’s value rises when $i$ joins it. The Shapley value (player $i$’s contribution to the full “grand coalition”) is the suitably weighted average of these marginal contributions:

\[ \phi_i (\nu) = \frac{1}{N} \sum_{S \subseteq \{1, \dots, N\} \setminus \{i\}} \binom{N-1}{|S|}^{-1} \Delta_\nu (i,S) \]

which reads, in words, as

\[ \phi_i (\nu) = \frac{1}{\text{number of players}} \sum_{\text{coalitions excluding } i} \frac{\text{marginal contribution of } i \text{ to coalition}}{\text{number of coalitions excluding } i \text{ of this size}} \]

Here $N$ is the number of players, and the sum runs over every coalition $S$ that does not contain player $i$.
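The weighted average can be computed directly for a small game. The sketch below enumerates every coalition that excludes a player, divides each marginal contribution by the number of such coalitions of that size, and averages over the number of players. The three-player game and its payoffs are invented for illustration, and `shapley_values` is a hypothetical helper name.

```python
from itertools import combinations
from math import comb

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):                       # coalition sizes 0 .. n-1
            for subset in combinations(others, size):
                s = frozenset(subset)
                marginal = value(s | {i}) - value(s)
                total += marginal / comb(n - 1, size)
        phi[i] = total / n
    return phi

# Hypothetical game: the worth of every coalition of players A, B, C.
worth = {
    frozenset(): 0,
    frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 30,
    frozenset("AB"): 40, frozenset("AC"): 50, frozenset("BC"): 60,
    frozenset("ABC"): 90,
}

phi = shapley_values(["A", "B", "C"], worth.__getitem__)
print(phi)
```

For this game the values work out to 20 for A, 30 for B, and 40 for C, which add up to the grand coalition's worth of 90. The enumeration visits every subset, so the cost doubles with each additional player; this is why exact computation is only practical for a handful of players.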

Part of why Shapley values are so attractive is that they are the unique attribution satisfying a short list of fairness-like axioms. Those properties are:

Uniqueness: the Shapley value is the only attribution that satisfies the remaining properties together.
Efficiency: the players' Shapley values sum to the value of the grand coalition, $\sum_{i=1}^{N} \phi_i(\nu) = \nu(\{1, \dots, N\})$, taking the empty coalition to be worth zero.
Symmetry: if two players have equal marginal contributions, they receive equal Shapley values.
Dummy: if a player’s marginal contribution is always 0, its Shapley value is 0.
Additivity: for a single player $i$ with two value functions $\nu$ and $w$,

\[ \phi_i(\nu) + \phi_i (w) = \phi_i (\nu + w) \]

These axioms are what make the attributions line up with human intuitions about fair credit, which is why Shapley values are useful for keeping feature attribution in step with human understanding.

Key idea

Shapley values are the unique feature attribution that is fair in a precise, axiomatic sense (efficiency, symmetry, dummy, additivity). That uniqueness is the main reason they have become a default explanation tool.
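The properties can be checked numerically on a toy game. This sketch computes Shapley values the other way around, by averaging each player's marginal contribution over every join order, and then tests symmetry, the dummy property, additivity, and the fact that the values sum to the worth of the full coalition. Both games and all names (`shapley_by_orderings`, `v`, `w`) are hypothetical.

```python
from itertools import permutations

def shapley_by_orderings(players, value):
    """Average each player's marginal contribution over all join orders."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            phi[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: total / len(orders) for p, total in phi.items()}

players = ["A", "B", "D"]

def v(coalition):
    """Hypothetical game: A and B are interchangeable, D never adds anything."""
    k = len(coalition & {"A", "B"})
    return 10.0 * k * k

def w(coalition):
    """A second hypothetical game on the same players."""
    return 3.0 * len(coalition)

phi_v = shapley_by_orderings(players, v)
phi_w = shapley_by_orderings(players, w)
phi_sum = shapley_by_orderings(players, lambda s: v(s) + w(s))

print("symmetry:  ", phi_v["A"], phi_v["B"])
print("dummy:     ", phi_v["D"])
print("additivity:", {p: (phi_v[p] + phi_w[p], phi_sum[p]) for p in players})
print("total:     ", sum(phi_v.values()), v(frozenset(players)))
```

Enumerating orderings is even more expensive than enumerating subsets, so this form is for building intuition rather than for real feature sets.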

To use this for predictive modeling, treat a single prediction as the “game”: the features $x_1, \dots, x_p$ are the players, and the value function $\nu_f$ is defined with respect to the model $f$. Then the Shapley value for feature $i$ is the influence of that feature on the outcome. Everything hinges on how one defines the value function $\nu_f$, because that is what turns an abstract game into a concrete statement about a model.

Researchers have proposed two broad families of value functions. The conditional expectation variants ask what the model predicts on average given the known features, and include Shapley regression values (Lipovetsky and Conklin 2001b; Lipovetsky and Conklin 2001a), Shapley sampling values (Štrumbelj and Kononenko 2013), and SHAP (Lundberg and Lee 2017). The intervention variants instead actively set features and measure the effect, and include quantitative input influence (Datta, Sen, and Zick 2016) and the Formulate, Approximate, Explain framework (Merrick and Taly 2020).

16.1.5.1 SHAP (Shapley Additive Explanations)

SHapley Additive exPlanation (SHAP) values are the most popular conditional-expectation variant. They define the value function as

\[ \begin{aligned} \nu_{f,x}(S) = f(x_S) &= E[f(x) \mid x_S] = E_{x_{\bar{S}} \mid x_S}[f(x)] \\ &\approx E_{x_{\bar{S}}}[f(x)] && \text{(assume feature independence)} \\ &\approx f([x_S, E[x_{\bar{S}}]]) && \text{(assume model linearity)} \end{aligned} \]

where $x_S$ has missing values for the features not in set $S$. Read top to bottom, the value of a coalition is the model’s expected output given the known features; the two approximations below it are the practical shortcuts (assume features are independent, then assume the model is locally linear) that make the expectation computable.
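As an illustration of the last line of that chain, the sketch below takes the value of a coalition to be the model evaluated with the known features at their observed values and every other feature replaced by its mean, then computes exact Shapley values by enumeration. The linear model, its weights, and the feature means are made up, and `shap_values_mean_baseline` is a hypothetical name, not a function from any SHAP library.

```python
from itertools import combinations
from math import comb

def shap_values_mean_baseline(f, x, baseline):
    """Exact Shapley values with nu(S) = f(x on S, baseline elsewhere)."""
    p = len(x)

    def nu(S):
        filled = [x[j] if j in S else baseline[j] for j in range(p)]
        return f(filled)

    phi = []
    for i in range(p):
        others = [j for j in range(p) if j != i]
        total = 0.0
        for size in range(p):
            for subset in combinations(others, size):
                S = set(subset)
                total += (nu(S | {i}) - nu(S)) / comb(p - 1, size)
        phi.append(total / p)
    return phi

# Hypothetical linear model and feature means (stand-ins for a fitted model
# and training-set averages).
weights = [2.0, -1.0, 0.5]

def model(features):
    return 4.0 + sum(w * v for w, v in zip(weights, features))

feature_means = [1.0, 3.0, 10.0]
x = [2.0, 1.0, 10.0]

phi = shap_values_mean_baseline(model, x, feature_means)
print(phi)
print(sum(phi), model(x) - model(feature_means))
```

For a linear model each attribution reduces to the coefficient times the feature's deviation from its mean, so here the values are 2, 2, and 0, and they sum to the gap between this prediction and the prediction at the feature means.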

Exact computation is challenging because it sums over all feature subsets, but there are many model-agnostic and model-type-specific approximation methods, with varying assumptions such as the feature independence and model linearity shown in the equation above.

The model-agnostic approximations work for any model. If we assume feature independence, we can use the sampling approximations of Štrumbelj and Kononenko (2013) and Datta, Sen, and Zick (2016), which are reasonable to compute only for a small number of inputs. Kernel SHAP, which is essentially LIME combined with Shapley values, reformulates the problem as a weighted linear regression and jointly estimates all the SHAP values at once.⁴
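A minimal Monte Carlo sketch in the spirit of the sampling approximations, under the feature-independence assumption: each draw picks a random feature ordering and a random background row, switches features from the background values to the instance's values one at a time, and credits each feature with the change it causes. This is an illustration, not a reproduction of either cited algorithm; the model, the background data, and the name `sampled_shapley` are all hypothetical.

```python
import random

def sampled_shapley(f, x, background, n_draws=2000, seed=0):
    """Estimate Shapley values by averaging marginals over random orderings."""
    rng = random.Random(seed)
    p = len(x)
    phi = [0.0] * p
    for _ in range(n_draws):
        order = list(range(p))
        rng.shuffle(order)                # a random join order for the features
        current = list(rng.choice(background))  # "missing" features come from here
        prev = f(current)
        for j in order:
            current[j] = x[j]             # feature j joins the coalition
            new = f(current)
            phi[j] += new - prev          # its marginal contribution in this draw
            prev = new
    return [total / n_draws for total in phi]

def model(v):
    """Hypothetical black box with an interaction between features 0 and 1."""
    return v[0] * v[1] + 2.0 * v[2]

data_rng = random.Random(1)
background = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
x = [1.0, 2.0, 3.0]

phi = sampled_shapley(model, x, background)
baseline = sum(model(row) for row in background) / len(background)
print([round(v, 2) for v in phi])
print(round(sum(phi), 2), round(model(x) - baseline, 2))
```

Within each draw the contributions telescope, so the estimates sum to the prediction minus the average prediction over the sampled background rows. The price is sampling noise, which shrinks only as the number of draws grows.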

The model-type-specific approximations exploit known structure for big speedups. They include Linear SHAP, Low-order SHAP, Max SHAP, Deep SHAP (DeepLift combined with Shapley values), and Tree SHAP (Lundberg, Erion, and Lee 2018), the last of which makes exact SHAP values tractable for tree ensembles.

For implementations, you have several choices depending on your language and model type:

Python: shap, developed by Lundberg, provides Kernel SHAP plus model-type-specific SHAP algorithms, with support for TensorFlow/Keras/PyTorch models.
R: shapr provides Kernel SHAP plus the Aas, Jullum, and Løland (2021) method for accounting for feature dependence, with support for gam, lm, glm, xgb.Booster, and ranger; shapper is a port of the Python shap library.
Others: fastshap, iml, shap-value, DALEX, and shapleyR.

Tip

For tree ensembles (random forests, gradient boosting), prefer Tree SHAP. It computes exact Shapley values quickly by using the tree structure, instead of the slow sampling needed in the model-agnostic case.

Shapley values are powerful but not above criticism, and the critiques fall into two groups. The mathematical issues concern what the numbers really mean. A feature with no direct effect on the response can still be assigned influence if it is predictive of other features (indirect influence). It is unclear how statistically related features should be handled, for instance whether correlated features should be treated as a single “player.” And intervention-based methods can rely on out-of-distribution sampling, extrapolating into regions of feature space where no real data live.
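The indirect-influence issue is easy to reproduce on a toy example. In the sketch below, which is entirely hypothetical, the second feature is an exact copy of the first and the model ignores it. A conditional-expectation value function still gives the unused feature half the credit, while an interventional value function, which breaks the dependence by drawing the other feature from its marginal distribution, gives it none.

```python
from statistics import fmean

data = [(0, 0), (1, 1)] * 50   # x2 is an exact copy of x1

def model(x1, x2):
    return float(x1)            # the model never uses x2

def conditional_value(S, x):
    """nu(S) = E[f(X) | X_S = x_S]: average over rows that match x on S."""
    rows = [r for r in data if all(r[j] == x[j] for j in S)]
    return fmean([model(*r) for r in rows])

def interventional_value(S, x):
    """nu(S): fix x_S and draw the remaining features from the data marginally."""
    return fmean([model(*[x[j] if j in S else r[j] for j in range(2)])
                  for r in data])

def shapley_two_features(nu, x):
    """Two-player Shapley values: average over the two possible join orders."""
    empty, full = nu(set(), x), nu({0, 1}, x)
    only0, only1 = nu({0}, x), nu({1}, x)
    phi0 = 0.5 * ((only0 - empty) + (full - only1))
    phi1 = 0.5 * ((only1 - empty) + (full - only0))
    return phi0, phi1

x = (1, 1)
cond = shapley_two_features(conditional_value, x)
intv = shapley_two_features(interventional_value, x)
print("conditional:   ", cond)
print("interventional:", intv)
```

The conditional version returns 0.25 for each feature and the interventional version returns 0.5 and 0. Neither is wrong as arithmetic; they answer different questions, which is exactly why the choice of value function has to be reported alongside the attributions.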

The human-interpretability issues concern how people read the numbers. A Shapley value answers the question “Why $f(x)$ instead of $E[f(x)]$?”, yet we may never actually observe data with the baseline outcome $E[f(x)]$, so the comparison can feel abstract. Shapley values also do not tell you how to change an outcome (they are descriptive, not prescriptive), and they are easy to misinterpret as causal when they are not.

Warning

A SHAP value is not a causal effect and not a recipe for action. It explains why this prediction differs from the average prediction, under specific assumptions about feature dependence. When features are correlated, read the attributions with extra care.

For further reading on Shapley values and their use in machine learning, see Datta, Sen, and Zick (2016), Kumar et al. (2020), Lipovetsky and Conklin (2001a), Lundberg, Erion, and Lee (2018), Lundberg and Lee (2017), Merrick and Taly (2020), Štrumbelj and Kononenko (2013), and Sundararajan and Najmi (2020).

Aas, Kjersti, Martin Jullum, and Anders Løland. 2021. “Explaining Individual Predictions When Features Are Dependent: More Accurate Approximations to Shapley Values.” Artificial Intelligence 298 (September): 103502. https://doi.org/10.1016/j.artint.2021.103502.

Abdar, Moloud, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, et al. 2021. “A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges.” Information Fusion 76 (December): 243–97. https://doi.org/10.1016/j.inffus.2021.05.008.

Datta, Anupam, Shayak Sen, and Yair Zick. 2016. “Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems.” 2016 IEEE Symposium on Security and Privacy (SP), May. https://doi.org/10.1109/sp.2016.42.

Kumar, I Elizabeth, Suresh Venkatasubramanian, Carlos Scheidegger, and Sorelle Friedler. 2020. “Problems with Shapley-Value-Based Explanations as Feature Importance Measures.” In International Conference on Machine Learning, 5491–5500. PMLR.

Lipovetsky, Stan, and Michael Conklin. 2001a. “Analysis of Regression in Game Theory Approach.” Applied Stochastic Models in Business and Industry 17 (4): 319–30. https://doi.org/10.1002/asmb.446.

Lipovetsky, Stan, and W. Michael Conklin. 2001b. “Multiobjective Regression Modifications for Collinearity.” Computers and Operations Research 28 (13): 1333–45. https://doi.org/10.1016/s0305-0548(00)00043-5.

Lundberg, Scott M, Gabriel G Erion, and Su-In Lee. 2018. “Consistent Individualized Feature Attribution for Tree Ensembles.” arXiv Preprint arXiv:1802.03888.

Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 4768–77.

Merrick, Luke, and Ankur Taly. 2020. “The Explanation Game: Explaining Machine Learning Models Using Shapley Values.” In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, 17–38. Springer.

Ribeiro, Marco, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. https://doi.org/10.18653/v1/n16-3020.

Samek, Wojciech, Grégoire Montavon, Sebastian Lapuschkin, Christopher J. Anders, and Klaus-Robert Müller. 2021. “Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications.” Proceedings of the IEEE 109 (3): 247–78. https://doi.org/10.1109/jproc.2021.3060483.

Štrumbelj, Erik, and Igor Kononenko. 2013. “Explaining Prediction Models and Individual Predictions with Feature Contributions.” Knowledge and Information Systems 41 (3): 647–65. https://doi.org/10.1007/s10115-013-0679-x.

Sundararajan, Mukund, and Amir Najmi. 2020. “The Many Shapley Values for Model Explanation.” In International Conference on Machine Learning, 9269–78. PMLR.

¹ For broad surveys of the field, see Samek et al. (2021) (explainability) and Abdar et al. (2021) (uncertainty quantification). They are good entry points if you want a map of the whole landscape before diving into any single method.
² The mixture-of-Gaussians presentation follows Bishop’s Pattern Recognition and Machine Learning (2006), where mixture density networks are developed in detail. The Gaussian case is the usual default, but nothing in the construction requires it.
³ If you have a fitted model and want to try this, lime follows the steps above: build an explainer from the training data and model, then call explain() on the rows you want to interpret. The harder, less automatable part is judging quality, not running the code.
⁴ This is the bridge promised earlier: Kernel SHAP borrows LIME’s local-surrogate machinery but chooses the weighting so that the resulting coefficients are Shapley values, inheriting their fairness axioms.

# Artificial Intelligence {#sec-ai} ```{r} #| include: false source("_common.R") ``` Modern machine learning models, especially deep neural networks, are often remarkably accurate and almost completely opaque. A network with millions of weights can predict whether an image contains a tumor, yet give us no plain-language reason for its verdict. For low-stakes tasks that may be fine, but in medicine, lending, hiring, and criminal justice we usually need more: we want to know *why* a model made a decision, *how confident* it is, and *which inputs* drove the answer. This chapter introduces the tools that turn black-box predictions into something a person can reason about. We focus on two questions that the rest of the chapter circles back to repeatedly. First, how sure is the model? That is the problem of uncertainty quantification. Second, why did the model produce *this* prediction for *this* input? That is the problem of explainability. Along the way we meet a few named methods you will see again and again in practice: mixture density networks, LIME, and Shapley values (including their popular implementation, SHAP). ::: {.callout-important title="Key idea"} Accuracy is not the same as trustworthiness. A model you cannot question is hard to trust, debug, audit, or defend. The methods in this chapter exist to make powerful but opaque models accountable. ::: ## Explainable AI The umbrella term for this effort is explainable AI, usually abbreviated XAI. It grew out of long-standing criticisms of deep learning. Two complaints stand out. The first is that standard deep networks give a single point prediction with no honest sense of how uncertain that prediction is. The second is that, unlike classical statistical models, they do not readily support inference: it is hard to say which inputs actually matter and by how much. 
XAI tries to address these gaps, and it folds in related goals such as fairness and transparency (the broader toolkit of model-agnostic interpretation is developed in @sec-interpretable-machine-learning).^[For broad surveys of the field, see @samek2021a (explainability) and @abdar2021 (uncertainty quantification). They are good entry points if you want a map of the whole landscape before diving into any single method.] The chapter is organized around the two themes those criticisms point to. We first sketch the main families of uncertainty quantification, then turn to explainability and study a few representative methods in depth. ### Uncertainty Quantification Uncertainty quantification (UQ) asks the model not only "what is your prediction?" but "how much should I trust it?" There are four broad strategies for getting a neural network to report uncertainty, each with a different flavor. The first is variational, or Bayesian, inference, which treats the network weights as random variables and approximates a posterior distribution over them rather than settling on a single best set of weights. The second is Monte Carlo dropout, which reinterprets the familiar dropout trick (randomly switching off neurons during training) as a principled UQ procedure; it can be justified from the perspective of deep Gaussian processes (see the Gaussian process regression chapter, @sec-gaussian-process-reg), so the same dropout you used for regularization doubles as a way to sample predictions. The third is the [Mixture Density Networks] approach, in which the network outputs an entire predictive distribution (a mixture of Gaussians) instead of just a mean. The fourth is ensembling, where you train several models, or several realizations of the same model, and read off the spread of their predictions as a measure of uncertainty. ::: {.callout-tip title="Intuition"} All four methods replace a single guess with a *spread* of plausible answers. 
They differ mainly in *where* the randomness comes from: from the weights (Bayesian, dropout), from the output layer (mixture density), or from training many models (ensembles). ::: We expand on the mixture density approach below because it is the most self-contained of the four and ties directly to material elsewhere in the book. ### Explainability Explainability asks the complementary question: given a prediction, which features were responsible, and how? @samek2021a group the main techniques into four families. We list them here and explain the two most widely used (LIME and Shapley values) in their own sections. The four primary approaches are: - Interpretable local surrogates, which fit a simple, readable model that mimics the black box near one input. LIME is the canonical example. - Occlusion analysis, which measures how the prediction changes when you hide or perturb parts of the input. Shapley values, SHAP, Kernel SHAP, and meaningful perturbation all live here. - Integrated gradients, which attribute the prediction to inputs by accumulating gradients along a path from a baseline to the actual input; SmoothGrad is a refinement. - Layerwise relevance propagation, which traces a prediction backward through the network, redistributing "relevance" from the output to the input features. Beyond these four, two design choices can build explainability into the model itself rather than bolting it on afterward. A self-explainable model produces explanations as part of its normal operation, for example through attention mechanisms (see @sec-transformers) that reveal which inputs the model attended to. A specialized model encodes structure that is inherently interpretable, such as graph neural networks (@sec-graph-neural-networks) that respect a known relational structure in the data. ::: {.callout-warning} Explanations can be gamed. 
It is possible to adjust a model so that it makes the *same* predictions while producing explanations that better fit a story someone wants to tell. An explanation is evidence, not proof; treat a too-convenient explanation with suspicion, and prefer methods and audits you cannot quietly tune. ::: With the landscape in view, we now study three workhorse methods in detail. ### Mixture Density Networks Most neural networks for regression predict a single number: the conditional mean of the output given the inputs. That throws away information about how spread out or multi-modal the answer could be. A mixture density network keeps that information by predicting a whole distribution. The starting point is a classical fact: any reasonable distribution can be approximated arbitrarily well by a mixture of Gaussians, given enough components. We can therefore write the conditional distribution of the output as $$ p(y|\mathbf{x}) = \sum_{k=1}^K \alpha_k (\mathbf{x}) \phi(\mu_k(\mathbf{x}), \sigma_k (\mathbf{x})) $$ where $\alpha_k(\mathbf{x})$ are mixture coefficients associated with the Gaussian distribution defined by mean and standard deviation $\phi(\mu_k(\mathbf{x}), \sigma_k (\mathbf{x})$ and all of these parameters depend on the inputs $\mathbf{x}$. The trick that makes this a *network* is what the network is asked to output. Instead of training the neural network to estimate the conditional mean, we estimate all the mixture parameters $\{ \alpha_k(\mathbf{x}), \mu_k (\mathbf{x}), \sigma_k (\mathbf{x})\}$ (i.e., these parameters are the output of the NN). Given an input, the network emits the weights, means, and standard deviations of the mixture, and those define a full predictive distribution rather than a point estimate. This method can use mixtures of other distributions as well.^[The mixture-of-Gaussians presentation follows Bishop's *Pattern Recognition and Machine Learning* (2006), where mixture density networks are developed in detail. 
The Gaussian case is the usual default, but nothing in the construction requires it.]

::: {.callout-tip title="When to use this"}
Reach for a mixture density network when the conditional distribution of the output is skewed, heavy-tailed, or multi-modal, so that a single mean (and even a single variance) would be misleading. The classic example is an inverse problem where one input can plausibly map to several distinct outputs.
:::

### Local Interpretable Model-Agnostic Explanations

Local Interpretable Model-Agnostic Explanations, denoted LIME [@ribeiro2016], explains one prediction at a time by fitting a simple model that is faithful to the black box *near that one input*. The name spells out the three commitments. Local means it explains individual predictions, building a small surrogate model that holds only in the neighborhood of the instance of interest. Interpretable means that surrogate is something a human can read, such as a short linear model. Model-agnostic means it works for any underlying model, because it never looks inside the black box; it only queries its predictions.

The motivation is practical. It is hard to read off, from a complicated model, which variable mattered for a particular output. LIME reframes the question as feature attribution: the credit assigned to each base feature (the inputs, not the model's internal parameters) is interpreted as that feature's importance to the prediction. By treating every model as a black box and asking only why it made each individual prediction, LIME lets a user decide whether to trust a given output, which is often exactly the decision that matters in deployment.

The engine behind this is the idea of a local surrogate model: a simple model that approximates the black box's behavior in a small region. Rather than fit one global surrogate to the whole model (which would force the simple model to be unrealistically flexible), LIME fits many local surrogates, one per prediction it is asked to explain.
Each surrogate need only be accurate in the neighborhood of its own instance, a requirement called local faithfulness.

The procedure for building one local surrogate is as follows:

1. Select a prediction from the black-box model that you want to explain.
2. Perturb the dataset and get the black-box predictions for these new points.
3. Weight the new samples based on their proximity to the point of interest (i.e., your chosen prediction).
4. Train a weighted, interpretable model (e.g., regression) on the perturbed dataset, using the black-box predictions as the response.
5. Explain the prediction by interpreting the local model.

::: {.callout-tip title="Intuition"}
To learn how a complicated surface behaves at one point, zoom in until it looks roughly flat, then fit a straight line there. LIME does exactly this: it jiggles the input, watches how the black box responds nearby, and fits a simple, readable model to that local response.
:::

Formally, LIME produces an explanation for a given instance $(\mathbf{y}, \mathbf{x})$, where the goal is to find which of the features in $\mathbf{x}$ are most responsible for a given prediction $\mathbf{\hat{y}}$. It does so by solving

$$
\text{explanation}(\mathbf{x}) = \underset{g \in G}{\operatorname{argmin}} \; L (f, g, \pi_x) + \Omega(g)
$$

where the pieces are:

- $f$: the original black-box model.
- $g$: the explanation model for the instance.
- $G$: the family of possible explanation models (e.g., linear regression models).
- $L$: a loss that measures how poorly $g$ approximates $f$ in the neighborhood defined by $\pi_x$.
- $\pi_x$: a proximity measure that determines how large the neighborhood is around $\mathbf{x}$.
- $\Omega(g)$: model complexity, which we want to be small so the explanation uses few features.

In words, the objective trades off two things: minimizing the loss $L(f, g, \pi_x)$ favors a surrogate $g$ that matches the black box $f$ where it counts (close to $\mathbf{x}$, as judged by $\pi_x$), while $\Omega(g)$ penalizes complicated surrogates so the explanation stays short and readable.
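
The five steps above can be sketched from scratch in a few lines. This is a minimal illustration, not the algorithm as implemented in any LIME package: the black-box function `black_box`, the Gaussian perturbation scheme, the kernel form `exp(-d^2 / width^2)`, and all default values are assumptions made for the example, and the complexity penalty $\Omega(g)$ is omitted because the toy problem has only two features.

```python
import math
import random

def black_box(x):
    # Hypothetical opaque model: nonlinear in x[0], linear in x[1].
    return x[0] ** 2 + 3.0 * x[1]

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def lime_explain(f, x0, n_samples=2000, noise_sd=0.5, kernel_width=0.75, seed=0):
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Step 2: perturb the instance and query the black box.
        z = [xj + rng.gauss(0.0, noise_sd) for xj in x0]
        # Step 3: weight each perturbed point by its proximity to x0.
        d2 = sum((zj - xj) ** 2 for zj, xj in zip(z, x0))
        weights.append(math.exp(-d2 / kernel_width ** 2))
        rows.append([1.0] + z)          # leading 1.0 is the intercept column
        targets.append(f(z))
    # Step 4: weighted least squares via the normal equations (X'WX) b = X'Wy.
    k = len(x0) + 1
    A = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(k)]
         for i in range(k)]
    b = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
         for i in range(k)]
    beta = solve(A, b)
    return beta[0], beta[1:]            # intercept, per-feature slopes

# Step 1: choose the instance to explain; Step 5: read off the local slopes.
intercept, slopes = lime_explain(black_box, [1.0, 0.0])
print("local slopes:", slopes)
```

Because the toy black box has derivative $2x_1 = 2$ and $3$ at the chosen point, the fitted slopes should land near 2 and 3; they are the local feature attributions. Changing `kernel_width` changes how local the fit is, which is the sensitivity the rest of this section discusses.
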
The neighborhood itself is set by the proximity measure. Here $\pi_x$ is an exponential smoothing kernel, and its width is the key dial. A small kernel width means an observation must sit very close to $\mathbf{x}$ before it influences the local model, giving a tightly local explanation. A large kernel width lets observations quite different from $\mathbf{x}$ pull on the local model, giving a broader, less local explanation. Kernel widths are unavoidably subjective, so in practice you experiment with several and keep the ones that yield meaningful explanations.

::: {.callout-warning}
Because the kernel width is a free choice and the explanation can shift with it, two analysts can produce different "explanations" of the same prediction. Report the width you used and check that the explanation is stable across a reasonable range.
:::

A few practical notes round this out. LIME is straightforward to use in R through the `lime` or `iml` package.^[If you have a fitted model and want to try this, `lime` follows the steps above: build an explainer from the training data and model, then call `explain()` on the rows you want to interpret. The harder, less automatable part is judging quality, not running the code.] The difficult part is not running it but evaluating how well the local surrogate actually estimates the complex model in that neighborhood, since a poor local fit quietly undermines the interpretation. A heat map is a convenient way to display how all the variables influence the predictions at once.

For further reading on LIME, see:

- <https://www.oreilly.com/content/introduction-to-local-interpretable-model-agnostic-explanations-lime/>
- <https://homes.cs.washington.edu/~marcotcr/blog/lime/>
- @ribeiro2016

### Shapley Values

Where LIME fits a local model, Shapley values take a game-theoretic route to the same attribution problem. They give a principled way to explain the output of *any* machine learning model, and they migrated into statistics from economics.
They also combine naturally with other approaches, including LIME, as we will see when we reach Kernel SHAP. The goal is the same as LIME's: distribute a model's prediction score for a specific input among its base features.

The idea comes from cooperative game theory. Shapley values were introduced by Lloyd Shapley, in a paper published in 1953, as a way to value each player's contribution in a cooperative game; Shapley later shared the 2012 Nobel Memorial Prize in Economic Sciences. The recipe, in its original setting, is to determine the value of players individually and in subsets, compute each player's marginal value across all possible orderings of the players, and then average those marginal values to get the player's Shapley value.

::: {.callout-tip title="Intuition"}
Imagine team members joining a project one at a time, in every possible order. Each time someone joins, note how much the team's output goes up; that bump is their marginal contribution for that ordering. A player's Shapley value is the average of these bumps over all orderings. Swap "player" for "feature" and "team output" for "prediction" and you have feature attribution.
:::

To make this precise, consider a game with $N$ players, labeled $1, \dots, N$. The characteristic (value) function $\nu$ maps each subset $S \subseteq \{1,\dots, N\}$ to a real value $\nu(S)$, the value to be gained from that coalition of players cooperating. The marginal contribution of player $i$ to a coalition $S$ that does not contain $i$ is

$$
\Delta_\nu (i,S) = \nu (S \cup \{i\}) - \nu(S)
$$

that is, how much the coalition's value rises when $i$ joins it.
The Shapley value (player $i$'s contribution to the full "grand coalition") is the suitably weighted average of these marginal contributions:

$$
\phi_i (\nu) = \frac{1}{N} \sum_{S \subseteq \{1,\dots,N\} \setminus \{i\}} \binom{N-1}{|S|}^{-1} \Delta_\nu (i,S)
$$

which reads, in words, as

$$
\phi_i (\nu) = \frac{1}{\text{number of players}} \sum_{\text{coalitions excluding } i} \frac{\text{marginal contribution of } i \text{ to coalition}}{\text{number of coalitions excluding } i \text{ of this size}}
$$

Part of why Shapley values are so attractive is that they are the *unique* attribution satisfying a short list of fairness-like axioms. Those properties are:

- Uniqueness: the Shapley value is the only attribution that satisfies the remaining properties together.
- Efficiency: the Shapley values of all players sum to the value of the grand coalition (taking the empty coalition to have value 0), so the full payoff is distributed with nothing left over.
- Symmetry: if two players have equal marginal contributions to every coalition, they receive equal Shapley values.
- Dummy: if a player's marginal contribution is always 0, its Shapley value is 0.
- Additivity: for a single player $i$ with two value functions $\nu$ and $w$,

$$
\phi_i(\nu) + \phi_i (w) = \phi_i (\nu + w)
$$

These axioms are exactly what make the attributions line up with human intuitions about fair credit, which is why Shapley values are useful for ensuring that feature attribution corresponds to human understanding.

::: {.callout-important title="Key idea"}
Shapley values are the unique feature attribution that is fair in a precise, axiomatic sense (efficiency, symmetry, dummy, additivity). That uniqueness is the main reason they have become a default explanation tool.
:::

To use this for predictive modeling, treat a single prediction as the "game": the features $x_1, \dots, x_p$ are the players, and the value function $\nu_f$ is defined with respect to the model $f$. Then the Shapley value for feature $i$ is the influence of that feature on the outcome.
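
To see the weighted-average formula in action before any model enters the picture, the sketch below computes exact Shapley values for a small cooperative game by enumerating every coalition. The three-player game `toy_value` (individual worths of 10, 20, and 30, plus a bonus of 12 when players `a` and `b` cooperate) is invented for illustration, as is the helper name `shapley_values`.

```python
from itertools import combinations
from math import comb

def shapley_values(players, value):
    """Exact Shapley values: (1/N) * sum over coalitions S excluding i of
    marginal contribution divided by C(N-1, |S|)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):                      # |S| = 0, ..., N-1
            for S in combinations(others, size):
                S = frozenset(S)
                marginal = value(S | {i}) - value(S)
                total += marginal / comb(n - 1, size)
        phi[i] = total / n
    return phi

def toy_value(S):
    # Hypothetical value function: additive worths plus an a-b synergy bonus.
    base = {"a": 10.0, "b": 20.0, "c": 30.0}
    bonus = 12.0 if {"a", "b"} <= S else 0.0
    return sum(base[p] for p in S) + bonus

phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

The synergy bonus is split evenly between `a` and `b`, giving values of 16, 26, and 30, which add up to the grand coalition's value of 72. The enumeration visits $2^{N-1}$ coalitions per player, which is why exact computation is only feasible for a handful of players or features.
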
Everything hinges on how one defines the value function $\nu_f$, because that is what turns an abstract game into a concrete statement about a model.

Researchers have proposed two broad families of value functions. The conditional expectation variants ask what the model predicts on average given the known features, and include Shapley regression values [@lipovetsky2001; @lipovetsky2001a], Shapley sampling values [@trumbelj2013], and SHAP [@lundberg2017unified]. The intervention variants instead actively set features and measure the effect, and include quantitative input influence [@datta2016] and the Formulate, Approximate, Explain framework [@merrick2020explanation].

#### SHAP (Shapley Additive Explanations)

SHapley Additive exPlanation (SHAP) values are the most popular conditional-expectation variant. They define the value function as

$$
\begin{aligned}
\nu_{f,x}(S) = f(x_S) &= E[f(x) \mid x_S] = E_{x_{\bar{S}} \mid x_S}[f(x)] \\
&\approx E_{x_{\bar{S}}}[f(x)] && \text{(assume feature independence)} \\
&\approx f([x_S, E[x_{\bar{S}}]]) && \text{(assume model linearity)}
\end{aligned}
$$

where $x_S$ has missing values for the features not in set $S$. Read top to bottom, the value of a coalition is the model's expected output given the known features; the two approximations below it are the practical shortcuts (assume features are independent, then assume the model is locally linear) that make the expectation computable.

Exact computation is challenging because it sums over all feature subsets, but there are many model-agnostic and model-type-specific approximation methods, with varying assumptions such as the feature independence and model linearity shown in the equation above.

The model-agnostic approximations work for any model. If we assume feature independence, we can use the sampling approximations of @trumbelj2013 and @datta2016, which are reasonable to compute only for a small number of inputs.
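
A sampling approximation under the independence assumption can be sketched as follows. This is a generic permutation-sampling estimator in the spirit of the methods cited above, not a reproduction of any one paper's algorithm: the model `model`, the synthetic `background` data, the helper name `sampled_shapley`, and the number of draws are all assumptions for illustration. "Missing" features are filled in from a randomly drawn background row, which is what treating features as independent amounts to here.

```python
import random

def model(x):
    # Hypothetical black box with an interaction between features 1 and 2.
    return 2.0 * x[0] + x[1] * x[2]

def sampled_shapley(f, x, background, n_draws=2000, seed=0):
    rng = random.Random(seed)
    p = len(x)
    phi = [0.0] * p
    baseline = 0.0
    for _ in range(n_draws):
        # Features outside the coalition take values from a random background row.
        current = list(rng.choice(background))
        order = list(range(p))
        rng.shuffle(order)                 # one random ordering of the "players"
        prev = f(current)
        baseline += prev
        for j in order:
            current[j] = x[j]              # feature j joins the coalition
            new = f(current)
            phi[j] += new - prev           # its marginal contribution
            prev = new
    return [v / n_draws for v in phi], baseline / n_draws

data_rng = random.Random(1)
background = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
x = [1.0, 2.0, 3.0]
phi, baseline = sampled_shapley(model, x, background)
print("attributions:", phi)
print("sum:", sum(phi), "vs f(x) - baseline:", model(x) - baseline)
```

Within each draw the marginal contributions telescope, so the attributions sum to $f(x)$ minus the average prediction over the sampled background rows. The cost is one model call per feature per draw, which adds up quickly as the number of features grows.
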
Kernel SHAP, which is essentially LIME combined with Shapley values, reformulates the problem as a weighted linear regression and jointly estimates all the SHAP values at once.^[This is the bridge promised earlier: Kernel SHAP borrows LIME's local-surrogate machinery but chooses the weighting so that the resulting coefficients *are* Shapley values, inheriting their fairness axioms.]

The model-type-specific approximations exploit known structure for big speedups. They include Linear SHAP, Low-order SHAP, Max SHAP, Deep SHAP (DeepLift combined with Shapley values), and Tree SHAP [@lundberg2018consistent], the last of which makes exact SHAP values tractable for tree ensembles.

For implementations, you have several choices depending on your language and model type:

- Python: `shap`, developed by Lundberg, provides Kernel SHAP plus model-type-specific SHAP algorithms, with support for TensorFlow/Keras/PyTorch models.
- R: `shapr` provides Kernel SHAP plus the @aas2021 method for accounting for feature dependence, with support for `gam`, `lm`, `glm`, `xgb.Booster`, and `ranger`; `shapper` is a port of the Python `shap` library.
- Others: `fastshap`, `iml`, shap-value, `DALEX`, and `shapleyR`.

::: {.callout-tip}
For tree ensembles (random forests, gradient boosting), prefer Tree SHAP. It computes exact Shapley values quickly by using the tree structure, instead of the slow sampling needed in the model-agnostic case.
:::

Shapley values are powerful but not above criticism, and the critiques fall into two groups.

The mathematical issues concern what the numbers really mean. A feature with no direct effect on the response can still be assigned influence if it is predictive of other features (indirect influence). It is unclear how statistically related features should be handled, for instance whether correlated features should be treated as a single "player."
And intervention-based methods can rely on out-of-distribution sampling, extrapolating into regions of feature space where no real data live.

The human-interpretability issues concern how people read the numbers. A Shapley value answers the question "Why $f(x)$ instead of $E[f(x)]$?", yet we may never actually observe data with the baseline outcome $E[f(x)]$, so the comparison can feel abstract. Shapley values also do not tell you how to *change* an outcome (they are descriptive, not prescriptive), and they are easy to misinterpret as causal when they are not.

::: {.callout-warning}
A SHAP value is not a causal effect and not a recipe for action. It explains why this prediction differs from the average prediction, under specific assumptions about feature dependence. When features are correlated, read the attributions with extra care.
:::

For further reading on Shapley values and their use in machine learning, see @datta2016, @kumar2020problems, @lipovetsky2001a, @lundberg2018consistent, @lundberg2017unified, @merrick2020explanation, @trumbelj2013, and @sundararajan2020many.