Advanced Data Analysis

Nguyen, Mike

22 Introduction

Every method we have studied so far has had one thing in common: we always knew the right answer for at least some of our data. We had a response we cared about, $Y$, and a collection of features, $X$, and our job was to learn the relationship between them so we could predict $Y$ for new observations. That setting is called supervised learning. This chapter opens a different door. We will keep the features $X$ but throw away the response $Y$, and ask a question that sounds almost philosophical at first: with no answer key in hand, what can the data tell us about itself?

This short chapter sets up the rest of the unsupervised part of the book. By the end you will understand what makes a learning problem “unsupervised,” why these problems are both harder and more open-ended than the supervised ones, and the main families of techniques we will use to attack them.

22.1 From Supervised to Unsupervised Learning

Up until now, we have covered supervised learning methods. In that setting we have a set of $n$ observations of $p$ features, $X^T = (X_1, \dots, X_p)$, and for each observation we also observe a response $Y$. The goal is to predict $Y$ from $X$ as accurately as possible.¹

The word “supervised” is more than a label. The response $Y$ acts like a teacher standing over the model’s shoulder: it tells the model when a prediction is right and when it is wrong, usually through some loss function, and we can measure how well the lesson was learned by checking performance with cross-validation or a held-out validation sample.

Key idea

In supervised learning the response $Y$ supervises the model. It defines what “correct” means, which is exactly what lets us tune models and compare them on an honest scale.
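To make the role of $Y$ concrete, here is a minimal sketch in base R. The simulated data, the 25% hold-out split, and the helper name `mse` are all assumptions chosen for illustration, not something prescribed earlier in the book. It fits two candidate models on training rows and compares them by squared-error loss on held-out rows.

```r
# Simulated data: the "true" relationship is an assumption made for illustration.
set.seed(1)
n <- 200
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Hold out 25% of the rows as a validation sample.
test_idx <- sample(n, size = n / 4)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Two candidate models, both fit on the training rows only.
fit_null   <- lm(y ~ 1, data = train)  # ignores x
fit_linear <- lm(y ~ x, data = train)  # uses x

# Hypothetical helper: mean squared error on new data.
mse <- function(fit, newdata) mean((newdata$y - predict(fit, newdata))^2)

# Because Y is observed on the held-out rows, the two models can be ranked.
mse_null   <- mse(fit_null, test)
mse_linear <- mse(fit_linear, test)
c(null = mse_null, linear = mse_linear)
```

The comparison in the last line is only possible because `test$y` exists. Remove that column and there is nothing left to score either model against.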

Now we shift our attention to unsupervised learning. We still have a set of $n$ observations of $p$ features $X$, but there is no $Y$. Nothing tells the model what the right answer is, because there is no designated right answer. Instead of prediction, our goal is to find structure in these features: groups that hang together, directions along which the data vary the most, or a simpler description of a complicated cloud of points.

Intuition

Supervised learning is like studying with the answer key beside you. Unsupervised learning is like being handed a stack of unlabeled photographs and asked to sort them into meaningful piles. Nobody tells you what the piles should be, and two thoughtful people might sort them differently.

This freedom comes at a price. Because there is no $Y$, there is usually no single number that tells us whether we have done a good job. We cannot compute a test error against a known truth, so judging the result often relies on subject-matter knowledge, visualization, and the usefulness of the structure we uncover.²

Warning

Without a response to validate against, it is easy to “find” structure that is really just noise. Always sanity-check unsupervised results against domain knowledge and, where possible, assess their stability before reading too much into them.
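The sketch below illustrates the warning with one possible stability check; it is an assumption of this example, not the only or the standard way to do it. We cluster simulated pure noise and simulated well-separated blobs with `kmeans()`, then measure how much the grouping changes when the model is refit on bootstrap resamples. The helpers `nearest_center`, `rand_index`, and `stability` are hypothetical names written for this illustration.

```r
set.seed(42)

# Two simulated data sets (assumptions of this sketch):
# pure noise, and three well-separated blobs.
noise <- matrix(rnorm(300 * 2), ncol = 2)
mu    <- rbind(c(0, 0), c(6, 0), c(3, 6))
blobs <- mu[rep(1:3, each = 100), ] + matrix(rnorm(300 * 2, sd = 0.5), ncol = 2)

# k-means returns exactly k groups whether or not groups exist.
table(kmeans(noise, centers = 3, nstart = 10)$cluster)

# Assign each row of x to its nearest center (squared Euclidean distance).
nearest_center <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)),
              function(k) colSums((t(x) - centers[k, ])^2))
  apply(d, 1, which.min)
}

# Rand index: share of point pairs on which two labelings agree
# (together in both, or apart in both).
rand_index <- function(a, b) {
  up <- upper.tri(matrix(0, length(a), length(a)))
  mean(outer(a, a, "==")[up] == outer(b, b, "==")[up])
}

# Average agreement between the clustering of the full data and
# clusterings refit on B bootstrap resamples.
stability <- function(x, k, B = 20) {
  full <- kmeans(x, centers = k, nstart = 10)$cluster
  mean(replicate(B, {
    boot <- x[sample(nrow(x), replace = TRUE), , drop = FALSE]
    fit  <- kmeans(boot, centers = k, nstart = 10)
    rand_index(full, nearest_center(x, fit$centers))
  }))
}

stab_noise <- stability(noise, k = 3)
stab_blobs <- stability(blobs, k = 3)
c(noise = stab_noise, blobs = stab_blobs)
```

A stability value near 1 means the same points keep landing together across resamples. Clusters carved out of noise tend to shift from one resample to the next, so their score should come out lower than that of the blobs.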

When to use this

Reach for unsupervised learning when you have no labels but suspect the data has hidden organization, when labels are too expensive to collect, or when you want to compress or summarize many features before feeding them into a downstream model.

22.2 What We Will Cover

Two general types of unsupervised learning anchor this part. The first looks for structure among the observations (rows), and the second looks for structure among the features (columns).

1. Cluster analysis (Chapter 23), i.e., data segmentation, which partitions the observations into groups that are similar within a group and different across groups. This is the tool you want when you suspect your data contains distinct subpopulations, such as customer segments or cell types.
2. Dimension reduction (Chapter 27), which replaces many correlated features with a smaller set of new features that retain most of the information. This is the tool you want when $p$ is large and you need a compact, often visualizable, summary of the data.

Building on these, this part of the book also takes up scaling (Chapter 31), density estimation (Chapter 32), and interpretable machine learning (Chapter 35).

Tip

These two families are complementary and often used together. A common workflow is to apply dimension reduction first to denoise and shrink the feature space, then cluster in that reduced space.
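As a rough sketch of that workflow in base R, the example below runs `prcomp()` and then `kmeans()` on the leading principal component scores. The simulated data (two groups hidden in 20 noisy, correlated features), the choice to keep two components, and the choice of $k = 2$ are all assumptions made for illustration. In real data neither the number of components nor the number of clusters is known in advance.

```r
set.seed(7)
n_per <- 100
group <- rep(1:2, each = n_per)  # hidden labels, used only to check the result

# Simulated data: a 2-D latent signal in which the groups differ,
# spread across 20 observed features plus independent noise.
z <- cbind(rnorm(2 * n_per, mean = ifelse(group == 1, -3, 3)),
           rnorm(2 * n_per))
W <- matrix(rnorm(2 * 20), nrow = 2)
X <- z %*% W + matrix(rnorm(2 * n_per * 20), ncol = 20)

# Step 1: dimension reduction. Center and scale, then keep the first two
# principal component scores.
pca    <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Step 2: cluster in the reduced space.
km <- kmeans(scores, centers = 2, nstart = 10)

# Cross-tabulate clusters against the hidden groups. Cluster numbering is
# arbitrary, so agreement may show up on either diagonal.
table(cluster = km$cluster, group = group)
agreement <- max(mean(km$cluster == group), mean(km$cluster != group))
```

The hidden `group` vector is available here only because the data are simulated. In a real unsupervised problem that check is unavailable, which is why domain knowledge and stability matter.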

With the distinction between supervised and unsupervised learning in hand, and a map of the two families ahead of us, we are ready to start with clustering.

¹ Regression and classification are both supervised: in regression $Y$ is numeric, in classification $Y$ is a label. What unites them is that a true $Y$ exists for the training data.
² This is why unsupervised methods are sometimes called exploratory: they are frequently a first step that suggests hypotheses or features, rather than a final answer in themselves.
