Most of the methods in this book expect their inputs as numbers: a row of predictors, a design matrix, a feature vector. Text does not arrive in that form. A product review, a clinical note, or a news headline is just a string of characters, and before any classifier or regressor can use it we need a way to turn that string into numbers that capture what it means. This chapter is about one of the most effective tools for doing exactly that.
BERT (Bidirectional Encoder Representations from Transformers) is a family of pretrained language models that produce contextual representations of text. Introduced by Google in 2018, BERT became one of the foundational tools for modern natural language processing (NLP)[1] and remains a workhorse for prediction tasks where the input is text: document classification, sentiment analysis, named-entity recognition, and question answering.
For a prediction-focused course, the key idea is this: BERT lets you turn raw text into high-quality numeric features (embeddings) that you can feed into a downstream classifier or regressor, or fine-tune the whole network end-to-end for your specific task. Either way, you inherit knowledge learned from a massive text corpus without having to train a language model yourself.
Key idea
BERT is a reusable “text-to-numbers” engine. Someone else paid the enormous cost of teaching it language; you reuse the result, either as off-the-shelf features or as a starting point you nudge toward your own task.
By the end of this chapter you will understand what makes BERT’s representations “contextual,” the core mechanics of the Transformer encoder it is built from, how it is trained before you ever see it, and how to put it to work from both R and Python. We begin with the idea that motivated the whole approach: the difference between a fixed word vector and one that pays attention to context.
39.1 From Static to Contextual Embeddings
Earlier word-embedding methods such as word2vec and GloVe map each word type to a single fixed vector (Chapter 110). The word “bank” gets one vector regardless of whether the sentence is about a river bank or a savings bank. These are static embeddings: useful, but blind to context.
BERT instead produces contextual embeddings. The vector it assigns to a token depends on the entire surrounding sentence, so “bank” in “river bank” and “bank” in “deposit money in the bank” receive different representations. This sensitivity to context is what makes BERT dramatically more powerful than static embeddings on tasks that hinge on meaning in context.
The word bidirectional in the name is central. Earlier contextual models (e.g., left-to-right language models) read text in one direction. BERT conditions each token’s representation on words to both its left and its right simultaneously, giving a fuller picture of context.
Intuition
To guess the missing word in “I deposited my check at the ___,” reading only left-to-right (“I deposited my check at the”) already points strongly toward “bank.” But for “the ___ overflowed after three days of rain,” you need the words after the blank to tell that this is a river, not a vault. Seeing both sides at once is what lets BERT resolve meaning that a one-directional reader would miss.
Now that we know what BERT produces, let us look at the machinery that produces it.
39.2 The Transformer Encoder
BERT is built entirely from the encoder half of the Transformer architecture (Chapter 38). You do not need the full mathematical machinery to use BERT, but the following intuitions are worth carrying.
39.2.1 Self-Attention
The core operation is self-attention. For each token, the model computes how much it should “attend to” every other token in the sequence, then forms a weighted combination of their representations. Concretely, each token is projected into a query (\(Q\)), a key (\(K\)), and a value (\(V\)) vector, and attention weights are computed as a scaled dot product:
\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
\]
where \(d_k\) is the dimension of the key vectors and the \(\sqrt{d_k}\) term keeps the dot products from growing too large. The softmax turns the similarities between a query and all keys into a probability distribution; the output is the correspondingly weighted average of the values. The practical effect: a token can pull in information from any other token in the sentence, no matter how far away, in a single step.
Intuition
A useful mental model is a small lookup or search. Each token issues a query (“which other words should I be paying attention to?”), every token advertises a key (“here is what I am about”), and the dot product between query and key measures how well they match. The softmax converts those match scores into weights that sum to one, and the token then collects a weighted mix of the values. The word “it” can thereby reach back and grab information from the noun it refers to, even if that noun sat ten words earlier.
39.2.1.1 Formulation
Fix notation. Let the input sequence have \(n\) tokens, each embedded as a row vector in \(\mathbb{R}^{d_{\text{model}}}\), and collect them as the rows of \(X \in \mathbb{R}^{n \times d_{\text{model}}}\). A single attention head is defined by three learned projection matrices \(W_Q \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}\), and \(W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}\), which produce the query, key, and value matrices
\[
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V .
\tag{39.1}\]
The head output is then
\[
\text{Attention}(Q,K,V)
= \underbrace{\operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)}_{A \,\in\, \mathbb{R}^{n\times n}} V ,
\tag{39.2}\]
where the softmax is applied independently to each row, so each row of the attention matrix \(A\) is a probability vector: \(A_{ij} \ge 0\) and \(\sum_{j=1}^{n} A_{ij} = 1\). Writing this row by row makes the operation transparent. With \(q_i = x_i W_Q\) the query for token \(i\) and \(k_j = x_j W_K\) the key for token \(j\), the weight token \(i\) places on token \(j\) is
\[
A_{ij} = \frac{\exp\!\big(q_i^\top k_j / \sqrt{d_k}\big)}{\sum_{\ell=1}^{n} \exp\!\big(q_i^\top k_\ell / \sqrt{d_k}\big)},
\qquad
\text{output}_i = \sum_{j=1}^{n} A_{ij}\, v_j .
\tag{39.3}\]
So token \(i\)’s new representation is a convex combination of all value vectors, with the weights determined by query-key compatibility. Because the softmax makes every \(A_{ij}\) strictly positive, information can flow between any pair of positions in a single layer; this is the precise sense in which attention has a “receptive field” spanning the whole sequence, in contrast to the local receptive field of a convolution or the sequentially decaying memory of a recurrent network.
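The row-by-row description can be traced in a few lines of plain Python. This is a minimal sketch, not BERT's implementation: the helper names (`softmax`, `matmul`, `attention`) and the toy matrices are assumptions chosen for illustration, and matrices are stored as lists of rows.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention; returns (A, output)."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    # Logit for (i, j) is q_i . k_j / sqrt(d_k); one row per query token.
    logits = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k) for kj in K]
              for qi in Q]
    A = [softmax(row) for row in logits]   # each row is a probability vector
    return A, matmul(A, V)                 # output_i = sum_j A_ij v_j

# Toy input (made-up values): n = 3 tokens, d_model = d_k = d_v = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[2.0, 0.0], [0.0, 3.0]]

A, out = attention(X, W_Q, W_K, W_V)
for row in A:
    print([round(a, 3) for a in row], "row sum =", round(sum(row), 6))
```

Every row of `A` sums to one and has no zero entries, so each output row is a weighted mix of all three value vectors.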
39.2.1.2 Why divide by \(\sqrt{d_k}\)
The scaling factor is not cosmetic; it controls the variance of the logits and keeps the softmax in its well-conditioned regime. Suppose the entries of \(q_i\) and \(k_j\) are independent, mean-zero, unit-variance random variables (a reasonable model at initialization). The unscaled logit is \(q_i^\top k_j = \sum_{m=1}^{d_k} q_{im} k_{jm}\). Each summand has mean \(\mathbb{E}[q_{im} k_{jm}] = 0\) and variance \(\operatorname{Var}(q_{im}k_{jm}) = \mathbb{E}[q_{im}^2]\,\mathbb{E}[k_{jm}^2] = 1\) by independence, so
\[
\mathbb{E}\big[q_i^\top k_j\big] = 0,
\qquad
\operatorname{Var}\big(q_i^\top k_j\big) = \sum_{m=1}^{d_k} \operatorname{Var}(q_{im} k_{jm}) = d_k .
\tag{39.4}\]
The logits therefore have standard deviation \(\sqrt{d_k}\), which grows with the head dimension. Dividing by \(\sqrt{d_k}\) rescales them back to unit variance. This matters because the softmax saturates: if one logit is much larger than the rest, the output approaches a one-hot vector and the Jacobian of the softmax collapses toward zero. Concretely, for \(p = \operatorname{softmax}(z)\),
\[
\frac{\partial p_i}{\partial z_j} = p_i\,(\delta_{ij} - p_j),
\tag{39.5}\]
every entry of which vanishes as the softmax saturates, that is, as some \(p_i \to 1\) and the remaining probabilities go to zero. Letting the logits scale as \(\sqrt{d_k}\) would push the softmax into these flat regions and starve the upstream projections of gradient. The \(1/\sqrt{d_k}\) factor is exactly the normalization that holds the logit variance at \(O(1)\) regardless of \(d_k\).
We can confirm Equation 39.4, that the variance of the unscaled dot product equals \(d_k\) and the scaled version has unit variance, with a quick simulation in base R.
The first row tracks \(d_k\) almost exactly, while the second row stays near one, which is the whole point of the scaling.
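The same check can be sketched in plain Python. This is an illustrative stand-in under stated assumptions (standard-normal entries, a fixed seed, a hypothetical helper `dot_variance`), not a reproduction of the chapter's R chunk.

```python
import random
import statistics

random.seed(0)

def dot_variance(d_k, n_draws=4000):
    """Empirical variance of q.k and of q.k / sqrt(d_k) for standard-normal entries."""
    raw = []
    for _ in range(n_draws):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [r / d_k ** 0.5 for r in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)

results = {d_k: dot_variance(d_k) for d_k in (4, 16, 64)}
for d_k, (v_raw, v_scaled) in results.items():
    print(f"d_k={d_k:3d}  unscaled var={v_raw:7.2f}  scaled var={v_scaled:5.2f}")
```

Up to sampling noise, the unscaled variance should sit close to \(d_k\) and the scaled variance close to one.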
39.2.2 Multi-Head Attention and Positional Encodings
A single round of attention is powerful but limited: it produces one pattern of “who attends to whom.” Two refinements address that limitation and a structural blind spot of attention.
Multi-head attention. Rather than computing attention once, the model runs several attention “heads” in parallel, each with its own learned projections. Different heads can specialize, e.g., one tracking syntactic dependencies, another tracking coreference. Their outputs are concatenated and recombined.
Positional encodings. Self-attention is permutation-equivariant: shuffling the input tokens merely shuffles the outputs in the same way, so on its own it has no notion of word order. BERT adds (learned) positional embeddings to the token embeddings so the model knows which token is first, second, and so on.
Stacking many such attention-plus-feed-forward layers (12 layers in BERT-base, 24 in BERT-large) yields a deep network whose hidden states are the contextual representations we use downstream.
39.2.2.1 Multi-head attention, formally
With \(h\) heads, each head \(r = 1,\dots,h\) has its own projections \(W_Q^{(r)}, W_K^{(r)}, W_V^{(r)}\) and computes \(\text{head}_r = \text{Attention}(XW_Q^{(r)}, XW_K^{(r)}, XW_V^{(r)})
\in \mathbb{R}^{n \times d_v}\). The heads are concatenated along the feature axis and mixed by an output projection \(W_O \in \mathbb{R}^{h d_v \times d_{\text{model}}}\),
\[
\operatorname{MultiHead}(X) = \operatorname{Concat}\big(\text{head}_1, \dots, \text{head}_h\big)\, W_O \;\in\; \mathbb{R}^{n \times d_{\text{model}}} .
\tag{39.6}\]
In BERT the per-head dimensions are tied to the model width by \(d_k = d_v = d_{\text{model}}/h\) (for BERT-base, \(d_{\text{model}}=768\), \(h=12\), so \(d_k = 64\)). This keeps the total computation and parameter count of multi-head attention essentially equal to that of a single full-width head while allowing \(h\) distinct attention patterns. The reason a single head is insufficient is structural: each head produces a single \(n \times n\) stochastic matrix \(A\), and a convex average through one such matrix can emphasize only one relational pattern at a time. Multiple heads let the layer attend to several relations (syntactic head, coreferent antecedent, adjacent token) in parallel before \(W_O\) recombines them.
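The claim that splitting into heads leaves the parameter count essentially unchanged can be checked with simple arithmetic. The sketch below counts weight-matrix entries only (biases are ignored, which is an assumption of this illustration) using the BERT-base dimensions quoted above.

```python
d_model, h = 768, 12            # BERT-base values quoted in the text
d_k = d_v = d_model // h        # 64 per head

# h narrow heads: W_Q, W_K, W_V for each head, plus the output projection W_O.
multi_head = h * (2 * d_model * d_k + d_model * d_v) + (h * d_v) * d_model

# One full-width head (d_k = d_v = d_model) followed by the same-sized W_O.
single_head = 3 * d_model * d_model + d_model * d_model

print(d_k, multi_head, single_head)   # 64 2359296 2359296
```

Both counts equal \(4\, d_{\text{model}}^2\): the heads partition the model width rather than adding to it.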
39.2.2.2 The full encoder block
Each BERT layer wraps multi-head attention and a position-wise feed-forward network in residual connections with layer normalization. The sublayer computation is
\[
\begin{aligned}
Z &= \operatorname{LayerNorm}\big(X + \operatorname{MultiHead}(X)\big), \\
X' &= \operatorname{LayerNorm}\big(Z + \operatorname{FFN}(Z)\big), \\
\operatorname{FFN}(z) &= \operatorname{GELU}(z W_1 + b_1)\, W_2 + b_2 ,
\end{aligned}
\tag{39.7}\]
where the feed-forward network is applied to each position independently with \(W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}\), \(W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}\), and bias vectors \(b_1, b_2\) (BERT-base uses \(d_{\text{ff}} = 4 d_{\text{model}} = 3072\)). The residual connections \(X + (\cdot)\) are what make a stack of 12 to 24 such blocks trainable: the identity path gives gradients a direct route back to early layers, so the gradient of the loss with respect to an early hidden state always retains an additive contribution that does not pass through the attention or feed-forward weights, mitigating vanishing gradients.
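To make the wrap-in-residual-then-normalize pattern concrete, here is a minimal sketch for a single position's vector. It assumes a layer normalization without learned gain and bias, an exact (erf-based) GELU, and small made-up weights; the function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    """Exact GELU via the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: GELU(x W1 + b1) W2 + b2."""
    hidden = [gelu(sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def residual_sublayer(x, sublayer):
    """LayerNorm(x + sublayer(x)): the residual wrapper used around each sublayer."""
    return layer_norm([xi + fi for xi, fi in zip(x, sublayer(x))])

# Toy sizes and weights (illustrative only): d_model = 4, d_ff = 16.
d_model, d_ff = 4, 16
W1 = [[0.1 * (i + j) for j in range(d_ff)] for i in range(d_model)]    # d_model x d_ff
b1 = [0.0] * d_ff
W2 = [[0.05 * (i - j) for j in range(d_model)] for i in range(d_ff)]   # d_ff x d_model
b2 = [0.0] * d_model

x = [1.0, -2.0, 0.5, 3.0]
out = residual_sublayer(x, lambda v: ffn(v, W1, b1, W2, b2))
identity_only = residual_sublayer(x, lambda v: [0.0] * len(v))
print([round(v, 3) for v in out])
print([round(v, 3) for v in identity_only])
```

If the sublayer contributes nothing, the block reduces to a normalization of its input, which is the identity path the residual connection preserves.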
39.2.2.3 Computational complexity
Self-attention’s cost is dominated by forming \(QK^\top\) and applying \(A\) to \(V\), each \(O(n^2 d)\) in time, plus the \(O(n^2)\) memory needed to store the attention matrix. The feed-forward sublayer costs \(O(n\, d\, d_{\text{ff}})\). Per layer the total is
\[
O\big(n^2 d + n\, d\, d_{\text{ff}}\big).
\tag{39.8}\]
The quadratic dependence on sequence length \(n\) is the defining scaling property (and limitation) of the Transformer: it is the reason BERT caps inputs at 512 tokens and the motivation for the many “efficient attention” variants (Longformer, BigBird, Performer) that replace the dense \(n \times n\) matrix with sparse or low-rank approximations.
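A quick calculation shows where the quadratic term starts to dominate. The sketch compares only the two leading terms of Equation 39.8 with constants dropped, using the BERT-base widths from the text; the sequence lengths are arbitrary illustrations.

```python
d, d_ff = 768, 3072   # BERT-base widths quoted in the text

def per_layer_cost(n):
    """Leading-order operation counts for one layer (constants omitted)."""
    attention = n * n * d          # the n^2 d term
    feed_forward = n * d * d_ff    # the n d d_ff term
    return attention, feed_forward

for n in (128, 512, 3072, 8192):
    att, ff = per_layer_cost(n)
    print(f"n={n:5d}  attention / feed-forward = {att / ff:.2f}")
```

The ratio is simply \(n / d_{\text{ff}}\): at 512 tokens the feed-forward term still dominates, and the attention term overtakes it only once \(n\) exceeds \(d_{\text{ff}}\), after which cost grows quadratically.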
Note
You do not have to memorize these mechanics to use BERT effectively. What matters for practice is the consequence: every token’s output vector is a context-aware summary that has already had the chance to look at every other token in the input.
39.3 Pretraining Objectives
An architecture alone learns nothing; the network only becomes useful after it is trained. BERT is pretrained on a large unlabeled corpus (originally English Wikipedia and the BooksCorpus) using two self-supervised objectives (Chapter 49). “Self-supervised” means the labels come from the text itself, so no manual annotation is required.
Why this matters
Labeled text is expensive, but raw text is essentially free and almost unlimited. Self-supervision lets BERT learn from billions of words without anyone hand-labeling a single one, which is precisely why the pretrained model carries so much general linguistic knowledge.
39.3.1 Masked Language Modeling (MLM)
A random subset (roughly 15%) of input tokens is replaced with a special [MASK] token, and the model is trained to predict the original tokens from context. Because it must use both left and right context to fill in the blanks, MLM is what gives BERT its deep bidirectional understanding. This is analogous to a cloze (“fill-in-the-blank”) test.
39.3.1.1 The MLM objective, precisely
Let \(x = (x_1,\dots,x_n)\) be the token sequence and let \(\mathcal{M} \subset \{1,\dots,n\}\) be the randomly chosen set of masked positions (\(|\mathcal{M}| \approx 0.15\, n\)). Write \(x_{\mathcal{M}}\) for the masked tokens and \(x_{\setminus \mathcal{M}}\) for the corrupted context the model sees. The encoder maps the (corrupted) input to final hidden states \(H = (h_1,\dots,h_n)\) with \(h_i \in \mathbb{R}^{d_{\text{model}}}\), and a prediction head turns each masked position’s hidden state into a distribution over the vocabulary \(\mathcal{V}\) by a softmax,
\[
p\big(x_i = v \mid x_{\setminus \mathcal{M}}\big)
= \Big[\operatorname{softmax}\big(W_{\text{emb}}\, g(h_i)\big)\Big]_v ,
\qquad v \in \mathcal{V},\; i \in \mathcal{M},
\tag{39.9}\]
where \(g\) is a small transform (a dense layer plus GELU plus LayerNorm) and \(W_{\text{emb}} \in \mathbb{R}^{|\mathcal{V}| \times d_{\text{model}}}\) is tied to the input embedding matrix. Training minimizes the average negative log-likelihood over masked positions, which is precisely the cross-entropy between the one-hot truth and the predicted distribution,
\[
\mathcal{L}_{\text{MLM}}
= -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log p\big(x_i \mid x_{\setminus \mathcal{M}}\big) .
\tag{39.10}\]
This is the key formal distinction from a left-to-right language model. A standard autoregressive model factorizes \(p(x) = \prod_{i} p(x_i \mid x_{<i})\) and conditions only on the left context. MLM instead estimates the conditionals \(p(x_i \mid x_{\setminus \mathcal{M}})\) given context on both sides, which is what licenses the term “bidirectional.” The price is that MLM does not define a proper joint distribution over \(x\) (the conditionals need not be compatible with any single joint), so BERT is a representation learner and a denoiser rather than a generative model you can sample from.
The 80/10/10 masking rule
A subtlety: the [MASK] token appears during pretraining but never at fine-tuning time, which would create a train/test mismatch. BERT’s remedy is that of the 15% selected positions, 80% are replaced with [MASK], 10% are replaced with a random vocabulary token, and 10% are left unchanged. The loss in Equation 39.10 is still computed at all selected positions. Leaving some tokens unchanged forces the model to build a useful representation for every position (it cannot tell which unchanged tokens it will be scored on), and the random replacements make the model robust rather than reliant on the literal [MASK] symbol.
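The corruption procedure can be sketched as follows. This is a simplified illustration, not the original preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the seed are assumptions, and details such as excluding special tokens from selection are left out.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where no loss is taken."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # position not selected for prediction
        labels[i] = tok                       # the loss covers every selected position
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

rng = random.Random(0)
vocab = [f"tok{i}" for i in range(50)]        # hypothetical toy vocabulary
tokens = [rng.choice(vocab) for _ in range(10000)]
corrupted, labels = mask_tokens(tokens, vocab, rng)

selected = [i for i, lab in enumerate(labels) if lab is not None]
masked = sum(corrupted[i] == MASK for i in selected)
print(f"selected: {len(selected) / len(tokens):.3f}, "
      f"of which [MASK]: {masked / len(selected):.3f}")
```

The labels keep the original token at every selected position, including those left unchanged, which is what lets the loss be computed at all of them.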
39.3.2 Next Sentence Prediction (NSP)
BERT is also given pairs of sentences and trained to predict whether the second sentence actually followed the first in the original text or was a random distractor. NSP was intended to help with tasks that involve relationships between sentences (e.g., question answering, natural-language inference). Later work (notably RoBERTa) found NSP contributes little, and many successors drop it. It is still worth knowing as part of the original recipe.
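Formally, NSP is a binary classification on the [CLS] hidden state \(h_{\texttt{[CLS]}}\), with \(y = 1\) if the second segment is the true continuation and \(y = 0\) if it was sampled at random. Writing \(w\) and \(b\) for the weights and bias of the classification head and \(\sigma\) for the logistic function, the loss is the binary cross-entropy
\[
\hat p = \sigma\big(w^\top h_{\texttt{[CLS]}} + b\big),
\qquad
\mathcal{L}_{\text{NSP}} = -\big[\, y \log \hat p + (1 - y) \log (1 - \hat p) \,\big],
\tag{39.11}\]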
and BERT is pretrained on the sum \(\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}\). The reason NSP turned out to be weak is that distinguishing a random sentence from the true next sentence is dominated by topic mismatch, a signal already abundant in MLM, so NSP adds little beyond what masked prediction teaches. RoBERTa drops it; ALBERT replaces it with sentence-order prediction (predicting whether two consecutive segments have been swapped), a harder task that removes the topic shortcut and is more useful.
39.4 Tokenization and Special Tokens
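BERT does not operate on whole words. It uses WordPiece tokenization, which splits text into subword units drawn from a fixed vocabulary. Words missing from the vocabulary are broken into known pieces (for illustration, “playing” \(\rightarrow\) “play” + “##ing”, where “##” marks a piece that continues a word). This keeps the vocabulary manageable while all but eliminating out-of-vocabulary tokens: only a word that cannot be assembled from known pieces falls back to the special [UNK] token.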
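Splitting a word into pieces is commonly done with a greedy longest-match-first scan. The sketch below illustrates that idea on a hypothetical six-entry vocabulary; real vocabularies hold tens of thousands of pieces, and the function name `wordpiece` is an assumption of this example rather than a library call.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece       # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1                       # shrink the candidate from the right
        if match is None:
            return [unk]                   # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}   # hypothetical
print(wordpiece("playing", toy_vocab))      # ['play', '##ing']
print(wordpiece("unplayable", toy_vocab))   # ['un', '##play', '##able']
print(wordpiece("xyz", toy_vocab))          # ['[UNK]']
```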
Beyond the ordinary subword tokens, two special tokens do bookkeeping that you will see referenced constantly in code and tutorials:
[CLS] is prepended to every input. Its final-layer hidden state is used as an aggregate, sentence-level representation for classification tasks.
[SEP] separates segments (e.g., the two sentences in a pair, or a question and a passage).
Tip
When you see code grab “index 0” of the hidden states (as in the Python example below), that is the [CLS] vector being used as a single fixed-length summary of the whole input. It is the most common way to reduce a variable-length sentence to one feature vector.
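The layout of a two-segment input can be written out explicitly. This sketch assembles the token list by hand to show where [CLS] and [SEP] go; the segment ids (0 for the first segment, 1 for the second) are an addition not discussed above, and `build_pair_input` is a hypothetical helper, since in practice the tokenizer does this for you.

```python
def build_pair_input(tokens_a, tokens_b):
    """Lay out [CLS] A [SEP] B [SEP] together with per-token segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers B and the last [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

question = ["where", "is", "the", "bank", "?"]
passage = ["the", "bank", "is", "on", "main", "street", "."]
tokens, segment_ids = build_pair_input(question, passage)

print(tokens)
print(segment_ids)
print("index 0 is", tokens[0])
```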
39.5 Pretrain then Fine-Tune
The defining paradigm is pretrain \(\rightarrow\) fine-tune:
Pretrain once, expensively, on a huge corpus (done by the model’s authors; you download the result).
Fine-tune cheaply on your labeled task by adding a small task-specific head (often a single linear layer on top of the [CLS] representation) and training the whole network for a few epochs on your data.
39.5.0.1 The fine-tuning objective
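For a sequence-classification task with \(C\) classes and labeled data \(\{(x^{(t)}, y^{(t)})\}_{t=1}^{N}\), fine-tuning adds a single linear head \(W \in \mathbb{R}^{C \times d_{\text{model}}}\) on the [CLS] representation and minimizes the cross-entropy, regularized by a weight-decay penalty \(\Omega\) of strength \(\lambda \ge 0\),
\[
\min_{\theta,\, W}\;
-\frac{1}{N} \sum_{t=1}^{N} \log \Big[\operatorname{softmax}\big(W\, h_{\texttt{[CLS]}}(x^{(t)};\theta)\big)\Big]_{y^{(t)}}
\;+\; \lambda\, \Omega(\theta, W),
\tag{39.12}\]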
where \(h_{\texttt{[CLS]}}(x;\theta)\) now depends on the full encoder parameters \(\theta\), which are themselves updated (not frozen). The crucial point is the starting condition: rather than initializing \(\theta\) at random, fine-tuning initializes it at the pretrained weights \(\theta_{\text{pre}}\) and runs only a few epochs of gradient descent. The optimization therefore stays in the basin around \(\theta_{\text{pre}}\), which already encodes general linguistic structure, and the task head \(W\) only has to learn a linear read-out of that structure.
39.5.0.2 Why transfer works: a bias-variance view
The transfer benefit has a clean statistical reading. Training a Transformer from scratch on a few thousand labeled examples is a high-variance estimation problem: the hypothesis class (hundreds of millions of parameters) is enormous relative to the sample, so the estimator overfits badly. Pretraining acts as an extremely informative, data-driven prior. Initializing at \(\theta_{\text{pre}}\) and taking a small number of steps constrains the effective hypothesis class to a neighborhood of weights that already produce linguistically sensible representations, collapsing the variance at the cost of a small bias (the pretrained features may not be perfectly aligned with the target task). On modest labeled datasets this trade is overwhelmingly favorable, which is exactly the regime where fine-tuned BERT dominates models trained from scratch. As labeled data grows, the variance penalty of training from scratch shrinks and the advantage of transfer narrows, the familiar shape of a learning curve crossover.
This transfer-learning recipe (Chapter 54) is why BERT is so practical: the heavy lifting of learning language is amortized across everyone who reuses the pretrained weights. Fine-tuning typically needs only modest amounts of labeled data and compute.[2]
When to use this
Reach for the pretrain-then-fine-tune recipe when you have a text-prediction task and only a modest labeled dataset (say, a few hundred to a few tens of thousands of examples). If you have so little labeled data that even fine-tuning overfits, fall back to using BERT purely as a frozen feature extractor and train a simple model on top.
39.6 Downstream Tasks and Why BERT Transfers Well
Once the network is fine-tuned, the same backbone supports several families of prediction tasks; the difference is mainly which hidden states the task head reads and what it predicts from them.
Sequence classification (sentiment, topic, spam): use the [CLS] vector with a classification head.
Token classification / Named-Entity Recognition (NER): attach a classifier to each token’s representation to label spans (person, location, organization).
Question answering (e.g., SQuAD-style extractive QA): predict the start and end positions of the answer span within a passage.
For extractive question answering the head structure is worth making explicit, since it is less obvious than the classification case. Given the token hidden states \(h_1,\dots,h_n\) of the (question, passage) pair, two learned vectors \(s, e \in \mathbb{R}^{d_{\text{model}}}\) define independent softmax distributions over positions for the answer’s start and end,
\[
p_{\text{start}}(i) = \frac{\exp\!\big(s^\top h_i\big)}{\sum_{\ell=1}^{n} \exp\!\big(s^\top h_\ell\big)},
\qquad
p_{\text{end}}(j) = \frac{\exp\!\big(e^\top h_j\big)}{\sum_{\ell=1}^{n} \exp\!\big(e^\top h_\ell\big)},
\tag{39.13}\]
trained by summing the cross-entropies of the true start and end indices. At inference the predicted span \((\hat\imath, \hat\jmath)\) maximizes \(s^\top h_i + e^\top h_j\) subject to \(i \le j\), computable in \(O(n)\) by a single left-to-right scan that keeps a running prefix maximum of \(s^\top h_i\) (i.e., for each end \(j\), the best start is \(\max_{i\le j} s^\top h_i\)).
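The linear-time span search can be sketched directly from that description. The scores below are made-up stand-ins for \(s^\top h_i\) and \(e^\top h_j\), and `best_span` is a hypothetical helper; practical systems usually add constraints such as a maximum span length, which this sketch omits.

```python
def best_span(start_scores, end_scores):
    """Return (i, j, score) maximizing start_scores[i] + end_scores[j] with i <= j."""
    best_start = 0
    best = (0, 0, start_scores[0] + end_scores[0])
    for j in range(len(end_scores)):
        # Running prefix maximum: the best start among positions 0..j.
        if start_scores[j] > start_scores[best_start]:
            best_start = j
        score = start_scores[best_start] + end_scores[j]
        if score > best[2]:
            best = (best_start, j, score)
    return best

start_scores = [0.1, 2.0, 0.3, -1.0, 0.5]   # illustrative values of s^T h_i
end_scores = [1.5, -0.5, 0.2, 1.8, 0.0]     # illustrative values of e^T h_j

i_hat, j_hat, score = best_span(start_scores, end_scores)
print(i_hat, j_hat)   # 1 3
```

Each position is visited once, so the search is \(O(n)\), compared with \(O(n^2)\) for checking every valid pair.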
BERT transfers well because pretraining forces it to encode broad, reusable linguistic structure (syntax, semantics, some world knowledge) into its hidden states. A downstream task then mostly needs to learn how to read off the relevant parts of that representation rather than learning language from scratch.
39.6.0.1 Failure modes
BERT is not a universal solution, and knowing where it breaks is part of using it responsibly.
Sequence length. The \(O(n^2)\) attention cost and the learned positional embeddings cap the input at 512 tokens. Longer documents must be truncated or chunked, which loses long-range dependencies; for genuinely long inputs use a long-context variant rather than forcing BERT.
Fine-tuning instability. On small datasets, fine-tuning all parameters can be unstable across random seeds, with a non-trivial fraction of runs failing to beat the majority-class baseline. The usual culprits are too high a learning rate and too few warmup steps; remedies are lower learning rates, more warmup, and averaging or selecting over several seeds.
Domain shift. Pretraining on Wikipedia and books leaves BERT weak on text whose vocabulary and style differ sharply (clinical notes, legal contracts, social-media slang), where WordPiece shatters domain terms into many subwords. Domain-adaptive continued pretraining or a domain-specific variant fixes this.
Distribution and spurious cues. Like any empirical-risk minimizer, a fine-tuned BERT will exploit dataset artifacts (annotation shortcuts, lexical overlap heuristics) and can fail under distribution shift even when in-distribution accuracy is high.
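For the sequence-length limit, one common workaround is to split a long document into overlapping windows that each fit the 512-token cap. The sketch below is one way to do it under stated assumptions: the stride of 128 is an arbitrary choice, `chunk_token_ids` is a hypothetical helper, and the default ids 101 and 102 are the [CLS] and [SEP] ids in the `bert-base-uncased` vocabulary (other checkpoints may differ).

```python
def chunk_token_ids(token_ids, max_len=512, stride=128, cls_id=101, sep_id=102):
    """Split a long id sequence into overlapping windows that each fit max_len."""
    body = max_len - 2            # room left after adding [CLS] and [SEP]
    step = body - stride          # consecutive windows share `stride` tokens
    if step <= 0:
        raise ValueError("stride must be smaller than max_len - 2")
    chunks, start = [], 0
    while True:
        window = token_ids[start:start + body]
        chunks.append([cls_id] + window + [sep_id])
        if start + body >= len(token_ids):
            break                 # this window reached the end of the document
        start += step
    return chunks

document_ids = list(range(1000, 2200))     # stand-in for 1,200 real token ids
chunks = chunk_token_ids(document_ids)
print(len(chunks), [len(c) for c in chunks])   # 3 [512, 512, 438]
```

The overlap means no token is dropped and text near a window boundary still appears with context on both sides in at least one chunk, but dependencies longer than a window remain out of reach.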
39.7 Practical Use in R and Python
With the concepts in place, here is how you actually call BERT. In Python, the standard interface is the Hugging Face transformers library, which provides pretrained weights and tokenizers for BERT and its many variants. In R, the text package wraps Hugging Face models (via reticulate, which bridges R and Python) so you can obtain embeddings or predictions without leaving R.
Warning
The chunks below are shown with eval = FALSE because they require a Python backend (transformers and PyTorch) and download large model weights on first use. To run them yourself, install the backend once with text::textrpp_install() in R, or pip install transformers torch for the Python version, and expect the first call to fetch several hundred megabytes.
A minimal R sketch using the text package to obtain contextual embeddings:
```r
# install.packages("text")
library(text)

# One-time setup: installs the Python backend (transformers, torch) in a
# managed conda environment via reticulate.
textrpp_install()
textrpp_initialize()

sentences <- c(
  "The river bank was flooded after the storm.",
  "She deposited her paycheck at the bank."
)

# Contextual embeddings from a pretrained BERT model.
embeddings <- textEmbed(
  texts = sentences,
  model = "bert-base-uncased"
)

# `embeddings$texts` holds per-sentence numeric features you can feed
# into any downstream model (e.g., glmnet, ranger, xgboost).
str(embeddings$texts)
```
The equivalent in Python, for comparison:
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("She deposited her paycheck at the bank.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Last hidden state: (batch, tokens, hidden_size).
# The [CLS] vector (index 0) is a common sentence-level feature.
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
For an end-to-end classification task, you would instead load AutoModelForSequenceClassification, attach your labels, and fine-tune with the Hugging Face Trainer API or PyTorch directly.
39.7.1 Choosing Hyperparameters and Diagnostics
The original BERT paper recommends a deliberately narrow search space for fine-tuning, and it remains a good default.
Learning rate. Use a small value, typically in \(\{2,3,5\}\times 10^{-5}\) with the AdamW optimizer. Fine-tuning is sensitive here: rates above \(\sim 10^{-4}\) frequently destabilize training because they can push \(\theta\) out of the pretrained basin within a handful of updates. Pair the rate with a linear warmup over the first 6 to 10 percent of steps followed by linear decay; the warmup is what prevents the early, high-variance gradients from corrupting the pretrained weights.
Epochs. Two to four epochs are usually enough. More tends to overfit because the model starts from an already-competent initialization, so the marginal labeled signal is small.
Batch size. Sizes of 16 or 32 are standard; larger batches usually call for a correspondingly larger learning rate, which is worth re-tuning rather than scaling blindly.
Frozen versus fine-tuned. If labeled data is very scarce or compute is tight, freeze \(\theta\) and train only the head (equivalently, use BERT as the feature extractor of Equation 39.12 with \(\theta\) held at \(\theta_{\text{pre}}\)). This trades a little accuracy for stability and speed.
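The warmup-then-decay schedule from the learning-rate recommendation is easy to write down. This is a minimal sketch with assumed defaults (a peak rate of \(2 \times 10^{-5}\) and a 10 percent warmup); `linear_warmup_decay` is a hypothetical helper, not a library function.

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

total_steps = 1000
schedule = [linear_warmup_decay(s, total_steps) for s in range(total_steps + 1)]
for s in (0, 50, 100, 550, 1000):
    print(f"step {s:4d}: lr = {schedule[s]:.2e}")
```

The rate starts at zero, peaks once the warmup ends, and returns to zero at the final step, so the earliest and noisiest updates are also the smallest.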
A practical note on pooling. The [CLS] vector is the conventional sentence summary, but for the frozen-feature use case mean-pooling the token hidden states (averaging \(h_1,\dots,h_n\) over non-padding positions) is often a stronger sentence representation than [CLS], because the raw pretrained [CLS] vector was optimized for NSP rather than for general similarity. When fine-tuning end to end the distinction largely washes out, since the head and encoder co-adapt.
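Mean-pooling over non-padding positions amounts to a masked average. The sketch below uses plain lists and a hypothetical helper `mean_pool`; the hidden states and mask are made-up values, with the mask following the usual convention of 1 for real tokens and 0 for padding.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors over the positions whose mask entry is 1."""
    dim = len(hidden_states[0])
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

# Three positions with 2-dimensional hidden states; the last one is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(hidden_states, attention_mask)
print(sentence_vector)   # [2.0, 3.0]
```

Without the mask, the padding vector would be averaged in and distort the sentence representation, more so for short sentences in a padded batch.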
For diagnostics, watch the gap between training and validation loss across the few epochs (overfitting appears quickly at these dataset sizes), and because fine-tuning can be seed-sensitive on small data, run several seeds and report the median rather than the best.
39.8 Variants
The original BERT was a starting point, not a final word. Researchers quickly found that the same architecture could be trained better, made smaller, or specialized to a domain, and the result is a large family of models you will encounter in practice. The most common ones are these.
RoBERTa removes NSP, trains longer on more data with larger batches, and generally outperforms the original BERT.
DistilBERT is a smaller, faster model distilled from BERT, retaining most of the accuracy at a fraction of the size, useful when latency or memory is tight.
Domain-specific BERTs retrain or continue pretraining on specialized corpora, e.g., BioBERT (biomedical), SciBERT (scientific text), FinBERT (finance), and ClinicalBERT (clinical notes).
For most prediction problems, a sensible default is to start with a small pretrained model (e.g., DistilBERT) for fast iteration, then scale up to a larger or domain-specific model only if accuracy demands it.
39.9 Summary
BERT turns text into context-aware numeric features by stacking Transformer encoder layers whose self-attention lets every token attend to every other token. It learns this skill once, through self-supervised pretraining (masked language modeling, and originally next-sentence prediction) on a large unlabeled corpus. You then reuse it in one of two ways: as a frozen feature extractor whose [CLS] vector or token embeddings feed an ordinary downstream model, or by fine-tuning the whole network with a small task head when you have labeled data. The payoff, and the reason BERT earned a place alongside the other predictive methods in this book, is transfer: the expensive part of understanding language is done for you, and your job is reduced to teaching the model how to read off the answer to your particular question.
[1] Natural language processing is the branch of machine learning concerned with getting computers to work with human language: classifying it, extracting information from it, translating it, or generating it.
[2] The original pretraining ran for days on many specialized accelerators. Fine-tuning a task-specific head, by contrast, is often a matter of minutes to a few hours on a single GPU, and sometimes feasible on a CPU for small datasets.
# BERT {#sec-bert}```{r}#| include: falsesource("_common.R")```Most of the methods in this book expect their inputs as numbers: a row ofpredictors, a design matrix, a feature vector. Text does not arrive in that form.A product review, a clinical note, or a news headline is just a string ofcharacters, and before any classifier or regressor can use it we need a way toturn that string into numbers that capture what it *means*. This chapter is aboutone of the most effective tools for doing exactly that.BERT (Bidirectional Encoder Representations from Transformers) is a family ofpretrained language models that produce *contextual* representations of text.Introduced by Google in 2018, BERT became one of the foundational tools formodern natural language processing (NLP)^[Natural language processing is thebranch of machine learning concerned with getting computers to work with humanlanguage: classifying it, extracting information from it, translating it, orgenerating it.] and remains a workhorse for prediction tasks where the input istext: document classification, sentiment analysis, named-entity recognition, andquestion answering.For a prediction-focused course, the key idea is this: BERT lets you turn rawtext into high-quality numeric features (embeddings) that you can feed into adownstream classifier or regressor, *or* fine-tune the whole network end-to-endfor your specific task. Either way, you inherit knowledge learned from a massivetext corpus without having to train a language model yourself.::: {.callout-important title="Key idea"}BERT is a reusable "text-to-numbers" engine. 
Someone else paidthe enormous cost of teaching it language; you reuse the result, either asoff-the-shelf features or as a starting point you nudge toward your own task.:::By the end of this chapter you will understand what makes BERT's representations"contextual," the core mechanics of the Transformer encoder it is built from, howit is trained before you ever see it, and how to put it to work from both R andPython. We begin with the idea that motivated the whole approach: the differencebetween a fixed word vector and one that pays attention to context.## From Static to Contextual EmbeddingsEarlier word-embedding methods such as word2vec and GloVe map each wordtype to a single fixed vector (@sec-embeddings-vector-search). The word "bank" gets *one* vector regardless ofwhether the sentence is about a *river bank* or a *savings bank*. These arestatic embeddings: useful, but blind to context.BERT instead produces contextual embeddings. The vector it assigns to a tokendepends on the entire surrounding sentence, so "bank" in "river bank" and "bank"in "deposit money in the bank" receive different representations. This sensitivityto context is what makes BERT dramatically more powerful than static embeddings ontasks that hinge on meaning in context.The word *bidirectional* in the name is central. Earlier contextual models (e.g.,left-to-right language models) read text in one direction. BERT conditions eachtoken's representation on words to both its left and its rightsimultaneously, giving a fuller picture of context.::: {.callout-tip title="Intuition"}To guess the missing word in "I deposited my check at the\_\_\_," reading only left-to-right ("I deposited my check at the") alreadypoints strongly toward "bank." But for "the \_\_\_ overflowed after three daysof rain," you need the words *after* the blank to tell that this is a river,not a vault. 
Seeing both sides at once is what lets BERT resolve meaning that a
one-directional reader would miss.
:::

Now that we know *what* BERT produces, let us look at the machinery that produces
it.

## The Transformer Encoder

BERT is built entirely from the encoder half of the Transformer architecture
(@sec-transformers). You do not need the full mathematical machinery to use BERT,
but the following intuitions are worth carrying.

### Self-Attention

The core operation is self-attention. For each token, the model computes how
much it should "attend to" every other token in the sequence, then forms a
weighted combination of their representations. Concretely, each token is projected
into a query ($Q$), a key ($K$), and a value ($V$) vector, and
attention weights are computed as a scaled dot product:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

where $d_k$ is the dimension of the key vectors and the $\sqrt{d_k}$ term keeps the
dot products from growing too large. The softmax turns the similarities between a
query and all keys into a probability distribution; the output is the
correspondingly weighted average of the values. The practical effect: a token can
pull in information from any other token in the sentence, no matter how far away,
in a single step.

::: {.callout-tip title="Intuition"}
A useful mental model is a small lookup or search. Each token
issues a query ("which other words should I be paying attention to?"), every
token advertises a key ("here is what I am about"), and the dot product between
query and key measures how well they match. The softmax converts those match
scores into weights that sum to one, and the token then collects a weighted mix
of the values. The word "it" can thereby reach back and grab information from
the noun it refers to, even if that noun sat ten words earlier.
:::

#### Formulation

Fix notation.
Let the input sequence have $n$ tokens, each embedded as a row
vector in $\mathbb{R}^{d_{\text{model}}}$, and collect them as the rows of
$X \in \mathbb{R}^{n \times d_{\text{model}}}$. A single attention head is
defined by three learned projection matrices
$W_Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$,
$W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, and
$W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, which produce the query, key,
and value matrices

$$
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V .
$$ {#eq-BERT-qkv}

The head output is then

$$
\text{Attention}(Q,K,V)
= \underbrace{\operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)}_{A \,\in\, \mathbb{R}^{n\times n}} V ,
$$ {#eq-BERT-attn}

where the softmax is applied independently to each row, so each row of the
attention matrix $A$ is a probability vector: $A_{ij} \ge 0$ and
$\sum_{j=1}^{n} A_{ij} = 1$. Writing this row by row makes the operation
transparent. With $q_i = x_i W_Q$ the query for token $i$, $k_j = x_j W_K$ the
key for token $j$, and $v_j = x_j W_V$ its value (treating $q_i$ and $k_j$ as
column vectors inside the inner product $q_i^\top k_j$), the weight token $i$
places on token $j$ is

$$
A_{ij} = \frac{\exp\!\big(q_i^\top k_j / \sqrt{d_k}\big)}
              {\sum_{\ell=1}^{n} \exp\!\big(q_i^\top k_\ell / \sqrt{d_k}\big)},
\qquad
\text{output}_i = \sum_{j=1}^{n} A_{ij}\, v_j .
$$ {#eq-BERT-attn-row}

So token $i$'s new representation is a convex combination of all value vectors,
with the weights determined by query-key compatibility. Because every $A_{ij}$ is
strictly positive, information can flow between any pair of positions in a single
layer; this is the precise sense in which attention has a "receptive field"
spanning the whole sequence, in contrast to the local receptive field of a
convolution or the sequentially decaying memory of a recurrent network.

#### Why divide by $\sqrt{d_k}$

The scaling factor is not cosmetic; it controls the variance of the logits and
keeps the softmax in its well-conditioned regime.
Suppose the entries of $q_i$
and $k_j$ are independent, mean-zero, unit-variance random variables (a
reasonable model at initialization). The unscaled logit is
$q_i^\top k_j = \sum_{m=1}^{d_k} q_{im} k_{jm}$. Each summand has mean
$\mathbb{E}[q_{im} k_{jm}] = 0$ and variance
$\operatorname{Var}(q_{im}k_{jm}) = \mathbb{E}[q_{im}^2]\,\mathbb{E}[k_{jm}^2] = 1$
by independence, so

$$
\mathbb{E}\big[q_i^\top k_j\big] = 0,
\qquad
\operatorname{Var}\big(q_i^\top k_j\big) = \sum_{m=1}^{d_k} \operatorname{Var}(q_{im}k_{jm}) = d_k .
$$ {#eq-BERT-logit-var}

The logits therefore have standard deviation $\sqrt{d_k}$, which grows with the
head dimension. Dividing by $\sqrt{d_k}$ rescales them back to unit variance.

This matters because the softmax saturates: if one logit is much larger than the
rest, the output approaches a one-hot vector and the Jacobian of the softmax
collapses toward zero. Concretely, for $p = \operatorname{softmax}(z)$,

$$
\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j).
$$ {#eq-BERT-softmax-jac}

Row $i$ of this Jacobian vanishes as $p_i \to 0$ or $p_i \to 1$, so the whole
matrix collapses when the softmax output is nearly one-hot. Letting the logits scale
as $\sqrt{d_k}$ would push the softmax into these flat regions and starve the
upstream projections of gradient.
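To make the saturation argument concrete, the following Python sketch (standard
library only) evaluates the softmax Jacobian of @eq-BERT-softmax-jac for one set
of logits at two scales: once with roughly unit spread, and once multiplied by
$\sqrt{d_k}$ as they would be without the scaling factor. The helper names
`softmax`, `softmax_jacobian`, and `max_abs_entry`, and the example logits
themselves, are illustrative assumptions rather than part of any library.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Jacobian J[i][j] = p_i * (delta_ij - p_j) for p = softmax(z)."""
    p = softmax(z)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def max_abs_entry(J):
    """Largest absolute entry of a matrix given as a list of rows."""
    return max(abs(v) for row in J for v in row)

# Hypothetical logits with roughly unit spread (the "scaled" regime).
d_k = 64
scaled = [0.5, -0.3, 1.0, 0.1]
# Without the 1/sqrt(d_k) factor the same logits are sqrt(d_k) = 8 times larger.
unscaled = [v * math.sqrt(d_k) for v in scaled]

grad_scaled = max_abs_entry(softmax_jacobian(scaled))
grad_unscaled = max_abs_entry(softmax_jacobian(unscaled))
print(grad_scaled, grad_unscaled)
```

For these particular logits the largest Jacobian entry in the unscaled case is
more than ten times smaller than in the scaled case, and the gap widens quickly as
$d_k$ grows.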
The $1/\sqrt{d_k}$ factor is exactly the
normalization that holds the logit variance at $O(1)$ regardless of $d_k$.

We can confirm @eq-BERT-logit-var, that the variance of the unscaled dot
product equals $d_k$ and the scaled version has unit variance, with a quick
simulation in base R.

```{r}
set.seed(1)
dot_var <- function(d_k, reps = 20000) {
  vals <- replicate(reps, {
    q <- rnorm(d_k); k <- rnorm(d_k)   # unit-variance, independent
    sum(q * k)                         # unscaled logit q^T k
  })
  c(unscaled = var(vals), scaled = var(vals / sqrt(d_k)))
}
sapply(c(8, 64, 512), dot_var)   # columns: d_k = 8, 64, 512
```

The first row tracks $d_k$ almost exactly, while the second row stays near one,
which is the whole point of the scaling.

### Multi-Head Attention and Positional Encodings

A single round of attention is powerful but limited: it produces one pattern of
"who attends to whom." Two refinements address that limitation and a structural
blind spot of attention.

- Multi-head attention. Rather than computing attention once, the model runs
  several attention "heads" in parallel, each with its own learned projections.
  Different heads can specialize, e.g., one tracking syntactic dependencies,
  another tracking coreference. Their outputs are concatenated and recombined.
- Positional encodings. Self-attention is permutation-equivariant: shuffling the
  input tokens merely shuffles the outputs in the same way, so on its own it has
  no notion of word order. BERT adds (learned) positional embeddings to the token
  embeddings so the model knows which token is first, second, and so on.

Stacking many such attention-plus-feed-forward layers (12 layers in BERT-base, 24
in BERT-large) yields a deep network whose hidden states are the contextual
representations we use downstream.

#### Multi-head attention, formally

With $h$ heads, each head $r = 1,\dots,h$ has its own projections
$W_Q^{(r)}, W_K^{(r)}, W_V^{(r)}$ and computes
$\text{head}_r = \text{Attention}(XW_Q^{(r)}, XW_K^{(r)}, XW_V^{(r)})
\in \mathbb{R}^{n \times d_v}$.
The heads are concatenated along the feature axis
and mixed by an output projection
$W_O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$,

$$
\text{MultiHead}(X) = \big[\text{head}_1 \;\Vert\; \cdots \;\Vert\; \text{head}_h\big]\, W_O .
$$ {#eq-BERT-multihead}

In BERT the per-head dimensions are tied to the model width by
$d_k = d_v = d_{\text{model}}/h$ (for BERT-base, $d_{\text{model}}=768$, $h=12$,
so $d_k = 64$). This keeps the total computation and parameter count of
multi-head attention essentially equal to that of a single full-width head while
allowing $h$ distinct attention patterns. The reason a single head is
insufficient is structural: each head produces a single $n \times n$ stochastic
matrix $A$, and a convex average through one such matrix can emphasize only one
relational pattern at a time. Multiple heads let the layer attend to several
relations (syntactic head, coreferent antecedent, adjacent token) in parallel
before $W_O$ recombines them.

#### The full encoder block

Each BERT layer wraps multi-head attention and a position-wise feed-forward
network in residual connections with layer normalization. The sublayer
computation is

$$
\begin{aligned}
Z &= \text{LayerNorm}\big(X + \text{MultiHead}(X)\big), \\
\text{FFN}(z) &= \text{GELU}(z W_1 + b_1)\,W_2 + b_2, \\
\text{Output} &= \text{LayerNorm}\big(Z + \text{FFN}(Z)\big),
\end{aligned}
$$ {#eq-BERT-block}

where the feed-forward network is applied to each position independently with
$W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and
$W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ (BERT-base uses
$d_{\text{ff}} = 4 d_{\text{model}} = 3072$).
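As a quick arithmetic check on these shapes, the sketch below counts the
weight-matrix entries in one encoder block using only the dimensions stated in
the text ($d_{\text{model}} = 768$, $h = 12$, $d_{\text{ff}} = 3072$). It
deliberately ignores bias vectors and LayerNorm parameters, and the function name
`block_weight_count` is a hypothetical helper introduced for illustration.

```python
def block_weight_count(d_model, h, d_ff):
    """Count weight-matrix entries in one encoder block.

    Biases and LayerNorm parameters are omitted on purpose.
    """
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_v = d_model // h                        # per-head width d_model / h
    qkv = h * (2 * d_model * d_k + d_model * d_v)   # W_Q, W_K, W_V for every head
    w_o = (h * d_v) * d_model                       # output projection W_O
    ffn = d_model * d_ff + d_ff * d_model           # W_1 and W_2
    return {
        "d_k": d_k,
        "attention": qkv + w_o,
        "ffn": ffn,
        "total": qkv + w_o + ffn,
    }

counts = block_weight_count(d_model=768, h=12, d_ff=3072)
print(counts)
```

With these dimensions the feed-forward sublayer holds twice as many weights as
the attention sublayer, a reminder that much of each block's capacity sits in the
position-wise network rather than in attention itself.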
The residual connections
$X + (\cdot)$ are what make a stack of 12 to 24 such blocks trainable: the
identity path gives gradients a direct route back to early layers, so the
gradient of the loss with respect to an early hidden state retains an
additive contribution from the skip path (rescaled by the layer normalization)
instead of passing only through the sublayers, which mitigates vanishing gradients.

#### Computational complexity

Self-attention's cost is dominated by forming $QK^\top$ and applying $A$ to $V$,
each $O(n^2 d)$ in time (writing $d$ for $d_{\text{model}}$), plus the $O(n^2)$
memory needed to store the attention
matrix. The feed-forward sublayer costs $O(n\, d\, d_{\text{ff}})$. Per layer the
total is

$$
O\big(n^2 d + n\, d\, d_{\text{ff}}\big).
$$ {#eq-BERT-complexity}

The quadratic dependence on sequence length $n$ is the defining scaling property
(and limitation) of the Transformer: it is the reason BERT caps inputs at 512
tokens and the motivation for the many "efficient attention" variants
(Longformer, BigBird, Performer) that replace the dense $n \times n$ matrix with
sparse or low-rank approximations.

::: {.callout-note}
You do not have to memorize these mechanics to use BERT effectively.
What matters for practice is the consequence: every token's output vector is a
context-aware summary that has already had the chance to look at every other
token in the input.
:::

## Pretraining Objectives

An architecture alone learns nothing; the network only becomes useful after it is
trained. BERT is pretrained on a large unlabeled corpus (originally English
Wikipedia and the BooksCorpus) using two self-supervised objectives
(@sec-self-supervised-learning). "Self-supervised" means the labels come from the
text itself, so no manual annotation is required.

::: {.callout-important title="Why this matters"}
Labeled text is expensive, but raw text is essentially
free and almost unlimited.
Self-supervision lets BERT learn from billions of
words without anyone hand-labeling a single one, which is precisely why the
pretrained model carries so much general linguistic knowledge.
:::

### Masked Language Modeling (MLM)

A random subset (roughly 15%) of input tokens is replaced with a special `[MASK]`
token, and the model is trained to predict the original tokens from context. Because
it must use both left and right context to fill in the blanks, MLM is what gives
BERT its deep bidirectional understanding. This is analogous to a cloze
("fill-in-the-blank") test.

#### The MLM objective, precisely

Let $x = (x_1,\dots,x_n)$ be the token sequence and let
$\mathcal{M} \subset \{1,\dots,n\}$ be the randomly chosen set of masked
positions ($|\mathcal{M}| \approx 0.15\, n$). Write $x_{\mathcal{M}}$ for the
masked tokens and $x_{\setminus \mathcal{M}}$ for the corrupted context the model
sees. The encoder maps the (corrupted) input to final hidden states
$H = (h_1,\dots,h_n)$ with $h_i \in \mathbb{R}^{d_{\text{model}}}$, and a
prediction head turns each masked position's hidden state into a distribution
over the vocabulary $\mathcal{V}$ by a softmax,

$$
p_\theta\big(x_i \mid x_{\setminus \mathcal{M}}\big)
= \operatorname{softmax}\!\big(W_{\text{emb}}\, g(h_i) + b\big),
\qquad i \in \mathcal{M},
$$ {#eq-BERT-mlm-head}

where $g$ is a small transform (a dense layer plus GELU plus LayerNorm) and
$W_{\text{emb}} \in \mathbb{R}^{|\mathcal{V}| \times d_{\text{model}}}$ is tied to
the input embedding matrix. Training minimizes the average negative
log-likelihood over masked positions, which is precisely the cross-entropy
between the one-hot truth and the predicted distribution,

$$
\mathcal{L}_{\text{MLM}}(\theta)
= -\,\mathbb{E}_{x,\,\mathcal{M}}\left[
\frac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}}
\log p_\theta\big(x_i \mid x_{\setminus \mathcal{M}}\big)
\right].
$$ {#eq-BERT-mlm-loss}

This is the key formal distinction from a left-to-right language model.
A standard
autoregressive model factorizes
$p(x) = \prod_{i} p(x_i \mid x_{<i})$ and conditions only on the left context. MLM
instead estimates the conditionals $p(x_i \mid x_{\setminus \mathcal{M}})$ given
context on both sides, which is what licenses the term "bidirectional." The price
is that MLM does not define a proper joint distribution over $x$ (the conditionals
need not be compatible with any single joint), so BERT is a representation learner
and a denoiser rather than a generative model you can sample from.

::: {.callout-note title="The 80/10/10 masking rule"}
A subtlety: the `[MASK]` token appears during pretraining but never at
fine-tuning time, which would create a train/test mismatch. BERT's remedy is that
of the 15% selected positions, 80% are replaced with `[MASK]`, 10% are replaced
with a random vocabulary token, and 10% are left unchanged. The loss in
@eq-BERT-mlm-loss is still computed at all selected positions. Leaving some
tokens unchanged forces the model to build a useful representation for every
position (it cannot tell which unchanged tokens it will be scored on), and the
random replacements make the model robust rather than reliant on the literal
`[MASK]` symbol.
:::

### Next Sentence Prediction (NSP)

BERT is also given pairs of sentences and trained to predict whether the second
sentence actually followed the first in the original text or was a random
distractor. NSP was intended to help with tasks that involve relationships between
sentences (e.g., question answering, natural-language inference). Later work (notably
RoBERTa) found NSP contributes little, and many successors drop it.
It is still
worth knowing as part of the original recipe.

Formally, NSP is a binary classification on the `[CLS]` hidden state $h_{\texttt{[CLS]}}$,
with $y = 1$ if the second segment is the true continuation and $y = 0$ if it was
sampled at random,

$$
p_\theta(y = 1 \mid x) = \sigma\!\big(w^\top h_{\texttt{[CLS]}} + b\big),
\qquad
\mathcal{L}_{\text{NSP}}(\theta)
= -\,\mathbb{E}\big[y \log p_\theta + (1-y)\log(1-p_\theta)\big],
$$ {#eq-BERT-nsp-loss}

where $p_\theta$ abbreviates $p_\theta(y = 1 \mid x)$, and BERT is pretrained on the sum
$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$. The reason
NSP turned out to be weak is that distinguishing a random sentence from the true
next sentence is dominated by topic mismatch, a signal already abundant in MLM, so
NSP adds little beyond what masked prediction teaches. RoBERTa drops it; ALBERT
replaces it with sentence-order prediction (predicting whether two consecutive
segments have been swapped), a harder task that removes the topic shortcut and is
more useful.

## Tokenization and Special Tokens

BERT does not operate on whole words. It uses WordPiece tokenization, which
splits text into subword units drawn from a fixed vocabulary. Rare or unseen words
are broken into known pieces (e.g., "playing" $\rightarrow$ "play" + "##ing"). This
keeps the vocabulary manageable while almost entirely avoiding out-of-vocabulary
tokens (only input that cannot be built from known pieces falls back to a special
`[UNK]` token).

Beyond the ordinary subword tokens, two special tokens do bookkeeping that you
will see referenced constantly in code and tutorials:

- `[CLS]` is prepended to every input. Its final-layer hidden state is used as
  an aggregate, sentence-level representation for classification tasks.
- `[SEP]` separates segments (e.g., the two sentences in a pair, or a question
  and a passage).

::: {.callout-tip}
When you see code grab "index 0" of the hidden states (as in the
Python example below), that is the `[CLS]` vector being used as a single
fixed-length summary of the whole input.
It is the most common way to reduce a
variable-length sentence to one feature vector.
:::

## Pretrain then Fine-Tune

The defining paradigm is pretrain $\rightarrow$ fine-tune:

1. Pretrain once, expensively, on a huge corpus (done by the model's authors;
   you download the result).
2. Fine-tune cheaply on your labeled task by adding a small task-specific head
   (often a single linear layer on top of the `[CLS]` representation) and
   training the whole network for a few epochs on your data.

#### The fine-tuning objective

For a sequence-classification task with $C$ classes and labeled data
$\{(x^{(t)}, y^{(t)})\}_{t=1}^{N}$, fine-tuning adds a single linear head on the
`[CLS]` representation and minimizes the cross-entropy

$$
p_\phi(y \mid x) = \operatorname{softmax}\!\big(W\, h_{\texttt{[CLS]}}(x;\theta) + b\big),
\qquad
\min_{\theta,\,W,\,b}\;
-\frac{1}{N}\sum_{t=1}^{N}\log p_\phi\big(y^{(t)} \mid x^{(t)}\big),
$$ {#eq-BERT-finetune}

where $\phi = (\theta, W, b)$ collects all trainable parameters and
$h_{\texttt{[CLS]}}(x;\theta)$ now depends on the full encoder parameters
$\theta$, which are themselves updated (not frozen). No explicit penalty appears
in the objective; the regularization comes from the short training run (and, in
practice, the optimizer's weight decay). The crucial point is the
starting condition: rather than initializing $\theta$ at random, fine-tuning
initializes it at the pretrained weights $\theta_{\text{pre}}$ and runs only a few
epochs of gradient descent. The optimization therefore stays in the basin around
$\theta_{\text{pre}}$, which already encodes general linguistic structure, and the
task head $W$ only has to learn a linear read-out of that structure.

#### Why transfer works: a bias-variance view

The transfer benefit has a clean statistical reading. Training a Transformer from
scratch on a few thousand labeled examples is a high-variance estimation problem:
the hypothesis class (hundreds of millions of parameters) is enormous relative to
the sample, so the estimator overfits badly. Pretraining acts as an extremely
informative, data-driven prior.
Initializing at $\theta_{\text{pre}}$ and taking a
small number of steps constrains the effective hypothesis class to a neighborhood
of weights that already produce linguistically sensible representations,
collapsing the variance at the cost of a small bias (the pretrained features may
not be perfectly aligned with the target task). On modest labeled datasets this
trade is overwhelmingly favorable, which is exactly the regime where fine-tuned
BERT dominates models trained from scratch. As labeled data grows, the variance
penalty of training from scratch shrinks and the advantage of transfer narrows,
the familiar shape of a learning curve crossover.

This transfer-learning recipe (@sec-transfer-multitask-learning) is why BERT is so
practical: the heavy lifting of learning language is amortized across everyone who
reuses the pretrained weights.

Fine-tuning typically needs only modest amounts of labeled data and compute.^[The
original pretraining ran for days on many specialized accelerators. Fine-tuning a
task-specific head, by contrast, is often a matter of minutes to a few hours on a
single GPU, and sometimes feasible on a CPU for small datasets.]

::: {.callout-tip title="When to use this"}
Reach for the pretrain-then-fine-tune recipe when you
have a text-prediction task and only a modest labeled dataset (say, a few
hundred to a few tens of thousands of examples).
If you have so little labeled
data that even fine-tuning overfits, fall back to using BERT purely as a frozen
feature extractor and train a simple model on top.
:::

## Downstream Tasks and Why BERT Transfers Well

Once the network is fine-tuned, the same backbone supports several families of
prediction tasks; the difference is mainly which hidden states the task head
reads and what it predicts from them.

- Sequence classification (sentiment, topic, spam): use the `[CLS]` vector with
  a classification head.
- Token classification / Named-Entity Recognition (NER): attach a classifier to
  each token's representation to label spans (person, location, organization).
- Question answering (e.g., SQuAD-style extractive QA): predict the start and
  end positions of the answer span within a passage.

For extractive question answering the head structure is worth making explicit,
since it is less obvious than the classification case. Given the token hidden
states $h_1,\dots,h_n$ of the (question, passage) pair, two learned vectors
$s, e \in \mathbb{R}^{d_{\text{model}}}$ define independent softmax distributions
over positions for the answer's start and end,

$$
P_{\text{start}}(i) = \frac{\exp(s^\top h_i)}{\sum_{j} \exp(s^\top h_j)},
\qquad
P_{\text{end}}(i) = \frac{\exp(e^\top h_i)}{\sum_{j} \exp(e^\top h_j)},
$$ {#eq-BERT-qa}

trained by summing the cross-entropies of the true start and end indices. At
inference the predicted span $(\hat\imath, \hat\jmath)$ maximizes
$s^\top h_i + e^\top h_j$ subject to $i \le j$, computable in $O(n)$ by a single
left-to-right scan that keeps a running prefix maximum of $s^\top h_i$ (i.e., for
each end $j$, the best start is $\max_{i\le j} s^\top h_i$).

BERT transfers well because pretraining forces it to encode broad, reusable
linguistic structure (syntax, semantics, some world knowledge) into its hidden
states.
A downstream task then mostly needs to learn how to *read off* the relevant
parts of that representation rather than learning language from scratch.

#### Failure modes

BERT is not a universal solution, and knowing where it breaks is part of using it
responsibly.

- Sequence length. The $O(n^2)$ attention cost and the learned positional
  embeddings cap the input at 512 tokens. Longer documents must be truncated or
  chunked, which loses long-range dependencies; for genuinely long inputs use a
  long-context variant rather than forcing BERT.
- Fine-tuning instability. On small datasets, fine-tuning all parameters can be
  unstable across random seeds, with a non-trivial fraction of runs failing to
  beat the majority-class baseline. The usual culprits are too high a learning
  rate and too few warmup steps; remedies are lower learning rates, more warmup,
  and averaging or selecting over several seeds.
- Domain shift. Pretraining on Wikipedia and books leaves BERT weak on text
  whose vocabulary and style differ sharply (clinical notes, legal contracts,
  social-media slang), where WordPiece shatters domain terms into many subwords.
  Domain-adaptive continued pretraining or a domain-specific variant fixes this.
- Distribution and spurious cues. Like any empirical-risk minimizer, a
  fine-tuned BERT will exploit dataset artifacts (annotation shortcuts, lexical
  overlap heuristics) and can fail under distribution shift even when
  in-distribution accuracy is high.

## Practical Use in R and Python

With the concepts in place, here is how you actually call BERT. In Python, the
standard interface is Hugging Face
`transformers`^[<https://huggingface.co/docs/transformers>], which provides
pretrained weights and tokenizers for BERT and its many variants.
In R, the
`text`^[<https://www.r-text.org/>] package wraps Hugging Face models (via
reticulate, which bridges R and Python) so you can obtain embeddings or
predictions without leaving R.

::: {.callout-warning}
The chunks below are shown with `eval = FALSE` because they
require a Python backend (transformers and PyTorch) and download large model
weights on first use. To run them yourself, install the backend once with
`text::textrpp_install()` in R, or `pip install transformers torch` for the
Python version, and expect the first call to fetch several hundred megabytes.
:::

A minimal R sketch using the `text` package to obtain contextual embeddings:

```{r, eval = FALSE}
# install.packages("text")
library(text)

# One-time setup: installs the Python backend (transformers, torch) in a
# managed conda environment via reticulate.
textrpp_install()
textrpp_initialize()

sentences <- c(
  "The river bank was flooded after the storm.",
  "She deposited her paycheck at the bank."
)

# Contextual embeddings from a pretrained BERT model.
embeddings <- textEmbed(
  texts = sentences,
  model = "bert-base-uncased"
)

# `embeddings$texts` holds per-sentence numeric features you can feed
# into any downstream model (e.g., glmnet, ranger, xgboost).
str(embeddings$texts)
```

The equivalent in Python, for comparison:

```{python, eval = FALSE}
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("She deposited her paycheck at the bank.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Last hidden state: (batch, tokens, hidden_size).
# The [CLS] vector (index 0) is a common sentence-level feature.
cls_embedding = outputs.last_hidden_state[:, 0, :]
```

For an end-to-end classification task, you would instead load
`AutoModelForSequenceClassification`, attach your labels, and fine-tune with the
Hugging Face `Trainer` API or PyTorch directly.

### Choosing Hyperparameters and Diagnostics

The original BERT paper recommends a deliberately narrow search space for
fine-tuning, and it remains a good default.

- Learning rate. Use a small value, typically in $\{2,3,5\}\times 10^{-5}$ with
  the AdamW optimizer. Fine-tuning is sensitive here: rates above $\sim 10^{-4}$
  frequently destabilize training because they push $\theta$ out of the
  pretrained basin in a single step. Pair the rate with a linear warmup over the
  first 6 to 10 percent of steps followed by linear decay; the warmup is what
  prevents the early, high-variance gradients from corrupting the pretrained
  weights.
- Epochs. Two to four epochs are usually enough. More tends to overfit because
  the model starts from an already-competent initialization, so the marginal
  labeled signal is small.
- Batch size. 16 or 32 are standard; larger batches need a proportionally larger
  learning rate.
- Frozen versus fine-tuned. If labeled data is very scarce or compute is tight,
  freeze $\theta$ and train only the head (equivalently, use BERT as the feature
  extractor of @eq-BERT-finetune with $\theta$ held at $\theta_{\text{pre}}$).
  This trades a little accuracy for stability and speed.

A practical note on pooling. The `[CLS]` vector is the conventional sentence
summary, but for the frozen-feature use case mean-pooling the token hidden states
(averaging $h_1,\dots,h_n$ over non-padding positions) is often a stronger
sentence representation than `[CLS]`, because the raw pretrained `[CLS]` vector was
optimized for NSP rather than for general similarity. When fine-tuning end to end
the distinction largely washes out, since the head and encoder co-adapt.

For diagnostics, watch the gap between training and validation loss across the few
epochs (overfitting appears quickly at these dataset sizes), and because
fine-tuning can be seed-sensitive on small data, run several seeds and report the
median rather than the best.

## Variants

The original BERT was a starting point, not a final word.
Researchers quickly
found that the same architecture could be trained better, made smaller, or
specialized to a domain, and the result is a large family of models you will
encounter in practice. The most common ones are these.

- RoBERTa removes NSP, trains longer on more data with larger batches, and
  generally outperforms the original BERT.
- DistilBERT is a smaller, faster model distilled from BERT, retaining most of
  the accuracy at a fraction of the size, useful when latency or memory is tight.
- Domain-specific BERTs retrain or continue pretraining on specialized corpora,
  e.g., BioBERT (biomedical), SciBERT (scientific text), FinBERT (finance), and
  ClinicalBERT (clinical notes).

For most prediction problems, a sensible default is to start with a small
pretrained model (e.g., DistilBERT) for fast iteration, then scale up to a larger
or domain-specific model only if accuracy demands it.

## Summary

BERT turns text into context-aware numeric features by stacking Transformer
encoder layers whose self-attention lets every token attend to every other token.
It learns this skill once, through self-supervised pretraining (masked language
modeling, and originally next-sentence prediction) on a large unlabeled corpus.
You then reuse it in one of two ways: as a frozen feature extractor whose
`[CLS]` vector or token embeddings feed an ordinary downstream model, or by
fine-tuning the whole network with a small task head when you have labeled data.
The payoff, and the reason BERT earned a place alongside the other predictive
methods in this book, is transfer: the expensive part of understanding language
is done for you, and your job is reduced to teaching the model how to read off the
answer to your particular question.

## Further Reading

- Official implementation and weights: <https://github.com/google-research/bert>
- A gentle tutorial: <https://www.freecodecamp.org/news/google-bert-nlp-machine-learning-tutorial/>
- A visual guide to using BERT for the first time:
  <https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/>
- BERT explained, theory and tutorial:
  <https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/>