Advanced Data Analysis

Nguyen, Mike

78 Learning to Rank

Imagine you type a query into a search box. The system holds millions of documents, and almost all of them are useless to you. Its real job is not to label each document “relevant” or “not relevant” in isolation, but to put the few good ones at the very top of the list, where your eyes actually go. The same problem shows up everywhere: a recommender (see Chapter 89) deciding which ten products to show on a page, an email client deciding which messages float to the top of the inbox, a hospital triage tool deciding which patients to review first. In all of these, the output that matters is an ordering, and the cost of a mistake depends on where in the ordering it happens. Putting the best result in position 1 versus position 2 is a small sin; burying it in position 50 is a large one.

This is the ranking problem, and it is genuinely different from the regression and classification problems that fill most of this book. In ordinary supervised learning we score each example on its own and we are graded on per-example accuracy. In ranking we are graded on the quality of a sorted list, and the grade is dominated by the top of that list. The field that studies how to train models for this objective is called learning to rank (LTR).

Key idea

In ranking, the unit of evaluation is an ordered list (typically the list of documents retrieved for a single query), and errors near the top of the list cost far more than errors near the bottom. Good ranking models optimize for that asymmetry directly.

Three pieces of vocabulary will recur. A query $q$ is the context: a search string, a user, a session. For each query we have a set of candidate items (often called documents) $d_1, \dots, d_n$. Each query-document pair has a relevance label $y$, which may be binary (relevant or not) or graded (for example $0, 1, 2, 3, 4$ from “useless” to “perfect”). The model learns a scoring function $f(\mathbf{x})$ where $\mathbf{x}$ is a feature vector describing one query-document pair, and the final ranking for a query is produced by sorting its candidates by their scores in descending order.

The chapter proceeds as follows. We first make the evaluation problem precise, because in ranking the metric is the problem: precision@k, MAP, MRR, and NDCG. We then survey the three families of training approaches (pointwise, pairwise, listwise) and explain why pairwise and listwise methods exist at all. We work through the core ideas of RankNet and LambdaMART, the two algorithms that shaped modern LTR. Finally we build a runnable base-R demonstration: we implement the metrics from scratch, score several candidate rankings, simulate query-document data, and train a simple pairwise ranker, then compare it against a strong gradient-boosted ranker.

78.1 Why ranking is not just regression

A natural first instinct is to treat ranking as regression: predict each item’s relevance label with squared error, then sort by the prediction. This can work, and it is a legitimate baseline (it is exactly the “pointwise” approach below). But it optimizes the wrong thing in two ways.

First, it spends equal effort everywhere. Squared error cares just as much about getting a document at rank 500 right as one at rank 2, even though no user will ever scroll to rank 500. Second, ranking only depends on the relative order of scores, not their absolute values. A model that predicts relevances $(3.0, 2.0, 1.0)$ and one that predicts $(0.31, 0.30, 0.29)$ produce the identical ranking, yet squared-error training treats them very differently and may waste capacity matching label magnitudes that have no effect on the output.

Intuition

The output of a ranker is invariant to any order-preserving (monotone) transformation of its scores. The loss we train with should respect that invariance: it should reward getting pairs in the right order and getting the top of the list right, not getting numeric labels exactly.

This mismatch between an easy-to-optimize surrogate loss (squared error, log loss) and the true objective (a list-quality metric like NDCG) is the central tension in learning to rank, and every method below is a different way of managing it.

78.1.1 A formal problem statement

Make the setup precise so the later derivations have something to bite on. A query and its candidate set form one training instance. Let $q$ index queries drawn i.i.d. from a query distribution, and for query $q$ let $\mathbf{X}^{(q)} = (\mathbf{x}^{(q)}_1, \dots, \mathbf{x}^{(q)}_{n_q})$ be the feature vectors of its $n_q$ candidates, with relevance labels $\mathbf{y}^{(q)} = (y^{(q)}_1, \dots, y^{(q)}_{n_q})$. A scoring function $f \in \mathcal{F}$, $f : \mathbb{R}^p \to \mathbb{R}$, induces a ranking $\pi_f^{(q)}$ by sorting candidates on $s_i = f(\mathbf{x}^{(q)}_i)$ in descending order (ties broken arbitrarily). Let $M$ be a per-query list metric (NDCG@k, AP, etc.), written $M(\pi, \mathbf{y}) \in [0,1]$ with $M$ larger for better orderings. The learning goal is to maximize the expected metric, equivalently to minimize the ranking risk

\[ R(f) = \mathbb{E}_q\!\left[\, 1 - M\big(\pi_f^{(q)}, \mathbf{y}^{(q)}\big) \right], \tag{78.1}\]

estimated on a sample by the empirical average over the $|Q|$ training queries. The obstruction named above is now explicit: $\pi_f^{(q)}$ depends on $f$ only through the sort order of the scores, so $M(\pi_f^{(q)}, \mathbf{y}^{(q)})$ is piecewise constant in the parameters of $f$. Its gradient is zero wherever it is defined and undefined on the measure-zero set where two scores cross. Direct gradient optimization of Equation 78.1 is therefore impossible, and the three families correspond to three classes of surrogate $\tilde{R}(f) \ge \text{const} \cdot R(f)$ that are differentiable in $f$.

Note

The sort step is the source of all the difficulty. Everything that distinguishes ranking from regression flows from the fact that the map from scores to the evaluated quantity factors through a discontinuous $\operatorname{argsort}$. Surrogates differ in how they smooth or sidestep that discontinuity: pairwise losses replace the global sort with many local two-item comparisons, listwise losses replace it with a differentiable distribution over permutations, and LambdaRank keeps the discontinuous metric but supplies hand-built gradients.

78.2 Ranking metrics

We need to measure list quality before we can train for it. Throughout, fix a single query with $n$ candidate items. The model produces a ranking, which we read as an ordered list of positions $1, 2, \dots, n$ from top to bottom. Let $\text{rel}_i$ denote the relevance label of the item placed at position $i$. We will average any per-query metric over all queries in the evaluation set at the end.

78.2.1 Precision@k and recall@k

For binary relevance (each item is relevant, $\text{rel}_i \in \{0,1\}$), the simplest useful metric looks only at the top $k$ positions:

\[ \text{Precision@}k = \frac{1}{k} \sum_{i=1}^{k} \text{rel}_i , \]

the fraction of the top $k$ items that are actually relevant. Recall@k instead divides by the total number of relevant items for the query, measuring how many of the good items we managed to surface in the top $k$. Precision@k is intuitive and matches how people consume a results page, but it has a blind spot: it ignores order within the top $k$. A relevant item at position 1 and at position $k$ count the same. The next metrics fix that.

78.2.2 Mean reciprocal rank (MRR)

Some tasks have essentially one right answer per query: a “navigational” search (“facebook login”), or a question with a single correct answer. There, what matters is how far down you had to look to find the first relevant item. Let $\text{rank}_q$ be the position of the first relevant item for query $q$. The reciprocal rank is $1/\text{rank}_q$, and averaging over the set of queries $Q$ gives

\[ \text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q} . \]

If the first relevant result is at position 1 the query scores $1$; at position 2 it scores $0.5$; at position 10 only $0.1$. MRR rewards getting a good answer to the top and ignores everything after the first hit, which is exactly right when there is one answer and wrong when there are many.

78.2.3 Average precision and MAP

When a query has several relevant items, we want credit for surfacing all of them and for ordering them well. Average precision (AP) does this by averaging the precision computed at each position where a relevant item appears:

\[ \text{AP}_q = \frac{1}{R_q} \sum_{k=1}^{n} \text{Precision@}k \cdot \text{rel}_k , \]

where $R_q$ is the total number of relevant items for the query and the factor $\text{rel}_k$ means we only add a term at positions that hold a relevant item. Mean average precision (MAP) is the average of $\text{AP}_q$ over all queries:

\[ \text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}_q . \]

Intuition

AP is high when relevant items are clustered near the top. Each relevant item is scored by the precision of the list up to and including its own position, so a relevant item that appears after a lot of junk earns a low precision and drags the average down.

78.2.4 Normalized discounted cumulative gain (NDCG)

Precision-based metrics assume binary relevance. Real labels are often graded: a result can be “perfect,” “good,” “fair,” or “useless.” NDCG is the workhorse metric for graded relevance, and it makes two ideas explicit, a gain that grows with relevance and a discount that shrinks with position.

Discounted cumulative gain at cutoff $k$ is

\[ \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)} . \]

The numerator $2^{\text{rel}_i} - 1$ is the gain: exponentiating the relevance means a single “perfect” document is worth much more than several “fair” ones, which matches how users feel about quality. The denominator $\log_2(i+1)$ is the discount: it grows with position, so the same document contributes less the further down it sits. Position 1 has discount $\log_2 2 = 1$ (no discount), position 2 has $\log_2 3 \approx 1.585$, and so on.

DCG is not comparable across queries because queries differ in how many relevant items they have and how relevant those items are. We normalize by the best possible DCG for that query, the ideal DCG (IDCG@k), obtained by sorting the items by true relevance descending:

\[ \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k} . \]

NDCG@k lies in $[0, 1]$, equals $1$ for the ideal ordering, and averages cleanly over queries. It is the metric most LTR systems report and, indirectly, the metric LambdaMART tries to optimize.

Note

There are two slightly different DCG conventions in the literature. The one above (with the $2^{\text{rel}}-1$ gain) is the form used by most modern toolkits and competitions; an older form uses $\text{rel}_i$ directly in the numerator. They agree for binary relevance up to scaling. We use the exponential-gain form throughout.

Table 78.1 summarizes when each metric is the right choice.

Table 78.1: Ranking metrics and the situations they fit. Precision@k is the simplest; NDCG@k is the most general because it handles graded relevance and discounts by position.

Metric	Relevance type	Rewards order in top-k	Best used when
Precision@k	Binary	No	You only care how many of the top k are good
MRR	Binary	First hit only	There is one right answer per query (navigational search, QA)
MAP	Binary	Yes	Several relevant items per query, all worth finding
NDCG@k	Graded or binary	Yes	Graded relevance and position both matter (web search, recsys)

Table 78.1 shows that the metrics form a rough ladder of generality, from precision@k (binary, order-insensitive within the cutoff) up to NDCG@k (graded relevance with an explicit position discount).

78.3 Three families of training approaches

Given a metric, how do we train a scoring function $f(\mathbf{x})$ to do well on it? The metrics above are step functions of the ranking: they change only when two items swap order, so they are flat almost everywhere and have zero gradient. We cannot optimize them directly with gradient descent. The three families of LTR methods are three answers to “what differentiable surrogate do we optimize instead?”

Pointwise. Treat each query-document pair independently and predict its relevance label with ordinary regression or classification. The loss is a sum over individual pairs, for example squared error $\sum (f(\mathbf{x}) - y)^2$. Then sort by predicted score. This is the simplest approach and reuses every regression or classification tool you already know, but as argued above it optimizes label magnitudes rather than order, and it weights all positions equally.

Pairwise. Look at pairs of documents within the same query and learn to get their relative order right. If document $a$ is more relevant than document $b$, we want $f(\mathbf{x}_a) > f(\mathbf{x}_b)$. The loss penalizes pairs that are ordered incorrectly. This turns ranking into a binary classification problem over pairs (“is $a$ better than $b$?”) and aligns much better with the goal, since ranking is about relative order. RankNet, RankSVM, and the original gradient-boosted rankers are pairwise.

Listwise. Define a loss over the entire ranked list for a query at once, ideally a smooth surrogate for the actual metric (NDCG). This is the most direct but also the most complex. ListNet, ListMLE, and the NDCG-aware variant of LambdaMART live here.

When to use this

Start pointwise for a baseline and when labels are reliable cardinal values. Move to pairwise (the usual sweet spot) when you have many items per query and care about order. Reach for listwise when you are squeezing out the last few points of NDCG and can afford the complexity. In practice, gradient-boosted pairwise/listwise rankers (LambdaMART) are the default production choice for tabular ranking features.

Table 78.2 lays out the tradeoffs.

Table 78.2: The three families of learning-to-rank methods. Pairwise is the common default; listwise is the most direct optimizer of list-level metrics.

Approach	What it models	Training signal	Examples	Tradeoff
Pointwise	Score of one item	Item label (regression/classification)	Linear/GBM regression on labels	Simple, but ignores order and position
Pairwise	Relative order of an item pair	Which item in a pair ranks higher	RankNet, RankSVM, GBM pairwise	Aligns with order, scales as pairs (can be many)
Listwise	Quality of the whole list	List-level metric surrogate	ListNet, ListMLE, LambdaMART	Most aligned with metric, most complex

Table 78.2 makes clear that moving down the table buys closer alignment with the true metric at the cost of more machinery.

78.4 RankNet: a pairwise model with a probabilistic loss

RankNet (C. Burges et al. 2005) is the cleanest way to see the pairwise idea, and it underlies the later Lambda methods. The model is any differentiable scoring function $f$ (originally a neural network, see Chapter 15, but it can be linear). For two documents $i$ and $j$ in the same query with scores $s_i = f(\mathbf{x}_i)$ and $s_j = f(\mathbf{x}_j)$, define the modeled probability that $i$ should rank above $j$ using a logistic (sigmoid) function of the score difference:

\[ P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}} , \]

where $\sigma > 0$ is a shape parameter. The larger the score gap in the right direction, the closer $P_{ij}$ is to $1$. Let $\bar{P}_{ij}$ be the target: $1$ if $i$ is truly more relevant than $j$, $0$ if less, $0.5$ if tied. RankNet uses the cross-entropy (log loss) between target and model:

\[ C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij}) . \]

For a pair where $i$ truly outranks $j$ ($\bar{P}_{ij} = 1$), this simplifies to $C_{ij} = \log\!\big(1 + e^{-\sigma(s_i - s_j)}\big)$, a smooth, convex penalty that is near zero when $s_i \gg s_j$ and grows linearly when the order is badly wrong. This is exactly a logistic loss applied to score differences rather than raw scores. We sum $C_{ij}$ over all pairs whose order we know and minimize by gradient descent.

Note

Because the loss depends only on $s_i - s_j$, RankNet learns relative order and is naturally invariant to adding a constant to every score. That is the order-invariance property we wanted from a ranking loss.

The gradient of $C_{ij}$ with respect to the score difference has the clean form

\[ \frac{\partial C_{ij}}{\partial s_i} = -\frac{\partial C_{ij}}{\partial s_j} = \sigma \left( \frac{1}{2}(1 - S_{ij}) - \frac{1}{1 + e^{\sigma(s_i - s_j)}} \right), \]

where $S_{ij} \in \{-1, 0, +1\}$ encodes whether $i$ is less, equally, or more relevant than $j$. Define $\lambda_{ij}$ as this gradient. A key observation is that each document’s total gradient is just the sum of $\lambda_{ij}$ over all pairs it participates in. These per-pair forces $\lambda_{ij}$ can be read as little “tugs”: document $i$ is pulled up and $j$ is pulled down whenever $i$ should outrank $j$ but does not by enough.

78.4.1 Deriving the RankNet gradient

It is worth doing the calculus once, because the result is the engine of every Lambda method. Adopt the convention $S_{ij} \in \{-1,0,+1\}$ and write the target as $\bar{P}_{ij} = \tfrac{1}{2}(1 + S_{ij})$, so $S_{ij}=+1$ gives $\bar{P}_{ij}=1$, a tie gives $\tfrac12$, and $S_{ij}=-1$ gives $0$. Let $s_{ij} = s_i - s_j$. The model probability is $P_{ij} = \sigma(\sigma_0 s_{ij})$ where $\sigma(z) = 1/(1+e^{-z})$ and $\sigma_0>0$ is the shape parameter (called $\sigma$ in the text above). Substituting $\bar{P}_{ij}$ into the cross-entropy and using $1 - \sigma(z) = \sigma(-z)$,

\[ C_{ij} = -\tfrac{1}{2}(1+S_{ij}) \log \sigma(\sigma_0 s_{ij}) - \tfrac{1}{2}(1 - S_{ij}) \log \sigma(-\sigma_0 s_{ij}). \]

Using the textbook identity $\log \sigma(z) = -\log(1 + e^{-z})$ and collecting terms,

\[ C_{ij} = \tfrac{1}{2}(1 - S_{ij})\,\sigma_0 s_{ij} + \log\!\big(1 + e^{-\sigma_0 s_{ij}}\big). \tag{78.2}\]

The first term is linear, the second is the softplus. For the canonical case $S_{ij}=+1$ the linear term vanishes and Equation 78.2 reduces to $\log(1 + e^{-\sigma_0 s_{ij}})$, matching the simplified form stated earlier. Now differentiate. Since $\frac{d}{dz}\log(1+e^{-z}) = -\sigma(-z) = -(1-\sigma(z))$ and $\partial s_{ij}/\partial s_i = 1$,

\[ \frac{\partial C_{ij}}{\partial s_i} = \sigma_0\!\left[\tfrac{1}{2}(1 - S_{ij}) - \big(1 - \sigma(\sigma_0 s_{ij})\big)\right] = \sigma_0\!\left[\tfrac{1}{2}(1 - S_{ij}) - \frac{1}{1 + e^{\sigma_0 s_{ij}}}\right], \tag{78.3}\]

which is exactly $\lambda_{ij}$ as stated, and $\partial C_{ij}/\partial s_j = -\partial C_{ij}/\partial s_i$ because $C_{ij}$ depends on $s_i, s_j$ only through $s_{ij}$. The antisymmetry $\lambda_{ji} = -\lambda_{ij}$ is what lets us aggregate: with $\mathcal{P}$ the set of pairs $(i,j)$ with $i$ more relevant than $j$ (so $S_{ij}=+1$ throughout, the standard bookkeeping), the gradient on a single document accumulates the tugs from every pair it touches,

\[ \frac{\partial C}{\partial s_i} = \sum_{j : (i,j) \in \mathcal{P}} \lambda_{ij} \;-\; \sum_{j : (j,i) \in \mathcal{P}} \lambda_{ji} \;\equiv\; \lambda_i, \tag{78.4}\]

and the chain rule gives $\partial C / \partial \mathbf{w} = \sum_i \lambda_i \, \partial s_i / \partial \mathbf{w}$. For the linear scorer $s_i = \mathbf{w}^\top \mathbf{x}_i$ this is $\sum_{(i,j) \in \mathcal{P}} \lambda_{ij} (\mathbf{x}_i - \mathbf{x}_j)$, the update implemented in the demonstration. The reduction from $O(n^2)$ per-pair gradients to $n$ per-document gradients via Equation 78.4 is the factorization trick that makes the method affordable inside a tree learner.

Convexity

For a linear scorer, Equation 78.2 is convex in $\mathbf{w}$: it is a sum of an affine term and the softplus $\log(1+e^{-\sigma_0 \mathbf{w}^\top \mathbf{d}_{ij}})$ composed with the affine map $\mathbf{w} \mapsto \mathbf{w}^\top \mathbf{d}_{ij}$, and softplus is convex. The pairwise RankNet problem over a linear class therefore has no spurious local minima, which is why the simple gradient descent in the demonstration converges reliably. Convexity is lost once $f$ is a neural network or a tree ensemble.

78.5 From RankNet to LambdaMART

RankNet minimizes pairwise error, but pairwise error is not NDCG. Swapping two documents near the bottom of the list barely changes NDCG, while swapping two near the top changes it a lot. Plain RankNet treats both swaps as equally important. LambdaRank (C. J. C. Burges 2010) fixes this with a remarkably simple trick: scale each pairwise tug $\lambda_{ij}$ by the change in NDCG that would result from swapping documents $i$ and $j$:

\[ \lambda_{ij}^{\text{Lambda}} = \lambda_{ij} \cdot \left| \Delta \text{NDCG}_{ij} \right| . \]

The model never differentiates NDCG (it cannot, NDCG is flat almost everywhere). Instead it defines the gradients directly, weighting each pairwise force by how much fixing that pair would improve the metric. Pairs that matter for NDCG pull harder. Empirically these “lambda” gradients optimize NDCG very well even though no closed-form loss is being differentiated.

LambdaMART (C. J. C. Burges 2010) is LambdaRank’s gradient signal plugged into gradient-boosted regression trees (MART = Multiple Additive Regression Trees, the boosting machine of Chapter 11) instead of a neural net. At each boosting round it computes the lambda gradients for every document, fits a regression tree to them, and adds that tree to the ensemble. LambdaMART has won ranking competitions and remains a default in production search and recommendation. The practical recipe is short:

Compute scores $s_i$ for all documents in a query with the current ensemble.
For each pair, compute $\lambda_{ij}$ (the RankNet tug) times $|\Delta\text{NDCG}_{ij}|$.
Sum the lambdas per document to get a target gradient.
Fit one regression tree to those targets and add it to the ensemble.

Key idea

LambdaMART never optimizes a differentiable copy of NDCG. It hand-defines gradients that point in the direction NDCG improves, weighting each document-pair by the metric gain from fixing it, and feeds those gradients to gradient boosting.

Both xgboost (objective rank:pairwise, rank:ndcg) and lightgbm (objective lambdarank) implement this family, and they are what you will actually run in practice; for the mechanics of tuning these boosted models, see Chapter 12.

78.5.1 What loss does LambdaRank implicitly minimize?

The Lambda trick looks like a hack: we scale gradients by $|\Delta\text{NDCG}_{ij}|$ without writing down a loss. A natural worry is whether these gradients are even integrable, that is, whether some function $C^{\text{Lambda}}$ has them as its derivatives. They are. With $S_{ij}=+1$ on every pair in $\mathcal{P}$, the LambdaRank gradient on the pair is

\[ \lambda_{ij}^{\text{Lambda}} = -\,\frac{\sigma_0\,|\Delta\text{NDCG}_{ij}|}{1 + e^{\sigma_0 (s_i - s_j)}}, \]

which is exactly $\partial / \partial s_i$ of

\[ C^{\text{Lambda}}_{ij} = |\Delta\text{NDCG}_{ij}| \,\log\!\big(1 + e^{-\sigma_0 (s_i - s_j)}\big), \tag{78.5}\]

provided we treat $|\Delta\text{NDCG}_{ij}|$ as a constant during differentiation. Of course it is not constant, it depends on the current ranking, hence on the scores. Burges showed empirically that the gradient field is conservative to high accuracy, so LambdaRank behaves as if it were descending a well-defined cost close to Equation 78.5, a $\Delta$NDCG-weighted RankNet loss. The weight steepens the logistic penalty precisely on pairs whose swap would most improve NDCG, which is why optimizing the surrogate tracks the metric. Donmez, Svore, and Burges later verified that the LambdaRank update is, to first order, an ascent direction for NDCG at the current scores, giving the method a local-optimality guarantee with respect to the true metric.

78.5.2 RankSVM: the margin formulation of pairwise ranking

RankNet uses a logistic penalty on score differences; RankSVM (Joachims 2002) uses the hinge instead, which connects ranking to the support vector machine of Chapter 19. For a linear scorer and the pair set $\mathcal{P}$, RankSVM solves

\[ \min_{\mathbf{w}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{(i,j)\in\mathcal{P}} \big[\,1 - \mathbf{w}^\top (\mathbf{x}_i - \mathbf{x}_j)\,\big]_+, \tag{78.6}\]

where $[\,z\,]_+ = \max(0,z)$. This is an ordinary soft-margin SVM on the difference vectors $\mathbf{d}_{ij} = \mathbf{x}_i - \mathbf{x}_j$, all with label $+1$: a correctly ordered pair with margin at least $1/\|\mathbf{w}\|$ incurs no loss, an inverted or under-separated pair pays linearly. Kernelizing $\mathbf{d}_{ij}$ buys nonlinear ranking exactly as in classification. RankSVM and RankNet are thus the same pairwise idea under two different convex surrogates of the 0/1 pair-inversion loss (hinge versus logistic), mirroring the SVM-versus-logistic-regression contrast for classification.

78.5.3 Listwise losses: ListNet and the Plackett-Luce model

Listwise methods define a single loss on the whole permutation rather than summing over pairs. The cleanest is ListNet (Cao et al. 2007), built on the Plackett-Luce distribution over permutations. Given scores $s_1,\dots,s_n$, the probability that item $i$ is ranked first is the softmax $\phi(s_i) = e^{s_i} / \sum_{m} e^{s_m}$, and the probability of a full permutation $\pi = (\pi_1, \dots, \pi_n)$ is the product of sequential top-one choices on the shrinking candidate set,

\[ P(\pi \mid s) = \prod_{r=1}^{n} \frac{e^{s_{\pi_r}}}{\sum_{m=r}^{n} e^{s_{\pi_m}}}. \tag{78.7}\]

ListMLE (Xia et al. 2008) takes the negative log of Equation 78.7 for the single ground-truth permutation and minimizes it directly, a clean maximum-likelihood objective. ListNet instead matches the top-one marginal distributions: it minimizes the cross-entropy between the softmax over labels $\phi(y_i)$ and the softmax over scores $\phi(s_i)$,

\[ C^{\text{ListNet}} = -\sum_{i=1}^{n} \frac{e^{y_i}}{\sum_m e^{y_m}} \,\log \frac{e^{s_i}}{\sum_m e^{s_m}}, \tag{78.8}\]

which is differentiable in the scores at $O(n)$ cost per query, far cheaper than the $O(n^2)$ pair enumeration. The gradient is the familiar softmax-cross-entropy form $\partial C^{\text{ListNet}} / \partial s_i = \phi(s_i) - \phi(y_i)$, pushing the score distribution toward the label distribution. The listwise view also clarifies a theoretical point: Xia et al. proved that the ListMLE/Plackett-Luce surrogate is statistically consistent for ranking, meaning its population minimizer recovers the Bayes-optimal ordering, a guarantee that pairwise surrogates do not enjoy in general.

Consistency caveat

Consistency results are reassuring but rest on assumptions (a correctly specified, sufficiently rich $\mathcal{F}$, and in some cases noise conditions) that rarely hold exactly in practice. Empirically, the pairwise-versus-listwise gap is usually small and is dominated by feature quality and label noise, which is why LambdaMART, formally a pairwise method with metric-aware weights, remains the production default despite lacking a clean consistency theorem.

78.6 A runnable demonstration in base R

We now build the ideas from scratch. First we implement the metrics, then we simulate query-document data, train a pointwise baseline and a simple pairwise (RankNet-style) ranker, and compare them. Everything in this section runs in base R plus ggplot2; the gradient-boosted ranker at the end uses xgboost.

78.6.1 Implementing DCG, NDCG, and MAP

The functions below take a vector of true relevance labels ordered by the model’s predicted score (so rel_sorted[1] is the label of the item the model placed first).

Show code

# DCG at cutoff k using the exponential-gain convention:
#   gain_i = 2^rel_i - 1, discount_i = log2(i + 1)
dcg_at_k <- function(rel_sorted, k = length(rel_sorted)) {
  k <- min(k, length(rel_sorted))
  rel <- rel_sorted[seq_len(k)]
  gains <- (2^rel - 1) / log2(seq_len(k) + 1)
  sum(gains)
}

# NDCG = DCG of the model ordering / DCG of the ideal (by-relevance) ordering
ndcg_at_k <- function(rel_sorted, k = length(rel_sorted)) {
  ideal <- sort(rel_sorted, decreasing = TRUE)
  idcg <- dcg_at_k(ideal, k)
  if (idcg == 0) return(0)
  dcg_at_k(rel_sorted, k) / idcg
}

# Average precision for binary relevance, given labels in model order
average_precision <- function(rel_sorted) {
  rel_sorted <- as.integer(rel_sorted > 0)   # binarize
  R <- sum(rel_sorted)
  if (R == 0) return(0)
  hits <- cumsum(rel_sorted)                  # relevant so far at each position
  precision_at_k <- hits / seq_along(rel_sorted)
  sum(precision_at_k * rel_sorted) / R
}

# Reciprocal rank: 1 / position of first relevant item
reciprocal_rank <- function(rel_sorted) {
  hit <- which(rel_sorted > 0)
  if (length(hit) == 0) return(0)
  1 / hit[1]
}

Let us sanity-check the metrics by scoring a few candidate orderings of the same five-item query. The true relevance labels are graded $0$ to $3$. We compare the ideal ordering, a reversed (worst) ordering, and a plausible model ordering.

Show code

# Five items with graded relevance 0..3
ideal_order    <- c(3, 2, 2, 1, 0)   # best documents first
reversed_order <- c(0, 1, 2, 2, 3)   # worst: relevant items buried
model_order    <- c(2, 3, 1, 0, 2)   # a so-so ranking

scores <- data.frame(
  Ordering = c("Ideal", "Reversed", "Model"),
  NDCG = round(c(
    ndcg_at_k(ideal_order),
    ndcg_at_k(reversed_order),
    ndcg_at_k(model_order)
  ), 3),
  `NDCG@3` = round(c(
    ndcg_at_k(ideal_order, 3),
    ndcg_at_k(reversed_order, 3),
    ndcg_at_k(model_order, 3)
  ), 3),
  MAP = round(c(
    average_precision(ideal_order),
    average_precision(reversed_order),
    average_precision(model_order)
  ), 3),
  MRR = round(c(
    reciprocal_rank(ideal_order),
    reciprocal_rank(reversed_order),
    reciprocal_rank(model_order)
  ), 3),
  check.names = FALSE
)
scores
#>   Ordering  NDCG NDCG@3   MAP MRR
#> 1    Ideal 1.000  1.000 1.000 1.0
#> 2 Reversed 0.566  0.205 0.679 0.5
#> 3    Model 0.839  0.762 0.950 1.0

The ideal ordering scores NDCG $= 1$ by construction, the reversed ordering scores lowest, and the model ordering lands in between, confirming the implementations behave sensibly. Table 78.3 presents the same comparison as a formatted exhibit.

Table 78.3: Metric values for three candidate orderings of the same five-item query (graded relevance 0 to 3). The ideal ordering achieves NDCG = 1; the reversed ordering is worst on every metric.

Ordering	NDCG	NDCG@3	MAP	MRR
Ideal	1.000	1.000	1.000	1.0
Reversed	0.566	0.205	0.679	0.5
Model	0.839	0.762	0.950	1.0

Table 78.3 shows the metrics agree on the ranking of the orderings (Ideal > Model > Reversed) while disagreeing on magnitude, which is expected since each metric weights position and grade differently.

78.6.2 Simulating query-document data

We simulate a dataset with several queries, each holding a handful of candidate documents described by two features. The true relevance is a noisy monotone function of a hidden “utility” that depends on the features, so a model that recovers the utility direction will rank well.

Show code

set.seed(2026)

n_queries <- 200
docs_per_query <- 8

make_query <- function(qid) {
  x1 <- rnorm(docs_per_query)                 # e.g. text match score
  x2 <- rnorm(docs_per_query)                 # e.g. popularity score
  # hidden utility: x1 matters twice as much as x2
  utility <- 2 * x1 + 1 * x2 + rnorm(docs_per_query, sd = 0.5)
  # graded relevance 0..3 from utility quantiles within the query
  rel <- cut(utility, breaks = quantile(utility, probs = c(0, .4, .7, .9, 1)),
             labels = FALSE, include.lowest = TRUE) - 1L
  data.frame(qid = qid, x1 = x1, x2 = x2, rel = rel)
}

dat <- do.call(rbind, lapply(seq_len(n_queries), make_query))

# split queries into train and test
train_q <- 1:150
test_q  <- 151:n_queries
train <- dat[dat$qid %in% train_q, ]
test  <- dat[dat$qid %in% test_q, ]

head(train, 8)
#>   qid          x1         x2 rel
#> 1   1  0.52058907  0.1135544   3
#> 2   1 -1.07969076 -0.4737910   0
#> 3   1  0.13923812 -0.4082147   2
#> 4   1 -0.08474878 -0.7304333   1
#> 5   1 -0.66663962 -0.2214366   1
#> 6   1 -2.51608903 -0.2258165   0
#> 7   1 -0.73514680 -2.5468814   0
#> 8   1 -1.02012226  1.3470015   2

A helper evaluates any scoring function by computing mean NDCG and MAP across the test queries.

Show code

# Given a data frame with columns qid, rel and a numeric score vector,
# compute mean NDCG and MAP over queries (sorting each query by score).
evaluate_ranker <- function(df, score, k = docs_per_query) {
  df$score <- score
  qids <- unique(df$qid)
  ndcgs <- numeric(length(qids))
  maps  <- numeric(length(qids))
  for (i in seq_along(qids)) {
    sub <- df[df$qid == qids[i], ]
    ord <- order(sub$score, decreasing = TRUE)   # model ordering
    rel_sorted <- sub$rel[ord]
    ndcgs[i] <- ndcg_at_k(rel_sorted, k)
    maps[i]  <- average_precision(rel_sorted)
  }
  c(NDCG = mean(ndcgs), MAP = mean(maps))
}

78.6.3 A pointwise baseline

The pointwise baseline ignores the ranking structure entirely: fit ordinary least squares predicting rel from the features, then rank by the fitted value.

Show code

pointwise_fit <- lm(rel ~ x1 + x2, data = train)
test_pred_pw  <- predict(pointwise_fit, newdata = test)
pointwise_perf <- evaluate_ranker(test, test_pred_pw)
round(pointwise_perf, 3)
#>  NDCG   MAP 
#> 0.953 0.987

78.6.4 A pairwise (RankNet-style) ranker from scratch

Now the pairwise ranker. We use a linear scoring function $s = \mathbf{w}^\top \mathbf{x}$ and the RankNet log loss over all within-query document pairs, minimized by plain gradient descent. For every ordered pair $(i, j)$ in a query where $i$ is more relevant than $j$, the gradient contribution to $\mathbf{w}$ is the RankNet $\lambda_{ij}$ times the feature difference $(\mathbf{x}_i - \mathbf{x}_j)$.

Show code

# Build all "i more relevant than j" pairs within each training query.
build_pairs <- function(df) {
  feats <- as.matrix(df[, c("x1", "x2")])
  pairs <- list()
  for (q in unique(df$qid)) {
    idx <- which(df$qid == q)
    for (a in idx) for (b in idx) {
      if (df$rel[a] > df$rel[b]) {            # a should outrank b
        pairs[[length(pairs) + 1]] <- feats[a, ] - feats[b, ]
      }
    }
  }
  do.call(rbind, pairs)                       # rows = x_i - x_j for i > j
}

pair_diffs <- build_pairs(train)
cat("Number of training pairs:", nrow(pair_diffs), "\n")
#> Number of training pairs: 3450

# RankNet loss for the "i > j" pairs (target prob = 1):
#   C = sum log(1 + exp(-sigma * w . (x_i - x_j)))
# Gradient wrt w: sum  -sigma * (x_i - x_j) / (1 + exp(sigma * w . diff))
sigma <- 1
w <- c(0, 0)                                  # initial weights for x1, x2
lr <- 0.05
n_iter <- 200
loss_hist <- numeric(n_iter)

for (t in seq_len(n_iter)) {
  margin <- as.numeric(pair_diffs %*% w)      # w . (x_i - x_j) per pair
  loss_hist[t] <- mean(log1p(exp(-sigma * margin)))
  grad <- -sigma * colMeans(pair_diffs / (1 + exp(sigma * margin)))
  w <- w - lr * grad
}

cat("Learned weights (x1, x2):", round(w, 3), "\n")
#> Learned weights (x1, x2): 1.672 0.84
test_pred_pair <- as.numeric(as.matrix(test[, c("x1", "x2")]) %*% w)
pairwise_perf  <- evaluate_ranker(test, test_pred_pair)
round(pairwise_perf, 3)
#>  NDCG   MAP 
#> 0.953 0.987

The learned weights should recover the hidden utility direction (roughly a 2:1 ratio favoring x1), because that is the direction that orders pairs correctly. Figure 78.1 shows the pairwise loss decreasing over gradient-descent iterations alongside the learned weight ratio converging toward the true value of 2.

Show code

# Re-run the descent while recording the weight ratio at each step
w <- c(0, 0); ratio_hist <- numeric(n_iter)
for (t in seq_len(n_iter)) {
  margin <- as.numeric(pair_diffs %*% w)
  grad <- -sigma * colMeans(pair_diffs / (1 + exp(sigma * margin)))
  w <- w - lr * grad
  ratio_hist[t] <- w[1] / w[2]
}

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(loss_hist, type = "l", lwd = 2, col = "steelblue",
     xlab = "Iteration", ylab = "Mean pairwise log loss",
     main = "Pairwise loss")
plot(ratio_hist, type = "l", lwd = 2, col = "darkorange",
     xlab = "Iteration", ylab = "Learned weight ratio (x1 / x2)",
     main = "Weight ratio")
abline(h = 2, lty = 2, col = "grey40")

Figure 78.1: Left: the RankNet pairwise log loss falls smoothly as gradient descent orders more training pairs correctly. Right: the ratio of learned weights (x1 over x2) converges toward the true utility ratio of 2 (dashed line).

Figure 78.1 confirms two things at once: the surrogate loss is being minimized, and minimizing it drives the model toward the feature weighting that produces correct orderings.

We can also verify the analytic RankNet gradient Equation 78.3 against a finite-difference approximation, a cheap check that the calculus above is right and that the descent step uses the correct gradient.

Show code

# Mean RankNet loss over the "i > j" pairs as a function of w
ranknet_loss <- function(w) {
  margin <- as.numeric(pair_diffs %*% w)
  mean(log1p(exp(-sigma * margin)))
}
# Analytic gradient (the one used in the descent loop)
analytic_grad <- function(w) {
  margin <- as.numeric(pair_diffs %*% w)
  -sigma * colMeans(pair_diffs / (1 + exp(sigma * margin)))
}
# Central finite differences
numeric_grad <- function(w, eps = 1e-6) {
  sapply(seq_along(w), function(j) {
    wp <- w; wm <- w; wp[j] <- wp[j] + eps; wm[j] <- wm[j] - eps
    (ranknet_loss(wp) - ranknet_loss(wm)) / (2 * eps)
  })
}
w_chk <- c(0.7, -0.3)
rbind(analytic = analytic_grad(w_chk), numeric = numeric_grad(w_chk))
#>                  x1         x2
#> analytic -0.2432986 -0.3851225
#> numeric  -0.2432986 -0.3851225

The two rows agree to roughly six digits, confirming Equation 78.3.

78.6.5 A gradient-boosted ranker with xgboost

Finally we train a real LambdaMART-style ranker with xgboost, using the rank:ndcg objective. The grouping vector tells xgboost how many rows belong to each query so it only forms pairs within a query.

Show code

library(xgboost)

# xgboost needs rows grouped by query and a group-size vector
train_sorted <- train[order(train$qid), ]
test_sorted  <- test[order(test$qid), ]

dtrain <- xgb.DMatrix(
  data  = as.matrix(train_sorted[, c("x1", "x2")]),
  label = train_sorted$rel
)
setinfo(dtrain, "group", as.numeric(table(train_sorted$qid)))
#> [1] TRUE

xgb_fit <- xgb.train(
  params = list(
    objective = "rank:ndcg",
    eta = 0.1,
    max_depth = 4,
    min_child_weight = 1
  ),
  data = dtrain,
  nrounds = 60,
  verbose = 0
)

test_pred_xgb <- predict(xgb_fit, as.matrix(test_sorted[, c("x1", "x2")]))
xgb_perf <- evaluate_ranker(test_sorted, test_pred_xgb)
round(xgb_perf, 3)
#>  NDCG   MAP 
#> 0.950 0.972

We collect all three rankers into one comparison. Table 78.4 reports mean NDCG and MAP on the held-out queries.

Table 78.4: Held-out ranking quality for three approaches on the simulated query-document data. All three do well here because the true utility is linear; the boosted ranker has the most capacity but the least to gain on this easy, linear signal.

Ranker	NDCG	MAP
Pointwise (OLS)	0.953	0.987
Pairwise (RankNet, linear)	0.953	0.987
Gradient-boosted (xgboost rank:ndcg)	0.950	0.972

Table 78.4 shows that on this deliberately linear problem the simple pairwise ranker is competitive with gradient boosting, a useful reminder that model complexity should match signal complexity. On real data with nonlinear feature interactions, the gradient-boosted ranker typically pulls ahead.

78.7 Theoretical and computational properties

A few properties recur across the methods and are worth stating once.

Computational complexity. A query with $n$ candidates and $R$ relevant items generates $O(nR)$ ordered relevant/non-relevant pairs (up to $O(n^2)$ when grades are dense), so naive pairwise training costs $O(\sum_q n_q^2)$ per epoch, the dominant term for long candidate lists. The factorization in Equation 78.4 keeps the gradient assembly at $O(n^2)$ work for the $\Delta$NDCG terms but reduces the storage and number of tree targets from $O(n^2)$ (one per pair) to $O(n)$ (one summed lambda per document), which is why LambdaMART scales. Listwise softmax losses such as Equation 78.8 cost $O(n)$ per query and avoid pair enumeration entirely. Sorting for NDCG and IDCG is $O(n \log n)$ per query.

The discount makes top-heavy errors cheap to ignore. Because the NDCG discount $1/\log_2(i+1)$ decays, $|\Delta\text{NDCG}_{ij}|$ is tiny when both $i$ and $j$ sit deep in the list. The Lambda weighting therefore concentrates almost all gradient mass on a small number of near-top pairs, which both speeds training (deep pairs can be subsampled with little loss) and explains why LambdaMART optimizes top-$k$ metrics so effectively: it spends its capacity where the metric is sensitive.

Generalization and sample complexity. Generalization in ranking is governed by the number of queries, not the number of documents, since candidates within a query are dependent. Margin-based bounds for RankSVM (and Rademacher-complexity bounds for pairwise classes) scale with $1/\sqrt{|Q|}$ and with the pairwise margin $1/\|\mathbf{w}\|$, not with the raw pair count; adding more documents per query enlarges the training set far less than adding more queries. This is the formal counterpart to the practical rule that label budget is better spent judging more queries than exhaustively judging one.

Failure modes. Three are characteristic. First, pairwise surrogates are position-agnostic by construction (plain RankNet weights a deep swap and a top swap equally), which is the exact deficiency LambdaRank repairs. Second, all of these methods inherit selection bias from logged data: pairs are only formed among items that were shown, so a model can be consistent for the observed conditional distribution yet wrong for the deployment distribution (the basis for inverse-propensity-weighted and unbiased LTR). Third, when relevance grades are noisy, the $2^{\text{rel}}-1$ gain amplifies label error at the top grades, so a single mislabeled “perfect” document can dominate a query’s gradient.

78.8 Practical guidance and pitfalls

A few hard-won lessons separate a working ranking system from a broken one.

Always group by query. The single most common bug is forming document pairs across different queries. A document’s relevance is only meaningful relative to other candidates for the same query; comparing a document from query A against one from query B is nonsense. Every pairwise and listwise method needs a query/group identifier, and metrics must be computed per query then averaged. In xgboost and lightgbm this is the group vector; forget it and the model silently trains on garbage pairs.

Evaluate offline with the metric you care about, online with experiments. Offline NDCG/MAP on logged judgments is necessary but not sufficient. Logged data is biased: users only clicked what was shown, and what was shown was chosen by the old ranker (this is position bias and presentation bias). A model that scores well offline can lose in an A/B test. Treat offline metrics as a filter, not a verdict.

Mind the gain and discount conventions. As noted, two NDCG formulas circulate. If your offline number disagrees with a library’s, check whether both use the $2^{\text{rel}} - 1$ gain and the $\log_2(i+1)$ discount. Mismatched conventions are a frequent source of “my metric does not match the paper.”

Calibrate cutoffs to the surface. A search results page shows ten links; a mobile carousel shows three. Optimize and report NDCG@k for the $k$ that matches where users actually look. Reporting NDCG over all candidates can hide failures right where they hurt.

Warning

Graded-relevance labels are expensive and noisy. Human judges disagree, and “relevance” drifts with intent and time. Before investing in a listwise model, make sure your labels are good enough that the metric differences you are chasing are real and not label noise.

Pairwise count can explode. A query with $n$ candidates generates up to $O(n^2)$ pairs. With long candidate lists this dominates training time. Sampling pairs, capping candidates, or using listwise losses that avoid explicit pair enumeration are the usual remedies.

Features matter more than the loss. In tabular ranking, careful query-document features (match scores, freshness, popularity, personalization signals) typically move NDCG more than the choice between pairwise and listwise. Spend effort there first.

78.9 Further reading

Burges, Shaked, Renshaw, Lazier, Deeds, Hamilton, and Hullender (2005), “Learning to Rank Using Gradient Descent,” introduces RankNet and the pairwise probabilistic loss.
Burges (2010), “From RankNet to LambdaRank to LambdaMART: An Overview,” is the definitive walk-through of the lambda-gradient idea and LambdaMART; readable and self-contained.
Liu (2009), “Learning to Rank for Information Retrieval,” is the standard survey organizing the pointwise, pairwise, and listwise families.
Cao, Qin, Liu, Tsai, and Li (2007), “Learning to Rank: From Pairwise Approach to Listwise Approach,” introduces ListNet and motivates listwise training.
Järvelin and Kekäläinen (2002), “Cumulated Gain-Based Evaluation of IR Techniques,” is the original source for DCG and NDCG.
Manning, Raghavan, and Schütze (2008), Introduction to Information Retrieval, gives the classic treatment of precision, recall, MAP, and MRR with worked examples.
Chen and Guestrin (2016), “XGBoost: A Scalable Tree Boosting System,” documents the gradient-boosted ranking objectives used in the demonstration above.

# Learning to Rank {#sec-learning-to-rank} ```{r} #| include: false source("_common.R") ``` Imagine you type a query into a search box. The system holds millions of documents, and almost all of them are useless to you. Its real job is not to label each document "relevant" or "not relevant" in isolation, but to put the few good ones at the very top of the list, where your eyes actually go. The same problem shows up everywhere: a recommender (see @sec-recommender-systems) deciding which ten products to show on a page, an email client deciding which messages float to the top of the inbox, a hospital triage tool deciding which patients to review first. In all of these, the output that matters is an *ordering*, and the cost of a mistake depends on *where* in the ordering it happens. Putting the best result in position 1 versus position 2 is a small sin; burying it in position 50 is a large one. This is the ranking problem, and it is genuinely different from the regression and classification problems that fill most of this book. In ordinary supervised learning we score each example on its own and we are graded on per-example accuracy. In ranking we are graded on the quality of a *sorted list*, and the grade is dominated by the top of that list. The field that studies how to train models for this objective is called learning to rank (LTR). ::: {.callout-important title="Key idea"} In ranking, the unit of evaluation is an ordered list (typically the list of documents retrieved for a single query), and errors near the top of the list cost far more than errors near the bottom. Good ranking models optimize for that asymmetry directly. ::: Three pieces of vocabulary will recur. A *query* $q$ is the context: a search string, a user, a session. For each query we have a set of *candidate items* (often called documents) $d_1, \dots, d_n$. Each query-document pair has a *relevance label* $y$, which may be binary (relevant or not) or graded (for example $0, 1, 2, 3, 4$ from "useless" to "perfect"). The model learns a scoring function $f(\mathbf{x})$ where $\mathbf{x}$ is a feature vector describing one query-document pair, and the final ranking for a query is produced by sorting its candidates by their scores in descending order. The chapter proceeds as follows. We first make the evaluation problem precise, because in ranking the metric *is* the problem: precision@k, MAP, MRR, and NDCG. We then survey the three families of training approaches (pointwise, pairwise, listwise) and explain why pairwise and listwise methods exist at all. We work through the core ideas of RankNet and LambdaMART, the two algorithms that shaped modern LTR. Finally we build a runnable base-R demonstration: we implement the metrics from scratch, score several candidate rankings, simulate query-document data, and train a simple pairwise ranker, then compare it against a strong gradient-boosted ranker. ## Why ranking is not just regression A natural first instinct is to treat ranking as regression: predict each item's relevance label with squared error, then sort by the prediction. This can work, and it is a legitimate baseline (it is exactly the "pointwise" approach below). But it optimizes the wrong thing in two ways. First, it spends equal effort everywhere. Squared error cares just as much about getting a document at rank 500 right as one at rank 2, even though no user will ever scroll to rank 500. Second, ranking only depends on the *relative order* of scores, not their absolute values. A model that predicts relevances $(3.0, 2.0, 1.0)$ and one that predicts $(0.31, 0.30, 0.29)$ produce the identical ranking, yet squared-error training treats them very differently and may waste capacity matching label magnitudes that have no effect on the output. ::: {.callout-tip title="Intuition"} The output of a ranker is invariant to any order-preserving (monotone) transformation of its scores. The loss we train with should respect that invariance: it should reward getting pairs in the right order and getting the top of the list right, not getting numeric labels exactly. ::: This mismatch between an easy-to-optimize surrogate loss (squared error, log loss) and the true objective (a list-quality metric like NDCG) is the central tension in learning to rank, and every method below is a different way of managing it. ### A formal problem statement Make the setup precise so the later derivations have something to bite on. A query and its candidate set form one training instance. Let $q$ index queries drawn i.i.d. from a query distribution, and for query $q$ let $\mathbf{X}^{(q)} = (\mathbf{x}^{(q)}_1, \dots, \mathbf{x}^{(q)}_{n_q})$ be the feature vectors of its $n_q$ candidates, with relevance labels $\mathbf{y}^{(q)} = (y^{(q)}_1, \dots, y^{(q)}_{n_q})$. A scoring function $f \in \mathcal{F}$, $f : \mathbb{R}^p \to \mathbb{R}$, induces a ranking $\pi_f^{(q)}$ by sorting candidates on $s_i = f(\mathbf{x}^{(q)}_i)$ in descending order (ties broken arbitrarily). Let $M$ be a per-query list metric (NDCG@k, AP, etc.), written $M(\pi, \mathbf{y}) \in [0,1]$ with $M$ larger for better orderings. The learning goal is to maximize the expected metric, equivalently to minimize the ranking risk $$ R(f) = \mathbb{E}_q\!\left[\, 1 - M\big(\pi_f^{(q)}, \mathbf{y}^{(q)}\big) \right], $$ {#eq-learning-to-rank-risk} estimated on a sample by the empirical average over the $|Q|$ training queries. The obstruction named above is now explicit: $\pi_f^{(q)}$ depends on $f$ only through the *sort order* of the scores, so $M(\pi_f^{(q)}, \mathbf{y}^{(q)})$ is piecewise constant in the parameters of $f$. Its gradient is zero wherever it is defined and undefined on the measure-zero set where two scores cross. Direct gradient optimization of @eq-learning-to-rank-risk is therefore impossible, and the three families correspond to three classes of surrogate $\tilde{R}(f) \ge \text{const} \cdot R(f)$ that are differentiable in $f$. ::: {.callout-note} The sort step is the source of all the difficulty. Everything that distinguishes ranking from regression flows from the fact that the map from scores to the evaluated quantity factors through a discontinuous $\operatorname{argsort}$. Surrogates differ in how they smooth or sidestep that discontinuity: pairwise losses replace the global sort with many local two-item comparisons, listwise losses replace it with a differentiable distribution over permutations, and LambdaRank keeps the discontinuous metric but supplies hand-built gradients. ::: ## Ranking metrics We need to measure list quality before we can train for it. Throughout, fix a single query with $n$ candidate items. The model produces a ranking, which we read as an ordered list of positions $1, 2, \dots, n$ from top to bottom. Let $\text{rel}_i$ denote the relevance label of the item placed at position $i$. We will average any per-query metric over all queries in the evaluation set at the end. ### Precision@k and recall@k For binary relevance (each item is relevant, $\text{rel}_i \in \{0,1\}$), the simplest useful metric looks only at the top $k$ positions: $$ \text{Precision@}k = \frac{1}{k} \sum_{i=1}^{k} \text{rel}_i , $$ the fraction of the top $k$ items that are actually relevant. Recall@k instead divides by the total number of relevant items for the query, measuring how many of the good items we managed to surface in the top $k$. Precision@k is intuitive and matches how people consume a results page, but it has a blind spot: it ignores order *within* the top $k$. A relevant item at position 1 and at position $k$ count the same. The next metrics fix that. ### Mean reciprocal rank (MRR) Some tasks have essentially one right answer per query: a "navigational" search ("facebook login"), or a question with a single correct answer. There, what matters is how far down you had to look to find the first relevant item. Let $\text{rank}_q$ be the position of the first relevant item for query $q$. The reciprocal rank is $1/\text{rank}_q$, and averaging over the set of queries $Q$ gives $$ \text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q} . $$ If the first relevant result is at position 1 the query scores $1$; at position 2 it scores $0.5$; at position 10 only $0.1$. MRR rewards getting *a* good answer to the top and ignores everything after the first hit, which is exactly right when there is one answer and wrong when there are many. ### Average precision and MAP When a query has several relevant items, we want credit for surfacing all of them and for ordering them well. Average precision (AP) does this by averaging the precision computed at each position where a relevant item appears: $$ \text{AP}_q = \frac{1}{R_q} \sum_{k=1}^{n} \text{Precision@}k \cdot \text{rel}_k , $$ where $R_q$ is the total number of relevant items for the query and the factor $\text{rel}_k$ means we only add a term at positions that hold a relevant item. Mean average precision (MAP) is the average of $\text{AP}_q$ over all queries: $$ \text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}_q . $$ ::: {.callout-tip title="Intuition"} AP is high when relevant items are clustered near the top. Each relevant item is scored by the precision of the list *up to and including its own position*, so a relevant item that appears after a lot of junk earns a low precision and drags the average down. ::: ### Normalized discounted cumulative gain (NDCG) Precision-based metrics assume binary relevance. Real labels are often graded: a result can be "perfect," "good," "fair," or "useless." NDCG is the workhorse metric for graded relevance, and it makes two ideas explicit, a *gain* that grows with relevance and a *discount* that shrinks with position. Discounted cumulative gain at cutoff $k$ is $$ \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)} . $$ The numerator $2^{\text{rel}_i} - 1$ is the *gain*: exponentiating the relevance means a single "perfect" document is worth much more than several "fair" ones, which matches how users feel about quality. The denominator $\log_2(i+1)$ is the *discount*: it grows with position, so the same document contributes less the further down it sits. Position 1 has discount $\log_2 2 = 1$ (no discount), position 2 has $\log_2 3 \approx 1.585$, and so on. DCG is not comparable across queries because queries differ in how many relevant items they have and how relevant those items are. We normalize by the best possible DCG for that query, the *ideal* DCG (IDCG@k), obtained by sorting the items by true relevance descending: $$ \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k} . $$ NDCG@k lies in $[0, 1]$, equals $1$ for the ideal ordering, and averages cleanly over queries. It is the metric most LTR systems report and, indirectly, the metric LambdaMART tries to optimize. ::: {.callout-note} There are two slightly different DCG conventions in the literature. The one above (with the $2^{\text{rel}}-1$ gain) is the form used by most modern toolkits and competitions; an older form uses $\text{rel}_i$ directly in the numerator. They agree for binary relevance up to scaling. We use the exponential-gain form throughout. ::: @tbl-learning-to-rank-metric-summary summarizes when each metric is the right choice. ```{r tbl-learning-to-rank-metric-summary, echo=FALSE} metric_tab <- data.frame( Metric = c("Precision@k", "MRR", "MAP", "NDCG@k"), `Relevance type` = c("Binary", "Binary", "Binary", "Graded or binary"), `Rewards order in top-k` = c("No", "First hit only", "Yes", "Yes"), `Best used when` = c( "You only care how many of the top k are good", "There is one right answer per query (navigational search, QA)", "Several relevant items per query, all worth finding", "Graded relevance and position both matter (web search, recsys)" ), check.names = FALSE ) knitr::kable( metric_tab, caption = "Ranking metrics and the situations they fit. Precision@k is the simplest; NDCG@k is the most general because it handles graded relevance and discounts by position." ) ``` @tbl-learning-to-rank-metric-summary shows that the metrics form a rough ladder of generality, from precision@k (binary, order-insensitive within the cutoff) up to NDCG@k (graded relevance with an explicit position discount). ## Three families of training approaches Given a metric, how do we train a scoring function $f(\mathbf{x})$ to do well on it? The metrics above are step functions of the ranking: they change only when two items swap order, so they are flat almost everywhere and have zero gradient. We cannot optimize them directly with gradient descent. The three families of LTR methods are three answers to "what differentiable surrogate do we optimize instead?" Pointwise. Treat each query-document pair independently and predict its relevance label with ordinary regression or classification. The loss is a sum over individual pairs, for example squared error $\sum (f(\mathbf{x}) - y)^2$. Then sort by predicted score. This is the simplest approach and reuses every regression or classification tool you already know, but as argued above it optimizes label magnitudes rather than order, and it weights all positions equally. Pairwise. Look at *pairs* of documents within the same query and learn to get their relative order right. If document $a$ is more relevant than document $b$, we want $f(\mathbf{x}_a) > f(\mathbf{x}_b)$. The loss penalizes pairs that are ordered incorrectly. This turns ranking into a binary classification problem over pairs ("is $a$ better than $b$?") and aligns much better with the goal, since ranking *is* about relative order. RankNet, RankSVM, and the original gradient-boosted rankers are pairwise. Listwise. Define a loss over the *entire ranked list* for a query at once, ideally a smooth surrogate for the actual metric (NDCG). This is the most direct but also the most complex. ListNet, ListMLE, and the NDCG-aware variant of LambdaMART live here. ::: {.callout-tip title="When to use this"} Start pointwise for a baseline and when labels are reliable cardinal values. Move to pairwise (the usual sweet spot) when you have many items per query and care about order. Reach for listwise when you are squeezing out the last few points of NDCG and can afford the complexity. In practice, gradient-boosted pairwise/listwise rankers (LambdaMART) are the default production choice for tabular ranking features. ::: @tbl-learning-to-rank-approaches lays out the tradeoffs. ```{r tbl-learning-to-rank-approaches, echo=FALSE} appr_tab <- data.frame( Approach = c("Pointwise", "Pairwise", "Listwise"), `What it models` = c( "Score of one item", "Relative order of an item pair", "Quality of the whole list" ), `Training signal` = c( "Item label (regression/classification)", "Which item in a pair ranks higher", "List-level metric surrogate" ), `Examples` = c( "Linear/GBM regression on labels", "RankNet, RankSVM, GBM pairwise", "ListNet, ListMLE, LambdaMART" ), `Tradeoff` = c( "Simple, but ignores order and position", "Aligns with order, scales as pairs (can be many)", "Most aligned with metric, most complex" ), check.names = FALSE ) knitr::kable( appr_tab, caption = "The three families of learning-to-rank methods. Pairwise is the common default; listwise is the most direct optimizer of list-level metrics." ) ``` @tbl-learning-to-rank-approaches makes clear that moving down the table buys closer alignment with the true metric at the cost of more machinery. ## RankNet: a pairwise model with a probabilistic loss RankNet [@burges2005learning] is the cleanest way to see the pairwise idea, and it underlies the later Lambda methods. The model is any differentiable scoring function $f$ (originally a neural network, see @sec-neural-networks, but it can be linear). For two documents $i$ and $j$ in the same query with scores $s_i = f(\mathbf{x}_i)$ and $s_j = f(\mathbf{x}_j)$, define the modeled probability that $i$ should rank above $j$ using a logistic (sigmoid) function of the score difference: $$ P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}} , $$ where $\sigma > 0$ is a shape parameter. The larger the score gap in the right direction, the closer $P_{ij}$ is to $1$. Let $\bar{P}_{ij}$ be the target: $1$ if $i$ is truly more relevant than $j$, $0$ if less, $0.5$ if tied. RankNet uses the cross-entropy (log loss) between target and model: $$ C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij}) . $$ For a pair where $i$ truly outranks $j$ ($\bar{P}_{ij} = 1$), this simplifies to $C_{ij} = \log\!\big(1 + e^{-\sigma(s_i - s_j)}\big)$, a smooth, convex penalty that is near zero when $s_i \gg s_j$ and grows linearly when the order is badly wrong. This is exactly a logistic loss applied to score *differences* rather than raw scores. We sum $C_{ij}$ over all pairs whose order we know and minimize by gradient descent. ::: {.callout-note} Because the loss depends only on $s_i - s_j$, RankNet learns relative order and is naturally invariant to adding a constant to every score. That is the order-invariance property we wanted from a ranking loss. ::: The gradient of $C_{ij}$ with respect to the score difference has the clean form $$ \frac{\partial C_{ij}}{\partial s_i} = -\frac{\partial C_{ij}}{\partial s_j} = \sigma \left( \frac{1}{2}(1 - S_{ij}) - \frac{1}{1 + e^{\sigma(s_i - s_j)}} \right), $$ where $S_{ij} \in \{-1, 0, +1\}$ encodes whether $i$ is less, equally, or more relevant than $j$. Define $\lambda_{ij}$ as this gradient. A key observation is that each document's total gradient is just the sum of $\lambda_{ij}$ over all pairs it participates in. These per-pair forces $\lambda_{ij}$ can be read as little "tugs": document $i$ is pulled up and $j$ is pulled down whenever $i$ should outrank $j$ but does not by enough. ### Deriving the RankNet gradient It is worth doing the calculus once, because the result is the engine of every Lambda method. Adopt the convention $S_{ij} \in \{-1,0,+1\}$ and write the target as $\bar{P}_{ij} = \tfrac{1}{2}(1 + S_{ij})$, so $S_{ij}=+1$ gives $\bar{P}_{ij}=1$, a tie gives $\tfrac12$, and $S_{ij}=-1$ gives $0$. Let $s_{ij} = s_i - s_j$. The model probability is $P_{ij} = \sigma(\sigma_0 s_{ij})$ where $\sigma(z) = 1/(1+e^{-z})$ and $\sigma_0>0$ is the shape parameter (called $\sigma$ in the text above). Substituting $\bar{P}_{ij}$ into the cross-entropy and using $1 - \sigma(z) = \sigma(-z)$, $$ C_{ij} = -\tfrac{1}{2}(1+S_{ij}) \log \sigma(\sigma_0 s_{ij}) - \tfrac{1}{2}(1 - S_{ij}) \log \sigma(-\sigma_0 s_{ij}). $$ Using the textbook identity $\log \sigma(z) = -\log(1 + e^{-z})$ and collecting terms, $$ C_{ij} = \tfrac{1}{2}(1 - S_{ij})\,\sigma_0 s_{ij} + \log\!\big(1 + e^{-\sigma_0 s_{ij}}\big). $$ {#eq-learning-to-rank-ranknet-cost} The first term is linear, the second is the softplus. For the canonical case $S_{ij}=+1$ the linear term vanishes and @eq-learning-to-rank-ranknet-cost reduces to $\log(1 + e^{-\sigma_0 s_{ij}})$, matching the simplified form stated earlier. Now differentiate. Since $\frac{d}{dz}\log(1+e^{-z}) = -\sigma(-z) = -(1-\sigma(z))$ and $\partial s_{ij}/\partial s_i = 1$, $$ \frac{\partial C_{ij}}{\partial s_i} = \sigma_0\!\left[\tfrac{1}{2}(1 - S_{ij}) - \big(1 - \sigma(\sigma_0 s_{ij})\big)\right] = \sigma_0\!\left[\tfrac{1}{2}(1 - S_{ij}) - \frac{1}{1 + e^{\sigma_0 s_{ij}}}\right], $$ {#eq-learning-to-rank-ranknet-grad} which is exactly $\lambda_{ij}$ as stated, and $\partial C_{ij}/\partial s_j = -\partial C_{ij}/\partial s_i$ because $C_{ij}$ depends on $s_i, s_j$ only through $s_{ij}$. The antisymmetry $\lambda_{ji} = -\lambda_{ij}$ is what lets us aggregate: with $\mathcal{P}$ the set of pairs $(i,j)$ with $i$ more relevant than $j$ (so $S_{ij}=+1$ throughout, the standard bookkeeping), the gradient on a single document accumulates the tugs from every pair it touches, $$ \frac{\partial C}{\partial s_i} = \sum_{j : (i,j) \in \mathcal{P}} \lambda_{ij} \;-\; \sum_{j : (j,i) \in \mathcal{P}} \lambda_{ji} \;\equiv\; \lambda_i, $$ {#eq-learning-to-rank-ranknet-doc-grad} and the chain rule gives $\partial C / \partial \mathbf{w} = \sum_i \lambda_i \, \partial s_i / \partial \mathbf{w}$. For the linear scorer $s_i = \mathbf{w}^\top \mathbf{x}_i$ this is $\sum_{(i,j) \in \mathcal{P}} \lambda_{ij} (\mathbf{x}_i - \mathbf{x}_j)$, the update implemented in the demonstration. The reduction from $O(n^2)$ per-pair gradients to $n$ per-document gradients via @eq-learning-to-rank-ranknet-doc-grad is the factorization trick that makes the method affordable inside a tree learner. ::: {.callout-note title="Convexity"} For a linear scorer, @eq-learning-to-rank-ranknet-cost is convex in $\mathbf{w}$: it is a sum of an affine term and the softplus $\log(1+e^{-\sigma_0 \mathbf{w}^\top \mathbf{d}_{ij}})$ composed with the affine map $\mathbf{w} \mapsto \mathbf{w}^\top \mathbf{d}_{ij}$, and softplus is convex. The pairwise RankNet problem over a linear class therefore has no spurious local minima, which is why the simple gradient descent in the demonstration converges reliably. Convexity is lost once $f$ is a neural network or a tree ensemble. ::: ## From RankNet to LambdaMART RankNet minimizes pairwise error, but pairwise error is not NDCG. Swapping two documents near the bottom of the list barely changes NDCG, while swapping two near the top changes it a lot. Plain RankNet treats both swaps as equally important. LambdaRank [@burges2010ranknet] fixes this with a remarkably simple trick: scale each pairwise tug $\lambda_{ij}$ by the *change in NDCG* that would result from swapping documents $i$ and $j$: $$ \lambda_{ij}^{\text{Lambda}} = \lambda_{ij} \cdot \left| \Delta \text{NDCG}_{ij} \right| . $$ The model never differentiates NDCG (it cannot, NDCG is flat almost everywhere). Instead it defines the gradients *directly*, weighting each pairwise force by how much fixing that pair would improve the metric. Pairs that matter for NDCG pull harder. Empirically these "lambda" gradients optimize NDCG very well even though no closed-form loss is being differentiated. LambdaMART [@burges2010ranknet] is LambdaRank's gradient signal plugged into gradient-boosted regression trees (MART = Multiple Additive Regression Trees, the boosting machine of @sec-boosting) instead of a neural net. At each boosting round it computes the lambda gradients for every document, fits a regression tree to them, and adds that tree to the ensemble. LambdaMART has won ranking competitions and remains a default in production search and recommendation. The practical recipe is short: - Compute scores $s_i$ for all documents in a query with the current ensemble. - For each pair, compute $\lambda_{ij}$ (the RankNet tug) times $|\Delta\text{NDCG}_{ij}|$. - Sum the lambdas per document to get a target gradient. - Fit one regression tree to those targets and add it to the ensemble. ::: {.callout-important title="Key idea"} LambdaMART never optimizes a differentiable copy of NDCG. It hand-defines gradients that point in the direction NDCG improves, weighting each document-pair by the metric gain from fixing it, and feeds those gradients to gradient boosting. ::: Both `xgboost` (objective `rank:pairwise`, `rank:ndcg`) and `lightgbm` (objective `lambdarank`) implement this family, and they are what you will actually run in practice; for the mechanics of tuning these boosted models, see @sec-gradient-boosting. ### What loss does LambdaRank implicitly minimize? The Lambda trick looks like a hack: we scale gradients by $|\Delta\text{NDCG}_{ij}|$ without writing down a loss. A natural worry is whether these gradients are even integrable, that is, whether some function $C^{\text{Lambda}}$ has them as its derivatives. They are. With $S_{ij}=+1$ on every pair in $\mathcal{P}$, the LambdaRank gradient on the pair is $$ \lambda_{ij}^{\text{Lambda}} = -\,\frac{\sigma_0\,|\Delta\text{NDCG}_{ij}|}{1 + e^{\sigma_0 (s_i - s_j)}}, $$ which is exactly $\partial / \partial s_i$ of $$ C^{\text{Lambda}}_{ij} = |\Delta\text{NDCG}_{ij}| \,\log\!\big(1 + e^{-\sigma_0 (s_i - s_j)}\big), $$ {#eq-learning-to-rank-lambdaloss} *provided* we treat $|\Delta\text{NDCG}_{ij}|$ as a constant during differentiation. Of course it is not constant, it depends on the current ranking, hence on the scores. Burges showed empirically that the gradient field is conservative to high accuracy, so LambdaRank behaves as if it were descending a well-defined cost close to @eq-learning-to-rank-lambdaloss, a $\Delta$NDCG-weighted RankNet loss. The weight steepens the logistic penalty precisely on pairs whose swap would most improve NDCG, which is why optimizing the surrogate tracks the metric. Donmez, Svore, and Burges later verified that the LambdaRank update is, to first order, an ascent direction for NDCG at the current scores, giving the method a local-optimality guarantee with respect to the true metric. ### RankSVM: the margin formulation of pairwise ranking RankNet uses a logistic penalty on score differences; RankSVM [@joachims2002optimizing] uses the hinge instead, which connects ranking to the support vector machine of @sec-svm. For a linear scorer and the pair set $\mathcal{P}$, RankSVM solves $$ \min_{\mathbf{w}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{(i,j)\in\mathcal{P}} \big[\,1 - \mathbf{w}^\top (\mathbf{x}_i - \mathbf{x}_j)\,\big]_+, $$ {#eq-learning-to-rank-ranksvm} where $[\,z\,]_+ = \max(0,z)$. This is an ordinary soft-margin SVM on the *difference vectors* $\mathbf{d}_{ij} = \mathbf{x}_i - \mathbf{x}_j$, all with label $+1$: a correctly ordered pair with margin at least $1/\|\mathbf{w}\|$ incurs no loss, an inverted or under-separated pair pays linearly. Kernelizing $\mathbf{d}_{ij}$ buys nonlinear ranking exactly as in classification. RankSVM and RankNet are thus the same pairwise idea under two different convex surrogates of the 0/1 pair-inversion loss (hinge versus logistic), mirroring the SVM-versus-logistic-regression contrast for classification. ### Listwise losses: ListNet and the Plackett-Luce model Listwise methods define a single loss on the whole permutation rather than summing over pairs. The cleanest is ListNet [@cao2007learning], built on the Plackett-Luce distribution over permutations. Given scores $s_1,\dots,s_n$, the probability that item $i$ is ranked first is the softmax $\phi(s_i) = e^{s_i} / \sum_{m} e^{s_m}$, and the probability of a full permutation $\pi = (\pi_1, \dots, \pi_n)$ is the product of sequential top-one choices on the shrinking candidate set, $$ P(\pi \mid s) = \prod_{r=1}^{n} \frac{e^{s_{\pi_r}}}{\sum_{m=r}^{n} e^{s_{\pi_m}}}. $$ {#eq-learning-to-rank-plackett-luce} ListMLE [@xia2008listwise] takes the negative log of @eq-learning-to-rank-plackett-luce for the single ground-truth permutation and minimizes it directly, a clean maximum-likelihood objective. ListNet instead matches the *top-one* marginal distributions: it minimizes the cross-entropy between the softmax over labels $\phi(y_i)$ and the softmax over scores $\phi(s_i)$, $$ C^{\text{ListNet}} = -\sum_{i=1}^{n} \frac{e^{y_i}}{\sum_m e^{y_m}} \,\log \frac{e^{s_i}}{\sum_m e^{s_m}}, $$ {#eq-learning-to-rank-listnet} which is differentiable in the scores at $O(n)$ cost per query, far cheaper than the $O(n^2)$ pair enumeration. The gradient is the familiar softmax-cross-entropy form $\partial C^{\text{ListNet}} / \partial s_i = \phi(s_i) - \phi(y_i)$, pushing the score distribution toward the label distribution. The listwise view also clarifies a theoretical point: Xia et al. proved that the ListMLE/Plackett-Luce surrogate is *statistically consistent* for ranking, meaning its population minimizer recovers the Bayes-optimal ordering, a guarantee that pairwise surrogates do not enjoy in general. ::: {.callout-note title="Consistency caveat"} Consistency results are reassuring but rest on assumptions (a correctly specified, sufficiently rich $\mathcal{F}$, and in some cases noise conditions) that rarely hold exactly in practice. Empirically, the pairwise-versus-listwise gap is usually small and is dominated by feature quality and label noise, which is why LambdaMART, formally a pairwise method with metric-aware weights, remains the production default despite lacking a clean consistency theorem. ::: ## A runnable demonstration in base R We now build the ideas from scratch. First we implement the metrics, then we simulate query-document data, train a pointwise baseline and a simple pairwise (RankNet-style) ranker, and compare them. Everything in this section runs in base R plus `ggplot2`; the gradient-boosted ranker at the end uses `xgboost`. ### Implementing DCG, NDCG, and MAP The functions below take a vector of true relevance labels ordered by the model's predicted score (so `rel_sorted[1]` is the label of the item the model placed first). ```{r learning-to-rank-metric-fns} # DCG at cutoff k using the exponential-gain convention: # gain_i = 2^rel_i - 1, discount_i = log2(i + 1) dcg_at_k <- function(rel_sorted, k = length(rel_sorted)) { k <- min(k, length(rel_sorted)) rel <- rel_sorted[seq_len(k)] gains <- (2^rel - 1) / log2(seq_len(k) + 1) sum(gains) } # NDCG = DCG of the model ordering / DCG of the ideal (by-relevance) ordering ndcg_at_k <- function(rel_sorted, k = length(rel_sorted)) { ideal <- sort(rel_sorted, decreasing = TRUE) idcg <- dcg_at_k(ideal, k) if (idcg == 0) return(0) dcg_at_k(rel_sorted, k) / idcg } # Average precision for binary relevance, given labels in model order average_precision <- function(rel_sorted) { rel_sorted <- as.integer(rel_sorted > 0) # binarize R <- sum(rel_sorted) if (R == 0) return(0) hits <- cumsum(rel_sorted) # relevant so far at each position precision_at_k <- hits / seq_along(rel_sorted) sum(precision_at_k * rel_sorted) / R } # Reciprocal rank: 1 / position of first relevant item reciprocal_rank <- function(rel_sorted) { hit <- which(rel_sorted > 0) if (length(hit) == 0) return(0) 1 / hit[1] } ``` Let us sanity-check the metrics by scoring a few candidate orderings of the same five-item query. The true relevance labels are graded $0$ to $3$. We compare the ideal ordering, a reversed (worst) ordering, and a plausible model ordering. ```{r learning-to-rank-score-orderings} # Five items with graded relevance 0..3 ideal_order <- c(3, 2, 2, 1, 0) # best documents first reversed_order <- c(0, 1, 2, 2, 3) # worst: relevant items buried model_order <- c(2, 3, 1, 0, 2) # a so-so ranking scores <- data.frame( Ordering = c("Ideal", "Reversed", "Model"), NDCG = round(c( ndcg_at_k(ideal_order), ndcg_at_k(reversed_order), ndcg_at_k(model_order) ), 3), `NDCG@3` = round(c( ndcg_at_k(ideal_order, 3), ndcg_at_k(reversed_order, 3), ndcg_at_k(model_order, 3) ), 3), MAP = round(c( average_precision(ideal_order), average_precision(reversed_order), average_precision(model_order) ), 3), MRR = round(c( reciprocal_rank(ideal_order), reciprocal_rank(reversed_order), reciprocal_rank(model_order) ), 3), check.names = FALSE ) scores ``` The ideal ordering scores NDCG $= 1$ by construction, the reversed ordering scores lowest, and the model ordering lands in between, confirming the implementations behave sensibly. @tbl-learning-to-rank-orderings presents the same comparison as a formatted exhibit. ```{r tbl-learning-to-rank-orderings, echo=FALSE} knitr::kable( scores, caption = "Metric values for three candidate orderings of the same five-item query (graded relevance 0 to 3). The ideal ordering achieves NDCG = 1; the reversed ordering is worst on every metric." ) ``` @tbl-learning-to-rank-orderings shows the metrics agree on the ranking of the orderings (Ideal > Model > Reversed) while disagreeing on magnitude, which is expected since each metric weights position and grade differently. ### Simulating query-document data We simulate a dataset with several queries, each holding a handful of candidate documents described by two features. The true relevance is a noisy monotone function of a hidden "utility" that depends on the features, so a model that recovers the utility direction will rank well. ```{r learning-to-rank-simulate} set.seed(2026) n_queries <- 200 docs_per_query <- 8 make_query <- function(qid) { x1 <- rnorm(docs_per_query) # e.g. text match score x2 <- rnorm(docs_per_query) # e.g. popularity score # hidden utility: x1 matters twice as much as x2 utility <- 2 * x1 + 1 * x2 + rnorm(docs_per_query, sd = 0.5) # graded relevance 0..3 from utility quantiles within the query rel <- cut(utility, breaks = quantile(utility, probs = c(0, .4, .7, .9, 1)), labels = FALSE, include.lowest = TRUE) - 1L data.frame(qid = qid, x1 = x1, x2 = x2, rel = rel) } dat <- do.call(rbind, lapply(seq_len(n_queries), make_query)) # split queries into train and test train_q <- 1:150 test_q <- 151:n_queries train <- dat[dat$qid %in% train_q, ] test <- dat[dat$qid %in% test_q, ] head(train, 8) ``` A helper evaluates any scoring function by computing mean NDCG and MAP across the test queries. ```{r learning-to-rank-eval-helper} # Given a data frame with columns qid, rel and a numeric score vector, # compute mean NDCG and MAP over queries (sorting each query by score). evaluate_ranker <- function(df, score, k = docs_per_query) { df$score <- score qids <- unique(df$qid) ndcgs <- numeric(length(qids)) maps <- numeric(length(qids)) for (i in seq_along(qids)) { sub <- df[df$qid == qids[i], ] ord <- order(sub$score, decreasing = TRUE) # model ordering rel_sorted <- sub$rel[ord] ndcgs[i] <- ndcg_at_k(rel_sorted, k) maps[i] <- average_precision(rel_sorted) } c(NDCG = mean(ndcgs), MAP = mean(maps)) } ``` ### A pointwise baseline The pointwise baseline ignores the ranking structure entirely: fit ordinary least squares predicting `rel` from the features, then rank by the fitted value. ```{r learning-to-rank-pointwise} pointwise_fit <- lm(rel ~ x1 + x2, data = train) test_pred_pw <- predict(pointwise_fit, newdata = test) pointwise_perf <- evaluate_ranker(test, test_pred_pw) round(pointwise_perf, 3) ``` ### A pairwise (RankNet-style) ranker from scratch Now the pairwise ranker. We use a linear scoring function $s = \mathbf{w}^\top \mathbf{x}$ and the RankNet log loss over all within-query document pairs, minimized by plain gradient descent. For every ordered pair $(i, j)$ in a query where $i$ is more relevant than $j$, the gradient contribution to $\mathbf{w}$ is the RankNet $\lambda_{ij}$ times the feature difference $(\mathbf{x}_i - \mathbf{x}_j)$. ```{r learning-to-rank-pairwise} # Build all "i more relevant than j" pairs within each training query. build_pairs <- function(df) { feats <- as.matrix(df[, c("x1", "x2")]) pairs <- list() for (q in unique(df$qid)) { idx <- which(df$qid == q) for (a in idx) for (b in idx) { if (df$rel[a] > df$rel[b]) { # a should outrank b pairs[[length(pairs) + 1]] <- feats[a, ] - feats[b, ] } } } do.call(rbind, pairs) # rows = x_i - x_j for i > j } pair_diffs <- build_pairs(train) cat("Number of training pairs:", nrow(pair_diffs), "\n") # RankNet loss for the "i > j" pairs (target prob = 1): # C = sum log(1 + exp(-sigma * w . (x_i - x_j))) # Gradient wrt w: sum -sigma * (x_i - x_j) / (1 + exp(sigma * w . diff)) sigma <- 1 w <- c(0, 0) # initial weights for x1, x2 lr <- 0.05 n_iter <- 200 loss_hist <- numeric(n_iter) for (t in seq_len(n_iter)) { margin <- as.numeric(pair_diffs %*% w) # w . (x_i - x_j) per pair loss_hist[t] <- mean(log1p(exp(-sigma * margin))) grad <- -sigma * colMeans(pair_diffs / (1 + exp(sigma * margin))) w <- w - lr * grad } cat("Learned weights (x1, x2):", round(w, 3), "\n") test_pred_pair <- as.numeric(as.matrix(test[, c("x1", "x2")]) %*% w) pairwise_perf <- evaluate_ranker(test, test_pred_pair) round(pairwise_perf, 3) ``` The learned weights should recover the hidden utility direction (roughly a 2:1 ratio favoring `x1`), because that is the direction that orders pairs correctly. @fig-learning-to-rank-training shows the pairwise loss decreasing over gradient-descent iterations alongside the learned weight ratio converging toward the true value of 2. ```{r fig-learning-to-rank-training, fig.cap="Left: the RankNet pairwise log loss falls smoothly as gradient descent orders more training pairs correctly. Right: the ratio of learned weights (x1 over x2) converges toward the true utility ratio of 2 (dashed line).", fig.width=8, fig.height=4} # Re-run the descent while recording the weight ratio at each step w <- c(0, 0); ratio_hist <- numeric(n_iter) for (t in seq_len(n_iter)) { margin <- as.numeric(pair_diffs %*% w) grad <- -sigma * colMeans(pair_diffs / (1 + exp(sigma * margin))) w <- w - lr * grad ratio_hist[t] <- w[1] / w[2] } par(mfrow = c(1, 2), mar = c(4, 4, 2, 1)) plot(loss_hist, type = "l", lwd = 2, col = "steelblue", xlab = "Iteration", ylab = "Mean pairwise log loss", main = "Pairwise loss") plot(ratio_hist, type = "l", lwd = 2, col = "darkorange", xlab = "Iteration", ylab = "Learned weight ratio (x1 / x2)", main = "Weight ratio") abline(h = 2, lty = 2, col = "grey40") ``` @fig-learning-to-rank-training confirms two things at once: the surrogate loss is being minimized, and minimizing it drives the model toward the feature weighting that produces correct orderings. We can also verify the analytic RankNet gradient @eq-learning-to-rank-ranknet-grad against a finite-difference approximation, a cheap check that the calculus above is right and that the descent step uses the correct gradient. ```{r learning-to-rank-grad-check} # Mean RankNet loss over the "i > j" pairs as a function of w ranknet_loss <- function(w) { margin <- as.numeric(pair_diffs %*% w) mean(log1p(exp(-sigma * margin))) } # Analytic gradient (the one used in the descent loop) analytic_grad <- function(w) { margin <- as.numeric(pair_diffs %*% w) -sigma * colMeans(pair_diffs / (1 + exp(sigma * margin))) } # Central finite differences numeric_grad <- function(w, eps = 1e-6) { sapply(seq_along(w), function(j) { wp <- w; wm <- w; wp[j] <- wp[j] + eps; wm[j] <- wm[j] - eps (ranknet_loss(wp) - ranknet_loss(wm)) / (2 * eps) }) } w_chk <- c(0.7, -0.3) rbind(analytic = analytic_grad(w_chk), numeric = numeric_grad(w_chk)) ``` The two rows agree to roughly six digits, confirming @eq-learning-to-rank-ranknet-grad. ### A gradient-boosted ranker with xgboost Finally we train a real LambdaMART-style ranker with `xgboost`, using the `rank:ndcg` objective. The grouping vector tells xgboost how many rows belong to each query so it only forms pairs within a query. ```{r learning-to-rank-xgboost} library(xgboost) # xgboost needs rows grouped by query and a group-size vector train_sorted <- train[order(train$qid), ] test_sorted <- test[order(test$qid), ] dtrain <- xgb.DMatrix( data = as.matrix(train_sorted[, c("x1", "x2")]), label = train_sorted$rel ) setinfo(dtrain, "group", as.numeric(table(train_sorted$qid))) xgb_fit <- xgb.train( params = list( objective = "rank:ndcg", eta = 0.1, max_depth = 4, min_child_weight = 1 ), data = dtrain, nrounds = 60, verbose = 0 ) test_pred_xgb <- predict(xgb_fit, as.matrix(test_sorted[, c("x1", "x2")])) xgb_perf <- evaluate_ranker(test_sorted, test_pred_xgb) round(xgb_perf, 3) ``` We collect all three rankers into one comparison. @tbl-learning-to-rank-comparison reports mean NDCG and MAP on the held-out queries. ```{r tbl-learning-to-rank-comparison, echo=FALSE} comp <- data.frame( Ranker = c("Pointwise (OLS)", "Pairwise (RankNet, linear)", "Gradient-boosted (xgboost rank:ndcg)"), NDCG = round(c(pointwise_perf["NDCG"], pairwise_perf["NDCG"], xgb_perf["NDCG"]), 3), MAP = round(c(pointwise_perf["MAP"], pairwise_perf["MAP"], xgb_perf["MAP"]), 3), row.names = NULL ) knitr::kable( comp, caption = "Held-out ranking quality for three approaches on the simulated query-document data. All three do well here because the true utility is linear; the boosted ranker has the most capacity but the least to gain on this easy, linear signal." ) ``` @tbl-learning-to-rank-comparison shows that on this deliberately linear problem the simple pairwise ranker is competitive with gradient boosting, a useful reminder that model complexity should match signal complexity. On real data with nonlinear feature interactions, the gradient-boosted ranker typically pulls ahead. ## Theoretical and computational properties A few properties recur across the methods and are worth stating once. Computational complexity. A query with $n$ candidates and $R$ relevant items generates $O(nR)$ ordered relevant/non-relevant pairs (up to $O(n^2)$ when grades are dense), so naive pairwise training costs $O(\sum_q n_q^2)$ per epoch, the dominant term for long candidate lists. The factorization in @eq-learning-to-rank-ranknet-doc-grad keeps the *gradient assembly* at $O(n^2)$ work for the $\Delta$NDCG terms but reduces the storage and number of tree targets from $O(n^2)$ (one per pair) to $O(n)$ (one summed lambda per document), which is why LambdaMART scales. Listwise softmax losses such as @eq-learning-to-rank-listnet cost $O(n)$ per query and avoid pair enumeration entirely. Sorting for NDCG and IDCG is $O(n \log n)$ per query. The discount makes top-heavy errors cheap to ignore. Because the NDCG discount $1/\log_2(i+1)$ decays, $|\Delta\text{NDCG}_{ij}|$ is tiny when both $i$ and $j$ sit deep in the list. The Lambda weighting therefore concentrates almost all gradient mass on a small number of near-top pairs, which both speeds training (deep pairs can be subsampled with little loss) and explains why LambdaMART optimizes top-$k$ metrics so effectively: it spends its capacity where the metric is sensitive. Generalization and sample complexity. Generalization in ranking is governed by the number of *queries*, not the number of documents, since candidates within a query are dependent. Margin-based bounds for RankSVM (and Rademacher-complexity bounds for pairwise classes) scale with $1/\sqrt{|Q|}$ and with the pairwise margin $1/\|\mathbf{w}\|$, not with the raw pair count; adding more documents per query enlarges the training set far less than adding more queries. This is the formal counterpart to the practical rule that label budget is better spent judging more queries than exhaustively judging one. Failure modes. Three are characteristic. First, pairwise surrogates are *position-agnostic by construction* (plain RankNet weights a deep swap and a top swap equally), which is the exact deficiency LambdaRank repairs. Second, all of these methods inherit *selection bias* from logged data: pairs are only formed among items that were shown, so a model can be consistent for the observed conditional distribution yet wrong for the deployment distribution (the basis for inverse-propensity-weighted and unbiased LTR). Third, when relevance grades are noisy, the $2^{\text{rel}}-1$ gain *amplifies* label error at the top grades, so a single mislabeled "perfect" document can dominate a query's gradient. ## Practical guidance and pitfalls A few hard-won lessons separate a working ranking system from a broken one. Always group by query. The single most common bug is forming document pairs across different queries. A document's relevance is only meaningful relative to other candidates for the *same* query; comparing a document from query A against one from query B is nonsense. Every pairwise and listwise method needs a query/group identifier, and metrics must be computed per query then averaged. In `xgboost` and `lightgbm` this is the `group` vector; forget it and the model silently trains on garbage pairs. Evaluate offline with the metric you care about, online with experiments. Offline NDCG/MAP on logged judgments is necessary but not sufficient. Logged data is biased: users only clicked what was *shown*, and what was shown was chosen by the old ranker (this is *position bias* and *presentation bias*). A model that scores well offline can lose in an A/B test. Treat offline metrics as a filter, not a verdict. Mind the gain and discount conventions. As noted, two NDCG formulas circulate. If your offline number disagrees with a library's, check whether both use the $2^{\text{rel}} - 1$ gain and the $\log_2(i+1)$ discount. Mismatched conventions are a frequent source of "my metric does not match the paper." Calibrate cutoffs to the surface. A search results page shows ten links; a mobile carousel shows three. Optimize and report NDCG@k for the $k$ that matches where users actually look. Reporting NDCG over all candidates can hide failures right where they hurt. ::: {.callout-warning} Graded-relevance labels are expensive and noisy. Human judges disagree, and "relevance" drifts with intent and time. Before investing in a listwise model, make sure your labels are good enough that the metric differences you are chasing are real and not label noise. ::: Pairwise count can explode. A query with $n$ candidates generates up to $O(n^2)$ pairs. With long candidate lists this dominates training time. Sampling pairs, capping candidates, or using listwise losses that avoid explicit pair enumeration are the usual remedies. Features matter more than the loss. In tabular ranking, careful query-document features (match scores, freshness, popularity, personalization signals) typically move NDCG more than the choice between pairwise and listwise. Spend effort there first. ## Further reading - Burges, Shaked, Renshaw, Lazier, Deeds, Hamilton, and Hullender (2005), "Learning to Rank Using Gradient Descent," introduces RankNet and the pairwise probabilistic loss. - Burges (2010), "From RankNet to LambdaRank to LambdaMART: An Overview," is the definitive walk-through of the lambda-gradient idea and LambdaMART; readable and self-contained. - Liu (2009), "Learning to Rank for Information Retrieval," is the standard survey organizing the pointwise, pairwise, and listwise families. - Cao, Qin, Liu, Tsai, and Li (2007), "Learning to Rank: From Pairwise Approach to Listwise Approach," introduces ListNet and motivates listwise training. - Järvelin and Kekäläinen (2002), "Cumulated Gain-Based Evaluation of IR Techniques," is the original source for DCG and NDCG. - Manning, Raghavan, and Schütze (2008), *Introduction to Information Retrieval*, gives the classic treatment of precision, recall, MAP, and MRR with worked examples. - Chen and Guestrin (2016), "XGBoost: A Scalable Tree Boosting System," documents the gradient-boosted ranking objectives used in the demonstration above.