Most of the models in this book are trained on one dataset to solve one prediction problem. Transfer learning and multi-task learning relax that assumption. Transfer learning reuses knowledge learned on a source problem to help a related target problem, usually one with less labeled data. Multi-task learning trains a single model on several related tasks at once so that the tasks share statistical strength. Both ideas rest on the same bet: if two problems are related, then the data for one carries information about the other, and a model that respects that relatedness can beat models trained in isolation.
This chapter explains where these methods fit in a modern workflow, derives the multi-task objective for the linear case, and gives a runnable base-R demonstration that compares a joint multi-task estimator to independent per-task models on simulated related tasks. By the end you should be able to say, for a given problem, whether to train tasks separately, pool them into one, or tie them together with a tunable penalty, and to recognize when sharing backfires.
Intuition
Think of two students studying for related exams. If they compare notes, each learns faster than working alone, as long as the subjects truly overlap. If the subjects are unrelated, swapping notes just adds confusion. Transfer and multi-task learning automate that judgment: share when sharing pays, and back off when it does not.
54.1 Intuition and workflow context
A data scientist rarely starts from zero. A churn model for a new product line can borrow from a churn model on an established line. A sentiment classifier for a niche domain can start from a general language model (Chapter 40). A demand forecast for a store that opened last month can borrow from stores that have years of history. In each case the target task has little data but a related source task has a lot, and we want to move usable structure from source to target.
It helps to fix notation. Let a task be indexed by \(t = 1, \dots, T\). Task \(t\) has data \(\{(x_{ti}, y_{ti})\}_{i=1}^{n_t}\) drawn from a distribution \(P_t(x, y)\). In standard supervised learning we fit one model per task by minimizing its own loss. Transfer and multi-task learning instead assume the \(P_t\) share structure, for example a common feature representation or parameter vectors that are close to one another, and they exploit that shared structure during fitting.
Two distinctions organize the field.
The first is the direction of reuse. Transfer learning is asymmetric: a source model is trained first, then adapted to a target. Multi-task learning is symmetric: all tasks are trained together and each is allowed to help the others.
The second is what gets shared. In deep models the dominant transfer pattern is to share a learned representation (the lower layers of a neural network, see Chapter 15) and keep task-specific heads on top.[1] In classical models the shared object is more often the coefficient vector or a low-dimensional subspace that the coefficients live in.
Key idea
Transfer learning and multi-task learning are not separate algorithms so much as a stance: assume related problems share structure, then design the model so that structure can be reused. The rest of this chapter is about what to share and how strongly.
In a modern ML/AI workflow these methods show up at predictable points. When you pull a pretrained image or text encoder and adapt it, that is transfer learning. When you train one model to predict several related business outcomes from the same features, that is multi-task learning. When your training distribution differs from your serving distribution, the repair is domain adaptation, a special case of transfer where the label space is fixed but the input distribution shifts.
54.2 Feature extraction versus fine-tuning
The previous section set up the two big choices, direction of reuse and what gets shared. We now zoom in on the most common modern case: the shared object is a pretrained representation, and you want to adapt it to your target task. There are two ways to do that.
Feature extraction freezes the pretrained network and uses its output (or an intermediate layer) as a fixed feature vector. You train only a small new head, for example a logistic regression or a shallow dense layer, on those features. This is cheap, needs little target data, and cannot overfit the backbone because the backbone does not move.
Fine-tuning unfreezes some or all of the pretrained weights and continues training them on the target task, usually with a smaller learning rate so the pretrained structure is not destroyed. This is more expensive and needs more target data, but it lets the representation itself adapt to the target.
The choice is governed by how much target data you have and how similar the source and target are. Table 54.1 summarizes the standard guidance, following Yosinski et al. (2014) and the practical advice in Howard and Ruder (2018).
Table 54.1: Recommended adaptation strategy as a function of target data size and source-target similarity.
| Target data | Source vs. target similarity | Recommended approach |
|---|---|---|
| Small | Similar | Feature extraction, train head only |
| Small | Different | Feature extraction from earlier layers |
| Large | Similar | Fine-tune the whole network |
| Large | Different | Fine-tune, or train from scratch |
When to use this
With only a few hundred labeled target examples, default to feature extraction: it has few parameters to estimate and cannot damage the backbone. Reach for fine-tuning only when target data are plentiful enough to move the backbone safely without overfitting.
A useful refinement is gradual unfreezing: start by training the head with the backbone frozen, then unfreeze the top blocks, then deeper blocks, each time with a lower learning rate. This protects the most general (lowest) layers, which tend to transfer across tasks, while letting the most specific (highest) layers adapt.
The following Keras sketch shows both patterns on an image backbone. It is not run here because it needs a GPU and downloaded weights, but it is correct idiomatic code.
```r
library(keras)

# Pretrained backbone without its classification head
base <- application_efficientnet_b0(
  include_top = FALSE,
  weights = "imagenet",
  input_shape = c(224, 224, 3),
  pooling = "avg"
)

# --- Pattern 1: feature extraction (backbone frozen) ---
freeze_weights(base)

model <- keras_model_sequential() %>%
  base() %>%
  layer_dropout(0.2) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 1e-3),
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)
# model %>% fit(train_ds, epochs = 5, validation_data = val_ds)

# --- Pattern 2: fine-tuning (unfreeze top of backbone) ---
unfreeze_weights(base, from = "block6a_expand_conv")

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 1e-5),  # small LR
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)
# model %>% fit(train_ds, epochs = 10, validation_data = val_ds)
```
54.3 Domain adaptation
Feature extraction and fine-tuning assume the source and target solve different tasks. Domain adaptation handles the opposite case: the task stays the same, but the inputs come from a different distribution. Let the source domain be \(P_S(x, y)\) and the target domain \(P_T(x, y)\). Covariate shift is the assumption that the conditional \(P(y \mid x)\) is the same in both domains while the marginal \(P(x)\) differs, so \(P_S(x) \neq P_T(x)\) but \(P_S(y \mid x) = P_T(y \mid x)\).
Under covariate shift the right correction is importance weighting. The target risk can be written as an expectation over the source distribution reweighted by the density ratio \(w(x) = P_T(x) / P_S(x)\):
\[
R_T(f)
= \mathbb{E}_{(x,y) \sim P_T}\big[\ell(f(x), y)\big]
= \mathbb{E}_{(x,y) \sim P_S}\big[w(x)\,\ell(f(x), y)\big].
\]
Importance weighting says: trust source points that look like target points, and discount source points that do not. A source example sitting in a region the target rarely visits gets a small weight, so the fitted model concentrates on the part of input space you actually care about at serving time.
So you fit the model on source data but weight each source point by how likely it is under the target marginal relative to the source marginal. Estimating \(w(x)\) directly by density estimation is hard, so in practice you train a classifier to distinguish source from target inputs and turn its probability into the weight, as in Sugiyama, Krauledat, and Müller (2007).
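A minimal sketch of the classifier trick in base R, under simple assumptions: one-dimensional Gaussian inputs, a logistic regression as the domain classifier, and a toy response. The variable names (`x_S`, `x_T`, `w_hat`) and the simulated data are illustrative and are not part of the chapter's demonstration. The classifier's odds \(p/(1-p)\) estimate \(P_T(x)/P_S(x)\) up to the sample-size ratio \(n_T/n_S\), so the code rescales by \(n_S/n_T\).

```r
set.seed(42)
n_S <- 2000; n_T <- 2000
x_S <- rnorm(n_S, mean = 0)   # source inputs
x_T <- rnorm(n_T, mean = 1)   # target inputs: same shape, shifted mean

# Domain classifier: is_target = 1 for target rows, 0 for source rows.
dom <- data.frame(x = c(x_S, x_T),
                  is_target = rep(c(0, 1), times = c(n_S, n_T)))
clf <- glm(is_target ~ x, family = binomial(), data = dom)

# Convert P(target | x) on the source points into density-ratio weights.
p_hat <- predict(clf, newdata = data.frame(x = x_S), type = "response")
w_hat <- (p_hat / (1 - p_hat)) * (n_S / n_T)

# For these two Gaussians the true ratio is known: w(x) = exp(x - 1/2).
w_true <- exp(x_S - 0.5)
cat("correlation of log weights:",
    round(cor(log(w_hat), log(w_true)), 4), "\n")

# Importance-weighted fit on source data (toy response; P(y | x) is the
# same in both domains by construction).
y_S <- sin(x_S) + rnorm(n_S, sd = 0.1)
fit_w <- lm(y_S ~ x_S, weights = w_hat)
```

For these two Gaussians the log-odds are exactly linear in \(x\), so the logistic model is correctly specified. With real data the classifier has to be flexible enough and reasonably calibrated before its odds can be trusted as weights.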
54.3.1 Why the identity holds, and what it costs
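The reweighting identity is a one-line change of measure, valid whenever \(P_S(x) > 0\) wherever \(P_T(x) > 0\) (the support condition, without which the ratio is undefined and importance weighting cannot work):
\[
\mathbb{E}_{P_T}\big[\ell(f(x), y)\big]
= \iint \ell(f(x), y)\, P_T(y \mid x)\, P_T(x)\, dy\, dx
= \iint \ell(f(x), y)\, P_S(y \mid x)\, \frac{P_T(x)}{P_S(x)}\, P_S(x)\, dy\, dx,
\]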
where the second equality multiplies and divides by \(P_S(x)\) and uses \(P_T(y\mid x) = P_S(y\mid x)\), the covariate-shift assumption. The right-hand side is \(\mathbb{E}_{P_S}[w(x)\,\ell(f(x),y)]\), so the weighted empirical risk \(\hat R_w(f) = \frac{1}{n_S}\sum_{i=1}^{n_S} w(x_i)\,\ell(f(x_i), y_i)\) is an unbiased and (under standard conditions) consistent estimator of the target risk. The crucial caveat is that the conditional \(P(y\mid x)\) must genuinely be invariant. If the label rule itself shifts (concept drift), reweighting the inputs corrects the wrong thing and can make matters worse.
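Unbiasedness is not free. The weighted estimator has variance inflated by the spread of the weights. Its effective sample size is
\[
n_{\text{eff}}
= \frac{\big(\sum_{i=1}^{n_S} w(x_i)\big)^2}{\sum_{i=1}^{n_S} w(x_i)^2}
= \frac{n_S}{1 + \widehat{\operatorname{CV}}^2(w)},
\tag{54.1}
\]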
where \(\widehat{\operatorname{CV}}^2(w)\) is the squared coefficient of variation of the weights. When source and target overlap poorly, a few points carry almost all the weight, \(\operatorname{CV}^2(w)\) explodes, and \(n_{\text{eff}}\) collapses far below \(n_S\): the correction is unbiased but useless. This is why practitioners clip or self-normalize the weights, trading a little bias for a large reduction in variance, and why Equation 54.1 is the right diagnostic to monitor before trusting an importance-weighted fit. When representations are learned, an alternative is to align the source and target feature distributions during training, for example by matching their means and covariances or by an adversarial domain classifier, as in Ganin et al. (2016).
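The effective-sample-size diagnostic is a one-liner, sketched here in base R. The function names (`n_eff`, `n_eff_cv`, `clip`) and the two simulated weight vectors are illustrative assumptions; the weights are exact density ratios for a unit-variance Gaussian target shifted by 0.2 and by 3 relative to a standard normal source.

```r
# Effective sample size of a weight vector: (sum w)^2 / sum(w^2).
n_eff <- function(w) sum(w)^2 / sum(w^2)

# The same quantity via the squared coefficient of variation
# (population variance, i.e. dividing by n).
n_eff_cv <- function(w) {
  cv2 <- mean((w - mean(w))^2) / mean(w)^2
  length(w) / (1 + cv2)
}

# Clipping caps the largest weights: some bias, much less variance.
clip <- function(w, cap = quantile(w, 0.99)) pmin(w, cap)

set.seed(5)
x_S <- rnorm(1000)                 # source inputs ~ N(0, 1)
w_mild   <- exp(0.2 * x_S - 0.02)  # target N(0.2, 1): good overlap
w_severe <- exp(3.0 * x_S - 4.5)   # target N(3, 1): poor overlap

print(round(c(mild           = n_eff(w_mild),
              severe         = n_eff(w_severe),
              severe_clipped = n_eff(clip(w_severe))), 1))
```

Clipping can only raise \(n_{\text{eff}}\), because it lowers the largest weights first. It does so by giving up exact unbiasedness, so the cap is a tuning choice and not a free improvement.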
54.4 Parameter sharing: hard and soft
Domain adaptation reused a model across input distributions. We now turn to the symmetric, multi-task side, where several tasks are trained together and the architecture decides how parameters are shared. There are two canonical schemes, described in Ruder (2017).
Hard parameter sharing uses one shared body for all tasks and a small task-specific head per task. The shared body is forced to learn a representation that serves every task. This is the most common multi-task setup: it is parameter-efficient, and the shared body acts as a strong regularizer because it must fit all tasks at once.
Soft parameter sharing gives each task its own full set of parameters but adds a penalty that keeps the parameters of different tasks close. Nothing is literally shared; the coupling is through the penalty. This is more flexible when tasks are only loosely related, at the cost of more parameters.
Note
Hard sharing forces tasks to use exactly the same body, so the coupling is total and not tunable. Soft sharing keeps separate parameters and tunes the coupling through a penalty, so you can dial sharing from none to near-total. That tunability is what makes the soft case the right one to study in detail.
The base-R demonstration below is an instance of soft sharing in its purest linear form. Each task has its own coefficient vector, and a penalty pulls the vectors toward a common center.
54.5 Multi-task linear model: derivation
Consider \(T\) linear regression tasks. Task \(t\) has design matrix \(X_t \in \mathbb{R}^{n_t \times p}\), response \(y_t \in \mathbb{R}^{n_t}\), and coefficient vector \(\beta_t \in \mathbb{R}^p\). Independent ordinary least squares solves \(T\) separate problems:
\[
\hat\beta_t^{\,\text{ind}}
= \arg\min_{\beta_t} \; \lVert y_t - X_t \beta_t \rVert_2^2,
\qquad t = 1, \dots, T.
\]
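When the tasks are related, their true coefficient vectors are close to a shared center. We model this with a decomposition \(\beta_t = \beta_0 + v_t\), where \(\beta_0\) is a common vector shared by all tasks and \(v_t\) is a small task-specific deviation. We then penalize the deviations and fit all tasks jointly:
\[
\min_{\beta_0, v_1, \dots, v_T} \;
\sum_{t=1}^{T} \lVert y_t - X_t(\beta_0 + v_t) \rVert_2^2
+ \lambda \sum_{t=1}^{T} \lVert v_t \rVert_2^2 .
\]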
Intuition
The split \(\beta_t = \beta_0 + v_t\) says each task is the shared answer plus a small personal correction. Penalizing only \(v_t\) keeps the corrections small unless the data really demand them, which is exactly the “borrow by default, deviate when justified” behavior we want.
The penalty \(\lambda\) controls how much the tasks are tied together. As \(\lambda \to 0\) each task is free and we recover independent fits (each task absorbs everything into its own \(v_t\)). As \(\lambda \to \infty\) all deviations are forced to zero, \(\beta_t = \beta_0\) for every task, and we recover a single pooled model fit on the stacked data. Intermediate \(\lambda\) interpolates between these two extremes, which is exactly the multi-task regime: borrow strength across tasks without forcing them to be identical. This is the regularized multi-task formulation of Evgeniou and Pontil (2004).
The objective is a single quadratic in the stacked parameter vector \(\theta = (\beta_0, v_1, \dots, v_T)\), so it has a closed form. Build a block design matrix \(Z\) where the columns for \(\beta_0\) repeat \(X_t\) across all task rows and the columns for \(v_t\) contain \(X_t\) only on task \(t\)’s rows. Stack all responses into \(y\). With a diagonal penalty matrix \(P\) that penalizes only the \(v_t\) blocks (entries \(\lambda\)) and not \(\beta_0\) (entries \(0\)), the solution is ridge-like:
\[
\hat\theta = (Z^\top Z + P)^{-1} Z^\top y .
\]
Note
This is just ridge regression on a cleverly laid-out design matrix. The only twist is that the penalty matrix \(P\) leaves the shared block \(\beta_0\) unpenalized and applies \(\lambda\) only to the deviation blocks \(v_t\). Once you see that, the multi-task estimator is no harder to compute than any other ridge fit.
Each task estimate is then \(\hat\beta_t = \hat\beta_0 + \hat v_t\). The next section builds exactly this \(Z\) and \(P\) in base R.
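The two limits of \(\lambda\) can be checked numerically with the block solve. This is a small self-contained sketch on simulated data; the helper name `mtl_betas`, the toy dimensions, and the two penalty values (1e-4 and 1e6 as stand-ins for "near zero" and "near infinity") are assumptions made for illustration.

```r
set.seed(11)
T_n <- 3; p <- 2; n <- 30
tasks <- lapply(1:T_n, function(t) {
  X <- matrix(rnorm(n * p), n, p)
  beta_t <- c(1, -1) + rnorm(p, sd = 0.3)
  list(X = X, y = as.numeric(X %*% beta_t) + rnorm(n, sd = 0.5))
})

# Block solve: theta = (Z'Z + P)^{-1} Z'y, then beta_t = beta_0 + v_t.
# Returns a p x T matrix with one column per task.
mtl_betas <- function(tasks, lambda) {
  p <- ncol(tasks[[1]]$X); T_n <- length(tasks)
  Z <- do.call(rbind, lapply(1:T_n, function(t) {
    Zt <- matrix(0, nrow(tasks[[t]]$X), p * (T_n + 1))
    Zt[, 1:p] <- tasks[[t]]$X                        # shared block
    Zt[, (t * p + 1):((t + 1) * p)] <- tasks[[t]]$X  # deviation block of task t
    Zt
  }))
  y <- unlist(lapply(tasks, function(z) z$y))
  P <- diag(c(rep(0, p), rep(lambda, p * T_n)))
  theta <- as.numeric(solve(crossprod(Z) + P, crossprod(Z, y)))
  sapply(1:T_n, function(t) theta[1:p] + theta[(t * p + 1):((t + 1) * p)])
}

# Reference fits: independent OLS per task (p x T) and one pooled vector.
ols <- sapply(tasks, function(z)
  as.numeric(solve(crossprod(z$X), crossprod(z$X, z$y))))
X_all <- do.call(rbind, lapply(tasks, function(z) z$X))
y_all <- unlist(lapply(tasks, function(z) z$y))
pooled <- as.numeric(solve(crossprod(X_all), crossprod(X_all, y_all)))

gap_small <- max(abs(mtl_betas(tasks, 1e-4) - ols))    # tiny lambda vs OLS
gap_large <- max(abs(mtl_betas(tasks, 1e6) - pooled))  # huge lambda vs pooled
cat("small-lambda gap:", signif(gap_small, 2),
    " large-lambda gap:", signif(gap_large, 2), "\n")
```

Both gaps should be close to zero. Note that \(\lambda = 0\) itself is not usable in the block form, because the shared and deviation columns of \(Z\) are then perfectly collinear.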
54.5.1 Deriving the closed form from the normal equations
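The block formulation above hides the structure. It is worth deriving the solution directly, because the stationarity conditions expose exactly how the shared center and the deviations are coupled. Write the objective as
\[
J(\beta_0, v_1, \dots, v_T)
= \sum_{t=1}^{T} \lVert y_t - X_t(\beta_0 + v_t) \rVert_2^2
+ \lambda \sum_{t=1}^{T} \lVert v_t \rVert_2^2 .
\]
Setting the gradient with respect to each \(v_t\) to zero gives \((X_t^\top X_t + \lambda I)\, v_t = X_t^\top (y_t - X_t \beta_0)\), so for a fixed center each deviation is a ridge fit to that task’s residuals:
\[
\hat v_t = A_t (y_t - X_t \beta_0),
\qquad
A_t = (X_t^\top X_t + \lambda I)^{-1} X_t^\top .
\]
Setting the gradient with respect to \(\beta_0\) to zero gives
\[
\sum_{t=1}^{T} X_t^\top \big( y_t - X_t \beta_0 - X_t v_t \big) = 0 .
\tag{54.3}
\]
The condition for \(v_t\) can be rewritten as \(X_t^\top (y_t - X_t \beta_0 - X_t v_t) = \lambda v_t\). Summing over tasks and comparing with Equation 54.3 gives \(\lambda \sum_t \hat v_t = 0\). The deviations sum to zero at the optimum: the shared center is the point from which the task-specific corrections balance out, which is the precise sense in which \(\beta_0\) is a “center.” Substituting \(\hat v_t = A_t(y_t - X_t\beta_0)\) into Equation 54.3 yields a single linear system for \(\beta_0\),
\[
\Big[ \sum_{t=1}^{T} X_t^\top (I - X_t A_t)\, X_t \Big] \hat\beta_0
= \sum_{t=1}^{T} X_t^\top (I - X_t A_t)\, y_t ,
\]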
after which each \(\hat v_t\) follows in closed form. This is algebraically identical to the block solution \(\hat\theta = (Z^\top Z + P)^{-1} Z^\top y\), but it makes the mechanism transparent: profiling out the deviations leaves a \(p \times p\) system in the shared center, so the multi-task fit costs essentially one ridge solve plus \(T\) small ridge solves, not one dense \((T+1)p\) solve, when implemented carefully.
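The equivalence between the profiled route and the block solve can be checked numerically. The sketch below assumes the stationarity conditions of the joint objective, with \(A_t = (X_t^\top X_t + \lambda I)^{-1} X_t^\top\) and the center solving \(\sum_t X_t^\top (I - X_t A_t) X_t\, \beta_0 = \sum_t X_t^\top (I - X_t A_t)\, y_t\). The toy data, the unequal sample sizes, and all variable names are illustrative.

```r
set.seed(3)
T_n <- 4; p <- 3; lambda <- 2
tasks <- lapply(1:T_n, function(k) {
  n_k <- 15 + 5 * k                          # unequal sample sizes on purpose
  X <- matrix(rnorm(n_k * p), n_k, p)
  b <- c(1, 0, -1) + rnorm(p, sd = 0.3)
  list(X = X, y = as.numeric(X %*% b) + rnorm(n_k))
})

# A_k = (X_k'X_k + lambda I)^{-1} X_k'  (a p x n_k matrix per task)
A <- lapply(tasks, function(z) solve(crossprod(z$X) + lambda * diag(p), t(z$X)))

# Profiled p x p system for the shared center beta_0.
lhs <- matrix(0, p, p); rhs <- numeric(p)
for (k in 1:T_n) {
  X <- tasks[[k]]$X; y <- tasks[[k]]$y
  M <- t(X) - crossprod(X) %*% A[[k]]        # X_k'(I - X_k A_k)
  lhs <- lhs + M %*% X
  rhs <- rhs + as.numeric(M %*% y)
}
beta0_prof <- as.numeric(solve(lhs, rhs))

# Deviations follow in closed form: v_k = A_k (y_k - X_k beta_0).
v_prof <- lapply(1:T_n, function(k)
  as.numeric(A[[k]] %*% (tasks[[k]]$y - tasks[[k]]$X %*% beta0_prof)))

# Block solve for comparison.
Z <- do.call(rbind, lapply(1:T_n, function(k) {
  Zk <- matrix(0, nrow(tasks[[k]]$X), p * (T_n + 1))
  Zk[, 1:p] <- tasks[[k]]$X
  Zk[, (k * p + 1):((k + 1) * p)] <- tasks[[k]]$X
  Zk
}))
y_all <- unlist(lapply(tasks, function(z) z$y))
pen <- c(rep(0, p), rep(lambda, p * T_n))
theta <- as.numeric(solve(crossprod(Z) + diag(pen), crossprod(Z, y_all)))

max_gap <- max(abs(beta0_prof - theta[1:p]))  # profiled vs block center
sum_v <- Reduce(`+`, v_prof)                  # deviations should sum to zero
cat("max |beta0 gap| =", signif(max_gap, 3),
    " max |sum of v_t| =", signif(max(abs(sum_v)), 3), "\n")
```

Both printed quantities should be at the level of floating-point error: the two routes give the same center, and the fitted deviations balance out.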
54.5.2 The ridge-on-deviations as a Gaussian prior
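The penalty \(\lambda \sum_t \lVert v_t \rVert_2^2\) is not arbitrary. It is the negative log of a hierarchical Gaussian prior. Take
\[
y_t \mid \beta_0, v_t \sim \mathcal{N}\big(X_t(\beta_0 + v_t),\, \sigma^2 I\big),
\qquad
v_t \sim \mathcal{N}(0,\, \tau^2 I),
\qquad t = 1, \dots, T,
\]
with a flat prior on \(\beta_0\). Up to an additive constant, the negative log posterior is \(\frac{1}{2\sigma^2} \sum_t \lVert y_t - X_t(\beta_0 + v_t) \rVert_2^2 + \frac{1}{2\tau^2} \sum_t \lVert v_t \rVert_2^2\), and multiplying through by \(2\sigma^2\) gives
\[
\sum_{t=1}^{T} \lVert y_t - X_t(\beta_0 + v_t) \rVert_2^2
+ \frac{\sigma^2}{\tau^2} \sum_{t=1}^{T} \lVert v_t \rVert_2^2 ,
\]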
which is exactly the joint objective with \(\lambda = \sigma^2 / \tau^2\). The sharing strength is therefore an estimate of the noise-to-heterogeneity ratio. Small \(\tau^2\) (tasks tightly clustered around the center) sends \(\lambda \to \infty\) and recovers pooling; large \(\tau^2\) (tasks free to roam) sends \(\lambda \to 0\) and recovers independent fits. This is the formal content of the empirical observation that the best \(\lambda\) is interior whenever the tasks are related but not identical: \(\tau^2\) is finite and positive.
Note
The hierarchical-prior reading turns “tune \(\lambda\) by cross-validation” into “estimate the heterogeneity \(\tau^2\).” With many tasks one can estimate \(\sigma^2\) and \(\tau^2\) directly by marginal (empirical Bayes) likelihood and set \(\lambda = \hat\sigma^2 / \hat\tau^2\), which is the multi-task analogue of James-Stein shrinkage and avoids a cross-validation grid entirely.
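As a rough illustration of estimating the ratio without a grid, the sketch below uses a method-of-moments shortcut, not the marginal-likelihood fit described above. It estimates \(\sigma^2\) from pooled per-task OLS residuals and \(\tau^2\) from the between-task variance of the OLS coefficients minus their average sampling variance. The simulation settings and all names are assumptions chosen for illustration; the data are generated with \(\sigma^2 / \tau^2 = 4\).

```r
set.seed(9)
T_n <- 40; p <- 4; n <- 30; sigma <- 1; tau <- 0.5
tasks <- lapply(1:T_n, function(k) {
  X <- matrix(rnorm(n * p), n, p)
  b <- c(1, -1, 0.5, 0) + rnorm(p, sd = tau)
  list(X = X, y = as.numeric(X %*% b) + rnorm(n, sd = sigma))
})

# Per-task OLS coefficients, one column per task (p x T).
ols <- sapply(tasks, function(z)
  as.numeric(solve(crossprod(z$X), crossprod(z$X, z$y))))

# sigma^2: pooled residual variance of the per-task OLS fits.
rss <- sum(sapply(seq_len(T_n), function(k)
  sum((tasks[[k]]$y - tasks[[k]]$X %*% ols[, k])^2)))
sigma2_hat <- rss / (T_n * (n - p))

# tau^2: between-task variance of the OLS coefficients, minus the
# average sampling variance sigma^2 * diag((X'X)^{-1}).
between  <- mean(apply(ols, 1, var))
sampling <- sigma2_hat *
  mean(sapply(tasks, function(z) mean(diag(solve(crossprod(z$X))))))
tau2_hat <- max(between - sampling, 1e-8)   # truncate: moments can go negative

lambda_hat <- sigma2_hat / tau2_hat
cat("sigma2_hat =", round(sigma2_hat, 3),
    " tau2_hat =", round(tau2_hat, 3),
    " lambda_hat =", round(lambda_hat, 2), "\n")
```

The truncation at a small positive value matters in practice: with few tasks the moment estimate of \(\tau^2\) is noisy and can fall below zero, which this sketch reads as "pool almost completely."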
54.5.3 Shrinkage, bias, and variance in the orthonormal case
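To see the bias-variance trade-off in closed form, specialize to balanced orthonormal designs: \(n_t = n\) and \(X_t^\top X_t = n I\) for every task. Then \(A_t = (n I + \lambda I)^{-1} X_t^\top\), and by symmetry the shared center is the average of the per-task OLS solutions. Writing \(\hat\beta_t^{\text{ols}} = (X_t^\top X_t)^{-1} X_t^\top y_t\) and \(\bar\beta^{\text{ols}} = \tfrac{1}{T}\sum_t \hat\beta_t^{\text{ols}}\), the multi-task estimate becomes a convex combination
\[
\hat\beta_t = (1 - \alpha)\, \hat\beta_t^{\text{ols}} + \alpha\, \bar\beta^{\text{ols}},
\qquad
\alpha = \frac{\lambda}{n + \lambda} .
\tag{54.4}
\]
The estimator shrinks each task’s OLS solution toward the cross-task mean by a fraction \(\alpha\) that grows with \(\lambda\) and shrinks with sample size \(n\). This is the exact analogue of ridge shrinkage and of James-Stein toward a common mean. With \(\beta_t = \beta_0 + \delta_t\) for true deviations \(\delta_t\) and per-coordinate OLS variance \(\sigma^2/n\), the per-task mean squared error of Equation 54.4 decomposes (treating the mean as approximately unbiased for \(\beta_0\) when \(T\) is large) as
\[
\mathbb{E}\,\lVert \hat\beta_t - \beta_t \rVert_2^2
\approx (1 - \alpha)^2\, \frac{p\,\sigma^2}{n} + \alpha^2\, \lVert \delta_t \rVert_2^2 ,
\]
where the first term is the variance inherited from the task’s own OLS fit and the second is the squared bias from shrinking toward the center. Minimizing over \(\alpha\) gives
\[
\alpha^\star = \frac{p\,\sigma^2 / n}{p\,\sigma^2 / n + \lVert \delta_t \rVert_2^2} .
\]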
Three readings of this formula are worth stating. First, more noise or smaller samples (\(\sigma^2/n\) large) push \(\alpha^\star\) toward 1: borrow more. Second, more genuine heterogeneity (\(\lVert\delta_t\rVert_2^2\) large) pushes \(\alpha^\star\) toward 0: borrow less. Third, \(\alpha^\star\) is strictly interior whenever both terms are positive and finite, which is the algebraic reason the U-shaped curve in the demonstration has an interior minimum. The formula also recovers the failure modes: when the tasks are unrelated, \(\lVert\delta_t\rVert_2^2 \to \infty\) forces \(\alpha^\star \to 0\) and any positive \(\lambda\) is harmful, which is negative transfer in a single equation.
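A worked number makes the readings concrete. The sketch below assumes the approximate MSE takes the form \((1-\alpha)^2\, p\sigma^2/n + \alpha^2 \lVert\delta_t\rVert_2^2\), whose minimizer is \(\alpha^\star = (p\sigma^2/n) / (p\sigma^2/n + \lVert\delta_t\rVert_2^2)\), and it plugs in the settings of the demonstration later in this chapter (\(p = 5\), \(n = 25\), \(\sigma = 1\), deviation standard deviation 0.4, so \(\lVert\delta_t\rVert_2^2 \approx 5 \times 0.4^2\) on average). The function names are hypothetical.

```r
# Optimal shrinkage weight in the balanced orthonormal case.
alpha_star <- function(p, sigma2, n, delta_sq) {
  v <- p * sigma2 / n        # total OLS variance across the p coordinates
  v / (v + delta_sq)
}

# Invert alpha = lambda / (n + lambda) to get the matching penalty.
lambda_from_alpha <- function(alpha, n) alpha * n / (1 - alpha)

a   <- alpha_star(p = 5, sigma2 = 1, n = 25, delta_sq = 5 * 0.4^2)
lam <- lambda_from_alpha(a, n = 25)
cat("alpha* =", a, " implied lambda =", lam, "\n")

# Doubling the sample size lowers alpha*; doubling the heterogeneity
# (in squared norm) lowers it as well.
a_more_data  <- alpha_star(p = 5, sigma2 = 1, n = 50, delta_sq = 5 * 0.4^2)
a_more_drift <- alpha_star(p = 5, sigma2 = 1, n = 25, delta_sq = 2 * 5 * 0.4^2)
```

The demonstration uses Gaussian rather than orthonormal designs, so this value is only a rough guide to where the minimum of its U-shaped curve should sit.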
The convex-combination identity Equation 54.4 is easy to confirm numerically. The check below builds orthonormal designs so that \(X_t^\top X_t = n I\) exactly, solves the full block system, and compares the result to the shrinkage formula.
```r
set.seed(7)
T_chk <- 4; p_chk <- 3; n_chk <- 20; lam <- 5

# Orthonormal designs: X_t^T X_t = n I via scaled orthonormal columns.
mk_orth <- function() {
  Q <- qr.Q(qr(matrix(rnorm(n_chk * p_chk), n_chk, p_chk)))
  Q * sqrt(n_chk)  # columns now have squared norm n
}
Xs <- replicate(T_chk, mk_orth(), simplify = FALSE)
bt <- replicate(T_chk, rnorm(p_chk), simplify = FALSE)
ys <- Map(function(X, b) as.numeric(X %*% b) + rnorm(n_chk), Xs, bt)

# Block solve (same construction as fit_mtl).
N <- T_chk * n_chk
Z <- matrix(0, N, p_chk * (T_chk + 1)); yv <- numeric(N); r0 <- 0
for (t in 1:T_chk) {
  rows <- (r0 + 1):(r0 + n_chk)
  Z[rows, 1:p_chk] <- Xs[[t]]
  Z[rows, (t * p_chk + 1):((t + 1) * p_chk)] <- Xs[[t]]
  yv[rows] <- ys[[t]]; r0 <- r0 + n_chk
}
pen <- c(rep(0, p_chk), rep(lam, p_chk * T_chk))
theta <- solve(crossprod(Z) + diag(pen), crossprod(Z, yv))
b0 <- theta[1:p_chk]
beta_block <- lapply(1:T_chk, function(t) {
  b0 + theta[(t * p_chk + 1):((t + 1) * p_chk)]
})

# Shrinkage formula: convex combo of per-task OLS and their mean.
ols <- Map(function(X, y) solve(crossprod(X), crossprod(X, y)), Xs, ys)
obar <- Reduce(`+`, ols) / T_chk
alpha <- lam / (n_chk + lam)
beta_form <- lapply(ols, function(b) (1 - alpha) * b + alpha * obar)

max_abs_diff <- max(abs(unlist(beta_block) - unlist(beta_form)))
cat("max |block - shrinkage formula| =", signif(max_abs_diff, 3), "\n")
#> max |block - shrinkage formula| = 1.33e-15
```
The discrepancy is at the level of floating-point error, confirming that in the orthonormal case the block estimator is exactly the shrinkage of each task’s OLS toward the cross-task mean with weight \(\alpha = \lambda / (n + \lambda)\).
54.6 Runnable demonstration
We simulate \(T\) related regression tasks. Each task’s true coefficients equal a shared vector plus a small random deviation, so the tasks are related but not identical. Each task gets only a modest sample, the situation where borrowing strength should help. We compare three estimators: independent OLS per task, a single pooled model, and the joint multi-task estimator across a grid of \(\lambda\). We evaluate by out-of-sample mean squared error against the known truth.
```r
set.seed(2026)
T_tasks <- 6    # number of tasks
p <- 5          # number of predictors per task
n_train <- 25   # small training sample per task
n_test <- 500   # large test sample per task

# Shared coefficient center plus small task-specific deviations.
beta0_true <- c(1.5, -2.0, 0.0, 1.0, -0.5)
sigma_dev <- 0.4  # how far tasks drift from the shared center
sigma_eps <- 1.0  # noise standard deviation

# True per-task coefficients: beta_t = beta0 + deviation_t
beta_true <- lapply(1:T_tasks, function(t) {
  beta0_true + rnorm(p, mean = 0, sd = sigma_dev)
})

make_task <- function(t, n) {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  y <- as.numeric(X %*% beta_true[[t]]) + rnorm(n, sd = sigma_eps)
  list(X = X, y = y)
}

train <- lapply(1:T_tasks, function(t) make_task(t, n_train))
test <- lapply(1:T_tasks, function(t) make_task(t, n_test))
```
Independent OLS, fit one task at a time.
```r
fit_independent <- function(task) {
  # No intercept: data are centered around zero by construction.
  solve(crossprod(task$X), crossprod(task$X, task$y))
}
beta_ind <- lapply(train, fit_independent)
```
The pooled model stacks all tasks and fits a single coefficient vector, the \(\lambda \to \infty\) limit of the joint objective.
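The pooled fit itself does not appear in this text. A minimal sketch, assuming the same task layout as `train` (a list of tasks, each holding a design matrix `X` and a response `y`), stacks the rows and solves one least-squares problem. The helper name `fit_pooled` and the toy data used to exercise it are assumptions made for illustration.

```r
# Hypothetical helper: stack every task's rows and fit one coefficient vector.
fit_pooled <- function(tasks) {
  X_all <- do.call(rbind, lapply(tasks, function(z) z$X))
  y_all <- unlist(lapply(tasks, function(z) z$y))
  as.numeric(solve(crossprod(X_all), crossprod(X_all, y_all)))
}

# Toy check: two noise-free tasks that share the same coefficients,
# so the pooled fit should recover them.
set.seed(1)
b_toy <- c(2, -1)
toy <- lapply(1:2, function(k) {
  X <- matrix(rnorm(20), nrow = 10, ncol = 2)
  list(X = X, y = as.numeric(X %*% b_toy))
})
beta_pool_toy <- fit_pooled(toy)
print(round(beta_pool_toy, 6))
```

On the chapter's simulated data the corresponding call would be `fit_pooled(train)`, with the single resulting vector used for every task.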
Now the joint multi-task estimator. We build the block matrix \(Z\) and the penalty \(P\) described in the derivation, then solve the ridge-like system.
```r
fit_mtl <- function(train, lambda, p, T_tasks) {
  n_t <- sapply(train, function(z) length(z$y))
  N <- sum(n_t)

  # Z has p shared columns followed by p columns per task.
  Z <- matrix(0, nrow = N, ncol = p * (T_tasks + 1))
  y <- numeric(N)
  row0 <- 0
  for (t in 1:T_tasks) {
    rows <- (row0 + 1):(row0 + n_t[t])
    Z[rows, 1:p] <- train[[t]]$X        # shared block beta0
    cols <- (t * p + 1):((t + 1) * p)   # deviation block v_t
    Z[rows, cols] <- train[[t]]$X
    y[rows] <- train[[t]]$y
    row0 <- row0 + n_t[t]
  }

  # Penalty: 0 on the shared block, lambda on every deviation block.
  pen <- c(rep(0, p), rep(lambda, p * T_tasks))
  P <- diag(pen)
  theta <- solve(crossprod(Z) + P, crossprod(Z, y))

  beta0 <- theta[1:p]
  betas <- lapply(1:T_tasks, function(t) {
    beta0 + theta[(t * p + 1):((t + 1) * p)]
  })
  list(beta0 = beta0, betas = betas)
}
```
A common test-MSE helper, evaluated against held-out data per task.
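The helper and the comparison loop do not appear in this text. The sketch below is one way to fill the gap, under stated assumptions. It repeats the simulation settings and a compact version of the block solve so that it runs on its own, and it assumes a log-spaced grid from 1e-3 to 1e4, which may differ from the grid behind the original table. The names `task_mse`, `mean_mse`, and `fit_block` are hypothetical; `lambda_grid`, `mse_mtl`, `mse_ind`, `mse_pool`, `best_lambda`, and `best_mse` are chosen to match what the plotting code expects.

```r
# Same simulation settings and seed as the demonstration.
set.seed(2026)
T_tasks <- 6; p <- 5; n_train <- 25; n_test <- 500
beta0_true <- c(1.5, -2.0, 0.0, 1.0, -0.5)
sigma_dev <- 0.4; sigma_eps <- 1.0
beta_true <- lapply(1:T_tasks, function(t) beta0_true + rnorm(p, sd = sigma_dev))
make_task <- function(t, n) {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  list(X = X, y = as.numeric(X %*% beta_true[[t]]) + rnorm(n, sd = sigma_eps))
}
train <- lapply(1:T_tasks, function(t) make_task(t, n_train))
test  <- lapply(1:T_tasks, function(t) make_task(t, n_test))

# Compact block solve: returns one coefficient vector per task.
fit_block <- function(train, lambda) {
  p <- ncol(train[[1]]$X); T_n <- length(train)
  Z <- do.call(rbind, lapply(1:T_n, function(t) {
    Zt <- matrix(0, nrow(train[[t]]$X), p * (T_n + 1))
    Zt[, 1:p] <- train[[t]]$X
    Zt[, (t * p + 1):((t + 1) * p)] <- train[[t]]$X
    Zt
  }))
  y <- unlist(lapply(train, function(z) z$y))
  pen <- c(rep(0, p), rep(lambda, p * T_n))
  theta <- as.numeric(solve(crossprod(Z) + diag(pen), crossprod(Z, y)))
  lapply(1:T_n, function(t) theta[1:p] + theta[(t * p + 1):((t + 1) * p)])
}

# Test MSE of one coefficient vector on one held-out task.
task_mse <- function(beta, task) mean((task$y - as.numeric(task$X %*% beta))^2)
# Mean test MSE across tasks for a list of per-task coefficient vectors.
mean_mse <- function(betas, test) mean(mapply(task_mse, betas, test))

# Baselines: independent OLS per task, and one pooled coefficient vector.
beta_ind <- lapply(train, function(z) solve(crossprod(z$X), crossprod(z$X, z$y)))
X_all <- do.call(rbind, lapply(train, function(z) z$X))
y_all <- unlist(lapply(train, function(z) z$y))
beta_pool <- solve(crossprod(X_all), crossprod(X_all, y_all))
mse_ind  <- mean_mse(beta_ind, test)
mse_pool <- mean_mse(rep(list(beta_pool), T_tasks), test)

# Multi-task fit across an assumed grid of sharing strengths.
lambda_grid <- 10^seq(-3, 4, length.out = 30)
mse_mtl <- sapply(lambda_grid, function(l) mean_mse(fit_block(train, l), test))
best_lambda <- lambda_grid[which.min(mse_mtl)]
best_mse <- min(mse_mtl)

print(data.frame(
  method = c("Independent OLS", "Pooled", "Multi-task (best lambda)"),
  test_mse = round(c(mse_ind, mse_pool, best_mse), 4)
))
cat("best lambda:", signif(best_lambda, 3), "\n")
```

Because the best \(\lambda\) is picked on the test sets here, the multi-task number is slightly optimistic. That is acceptable for a simulation with known truth, but with real data the grid search belongs on a validation split.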
Reading the printed table: the multi-task estimator at its best \(\lambda\) has the lowest mean test MSE, beating both independent OLS (too noisy, because each task fits only 25 points) and pooling (too biased, because the tasks are not identical). The reported best \(\lambda\) is an interior value, neither near zero nor near infinity, which is the signature of genuine borrowing.
Figure 54.1 shows how multi-task test error varies with \(\lambda\), with the independent and pooled baselines as horizontal references. The U shape is the bias-variance trade-off across the sharing strength: too little sharing behaves like the noisy independent fits, too much sharing behaves like the biased pooled fit, and the minimum sits in between.
```r
library(ggplot2)

df <- data.frame(lambda = lambda_grid, mse = mse_mtl)

ggplot(df, aes(x = lambda, y = mse)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue") +
  geom_hline(yintercept = mse_ind, linetype = "dashed", color = "firebrick") +
  geom_hline(yintercept = mse_pool, linetype = "dotted", color = "darkgreen") +
  annotate("point", x = best_lambda, y = best_mse, color = "black", size = 3) +
  scale_x_log10() +
  annotate("text", x = min(lambda_grid), y = mse_ind, label = "Independent OLS",
           color = "firebrick", hjust = 0, vjust = -0.6, size = 3.3) +
  annotate("text", x = min(lambda_grid), y = mse_pool, label = "Pooled",
           color = "darkgreen", hjust = 0, vjust = -0.6, size = 3.3) +
  labs(x = expression(lambda ~ "(log scale)"),
       y = "Mean test MSE across tasks",
       title = "Borrowing strength across related tasks") +
  theme_minimal(base_size = 12)
```
Figure 54.1: Multi-task test MSE versus sharing strength, against independent and pooled baselines.
With small per-task samples and genuinely related tasks, the multi-task estimator at its best \(\lambda\) should beat both baselines: it avoids the high variance of independent OLS and the bias of pooling. If you increase n_train substantially, the independent fits improve and the gap shrinks, because each task no longer needs to borrow. If you increase sigma_dev so the tasks drift far apart, pooling and large-\(\lambda\) multi-task degrade, because both force the tasks toward a single coefficient vector. That is negative transfer, discussed next.
Tip
Try editing n_train, sigma_dev, and sigma_eps and rerunning. Watching the U-shaped curve flatten, deepen, or invert builds far more intuition than any single number in the table. The whole point of a runnable demonstration is that the trade-off is yours to probe.
54.7 Negative transfer
Sharing helps only when tasks are actually related. When they are not, forcing them to share hurts, and the shared model does worse than independent models. This failure is called negative transfer, surveyed in Pan and Yang (2010) and Zhang and Yang (2021).
Warning
Negative transfer is not a rare edge case; it is the default outcome when you assume relatedness without checking it. The discipline that protects you is simple: always keep an independent-model baseline and refuse to ship a shared model that loses to it on a clean target holdout.
You can see it directly in the demonstration. The pooled model is the \(\lambda \to \infty\) limit, and when sigma_dev is large the pooled MSE rises above the independent MSE: the tasks are too different to share a single coefficient vector. The remedy is to let \(\lambda\) be chosen by validation rather than fixed, so the data decide how much sharing is appropriate, and to let it go small when sharing does not pay.
Practical signals of negative transfer: a multi-task model whose per-task metrics are worse than separately trained models; one dominant task whose gradients swamp the others; or source and target distributions that look unrelated under a two-sample test. The defenses are task weighting (next section), grouping only tasks that are similar, and architectures that share less (soft sharing, or task-specific layers). Closely related is ensemble learning (Chapter 57), where combining diverse models is preferred to forcing one model to serve mismatched objectives.
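The first signal, a shared model losing to per-task models, is easy to reproduce. The sketch below is a self-contained variant of the chapter's simulation, not the original code: the function name `compare_baselines`, its seed, and the two `sigma_dev` values are illustrative assumptions. It compares only the two extremes, independent OLS and the pooled fit.

```r
compare_baselines <- function(sigma_dev, T_n = 6, p = 5,
                              n_train = 25, n_test = 500, seed = 1) {
  set.seed(seed)
  beta0 <- c(1.5, -2.0, 0.0, 1.0, -0.5)
  betas <- lapply(1:T_n, function(k) beta0 + rnorm(p, sd = sigma_dev))
  draw <- function(b, n) {
    X <- matrix(rnorm(n * p), n, p)
    list(X = X, y = as.numeric(X %*% b) + rnorm(n))
  }
  train <- lapply(betas, draw, n = n_train)
  test  <- lapply(betas, draw, n = n_test)

  ols <- function(X, y) as.numeric(solve(crossprod(X), crossprod(X, y)))
  b_ind  <- lapply(train, function(z) ols(z$X, z$y))
  b_pool <- ols(do.call(rbind, lapply(train, `[[`, "X")),
                unlist(lapply(train, `[[`, "y")))

  mse <- function(b, z) mean((z$y - as.numeric(z$X %*% b))^2)
  c(independent = mean(mapply(mse, b_ind, test)),
    pooled      = mean(sapply(test, function(z) mse(b_pool, z))))
}

related   <- compare_baselines(sigma_dev = 0.1)  # tasks nearly identical
unrelated <- compare_baselines(sigma_dev = 3.0)  # tasks far apart
print(round(rbind(related, unrelated), 3))
```

With tasks far apart, the pooled error should sit well above the independent error, which is negative transfer at its most extreme. With nearly identical tasks, pooling typically wins because it averages away noise.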
54.8 Task weighting
When tasks are trained jointly the total loss is a weighted sum,
\[
\mathcal{L}_{\text{total}} = \sum_{t=1}^{T} w_t\, \mathcal{L}_t ,
\]
and the weights \(w_t\) matter.[2] If one task has a much larger loss scale or many more examples, it dominates the gradient and the others are under-fit. The simplest fix is to normalize each task’s loss to a comparable scale. Beyond that, two learned schemes are common.
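Uncertainty weighting (Kendall, Gal, and Cipolla (2018)) treats each task’s noise level as a parameter \(\sigma_t\) and derives weights from a Gaussian likelihood. For regression tasks the objective becomes
\[
\mathcal{L}(\theta, \sigma_1, \dots, \sigma_T)
= \sum_{t=1}^{T} \Big( \frac{1}{2\sigma_t^2}\, \mathcal{L}_t(\theta) + \log \sigma_t \Big),
\]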
so a task with high uncertainty gets a small weight \(1 / (2\sigma_t^2)\), and the \(\log \sigma_t\) term prevents the weights from collapsing to zero. The \(\sigma_t\) are learned by gradient descent alongside the model.
To see where this objective comes from, model each regression task as Gaussian with its own observation noise, \(y_t \sim \mathcal{N}(f_t(x), \sigma_t^2)\). The negative log likelihood of one task, dropping the constant \(\frac12\log 2\pi\), is
\[
-\log p\big(y_t \mid f_t(x), \sigma_t\big)
= \frac{1}{2\sigma_t^2}\, \lVert y_t - f_t(x) \rVert_2^2 + \log \sigma_t ,
\]
and summing over tasks (which assumes conditional independence across tasks given the shared representation) gives exactly the displayed objective with \(\mathcal{L}_t = \lVert y_t - f_t(x)\rVert_2^2\). The weighting is therefore not a heuristic; it is maximum likelihood for a heteroscedastic multi-task Gaussian model in which each task is allowed its own noise scale. Profiling out \(\sigma_t\) makes the mechanism explicit: setting \(\partial \mathcal{L} / \partial \sigma_t = 0\) gives
\[
-\frac{\mathcal{L}_t}{\sigma_t^3} + \frac{1}{\sigma_t} = 0
\quad\Longrightarrow\quad
\hat\sigma_t^2 = \mathcal{L}_t ,
\]
so at the optimum the learned noise variance equals the task’s own residual loss, and back-substituting yields a profiled objective \(\sum_t \big(\tfrac12 + \tfrac12\log \mathcal{L}_t\big)\), that is, a sum of log losses. The practical effect is automatic, scale-free balancing: a task with a large irreducible loss is downweighted in proportion to that loss, so no task dominates merely because it is measured in larger units or is intrinsically noisier. (In practice one optimizes \(s_t = \log \sigma_t^2\) rather than \(\sigma_t\) for numerical stability and to keep the variance positive.)
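The profiling argument can be checked numerically for fixed task losses. The sketch below assumes the per-task objective \(\mathcal{L}_t / (2\sigma_t^2) + \log \sigma_t\), written in terms of \(s_t = \log \sigma_t^2\), and minimizes it over \(s\) with base R's `optim`. The three loss values and the names `uw_objective` and `uw_gradient` are illustrative; in a real model the losses move with the network parameters during training.

```r
# Uncertainty-weighted objective for fixed task losses L_t,
# parameterized by s_t = log(sigma_t^2): L_t / (2 sigma_t^2) + log(sigma_t).
uw_objective <- function(s, L) sum(0.5 * exp(-s) * L + 0.5 * s)
uw_gradient  <- function(s, L) -0.5 * exp(-s) * L + 0.5

L <- c(0.5, 4, 100)   # task losses on very different scales
fit <- optim(par = rep(0, length(L)), fn = uw_objective, gr = uw_gradient,
             L = L, method = "BFGS")

sigma2_hat <- exp(fit$par)           # learned noise variances
task_w     <- 1 / (2 * sigma2_hat)   # implied task weights
profiled   <- sum(0.5 + 0.5 * log(L))

print(round(rbind(loss = L, sigma2_hat = sigma2_hat,
                  weighted_loss = task_w * L), 3))
cat("objective at optimum:", round(fit$value, 4),
    " profiled formula:", round(profiled, 4), "\n")
```

At the optimum each weighted loss \(\mathcal{L}_t / (2\hat\sigma_t^2)\) should be close to one half regardless of the raw scale, which is the scale-free balancing described above.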
Gradient normalization (Chen et al. (2018)) instead adjusts the weights so that the tasks train at similar rates, by balancing the magnitudes of their gradients on the shared parameters. Both methods aim at the same goal: keep any single task from dominating, which is one of the main causes of negative transfer.
54.9 Practical guidance and pitfalls
When to reach for these methods:
The target task has little labeled data but a related, data-rich source exists. Transfer learning is the default starting point.
You must predict several related outcomes from the same inputs. A multi-task model can beat separate models and is cheaper to serve.
Your training and serving input distributions differ. Domain adaptation, through importance weighting or representation alignment, is the repair.
Pitfalls to watch:
Assuming relatedness. Verify it. Check whether sharing actually improves held-out per-task metrics versus independent baselines. If it does not, you have negative transfer.
Data leakage during transfer. If you select the source model or tune \(\lambda\) using the target test set, your reported gains are optimistic. Keep a clean target holdout.
Distribution shift in the source. A pretrained backbone encodes its source distribution. If the target inputs differ sharply (different sensors, languages, time periods), early-layer features may not transfer.
Catastrophic forgetting during fine-tuning. Too high a learning rate erases the pretrained structure. Use small learning rates and gradual unfreezing.
Imbalanced tasks. Normalize loss scales and consider learned task weighting so that large or noisy tasks do not dominate.
Tuning the sharing strength on the wrong signal. In the linear model, pick \(\lambda\) by cross-validation on the target, not by training error, which always prefers small \(\lambda\).
A reasonable default recipe: start with feature extraction or a moderately penalized multi-task model, establish an independent-model baseline, tune the sharing strength on validation data, and only move to full fine-tuning or strong sharing if the data are abundant and the relatedness is confirmed.
Key idea
Every method in this chapter is a knob on a single dial, from fully independent models to a single shared model. Transfer, multi-task, domain adaptation, and task weighting are all ways of choosing, and defending, where to set that dial. Let validation data, not optimism, make the choice.
54.10 Further reading
Caruana (1997), the foundational treatment of multi-task learning as inductive transfer.
Evgeniou and Pontil (2004), regularized multi-task learning, the formulation used in this chapter’s demonstration.
Pan and Yang (2010), a broad survey of transfer learning.
Yosinski, Clune, Bengio, and Lipson (2014), on how transferable features are across layers of deep networks.
Ganin et al. (2016), domain-adversarial training for domain adaptation.
Ruder (2017), an overview of multi-task learning in deep networks, including hard and soft parameter sharing.
Kendall, Gal, and Cipolla (2018), uncertainty-based task weighting.
Chen et al. (2018), GradNorm for gradient-based task balancing.
Howard and Ruder (2018), discriminative fine-tuning and gradual unfreezing.
Zhang and Yang (2021), a recent survey of multi-task learning methods.
Chen, Zhao, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. 2018. “GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks.” In International Conference on Machine Learning (ICML).
Evgeniou, Theodoros, and Massimiliano Pontil. 2004. “Regularized Multi-Task Learning.” In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
Ganin, Yaroslav, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. “Domain-Adversarial Training of Neural Networks.” Journal of Machine Learning Research 17 (59): 1–35.
Howard, Jeremy, and Sebastian Ruder. 2018. “Universal Language Model Fine-Tuning for Text Classification.” In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Kendall, Alex, Yarin Gal, and Roberto Cipolla. 2018. “Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Pan, Sinno Jialin, and Qiang Yang. 2010. “A Survey on Transfer Learning.” IEEE Transactions on Knowledge and Data Engineering 22 (10): 1345–59.
Ruder, Sebastian. 2017. “An Overview of Multi-Task Learning in Deep Neural Networks.” arXiv Preprint arXiv:1706.05098.
Sugiyama, Masashi, Matthias Krauledat, and Klaus-Robert Müller. 2007. “Covariate Shift Adaptation by Importance Weighted Cross Validation.” Journal of Machine Learning Research 8: 985–1005.
Yosinski, Jason, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. “How Transferable Are Features in Deep Neural Networks?” In Advances in Neural Information Processing Systems (NeurIPS).
Zhang, Yu, and Qiang Yang. 2021. “A Survey on Multi-Task Learning.” IEEE Transactions on Knowledge and Data Engineering 34 (12): 5586–5609.
A head is the small task-specific output layer that sits on top of a shared backbone or body. The backbone turns raw inputs into features; the head maps features to that task’s predictions.
Setting every \(w_t = 1\) looks neutral but is not: it implicitly weights each task by its raw loss scale, so a task measured in larger units silently dominates.
# Transfer and Multi-Task Learning {#sec-transfer-multitask-learning}```{r}#| include: falsesource("_common.R")``````{r setup-tml, include=FALSE}knitr::opts_chunk$set(echo =TRUE)```Most of the models in this book are trained on one dataset to solve oneprediction problem. Transfer learning and multi-task learning relax thatassumption. Transfer learning reuses knowledge learned on a source problemto help a related target problem, usually one with less labeled data.Multi-task learning trains a single model on several related tasks at onceso that the tasks share statistical strength. Both ideas rest on the samebet: if two problems are related, then the data for one carries informationabout the other, and a model that respects that relatedness can beat modelstrained in isolation.This chapter explains where these methods fit in a modern workflow, derivesthe multi-task objective for the linear case, and gives a runnable base-Rdemonstration that compares a joint multi-task estimator to independentper-task models on simulated related tasks. By the end you should be able tosay, for a given problem, whether to train tasks separately, pool them intoone, or tie them together with a tunable penalty, and to recognize whensharing backfires.::: {.callout-tip title="Intuition"}Think of two students studying for related exams. If theycompare notes, each learns faster than working alone, as long as thesubjects truly overlap. If the subjects are unrelated, swapping notes justadds confusion. Transfer and multi-task learning automate that judgment:share when sharing pays, and back off when it does not.:::## Intuition and workflow contextA data scientist rarely starts from zero. A churn model for a new productline can borrow from a churn model on an established line. A sentimentclassifier for a niche domain can start from a general language model(@sec-llms). Ademand forecast for a store that opened last month can borrow from storesthat have years of history. 
In each case the target task has little data buta related source task has a lot, and we want to move usable structure fromsource to target.It helps to fix notation. Let a task be indexed by $t = 1, \dots, T$. Task$t$ has data $\{(x_{ti}, y_{ti})\}_{i=1}^{n_t}$ drawn from a distribution$P_t(x, y)$. In standard supervised learning we fit one model per task byminimizing its own loss. Transfer and multi-task learning instead assume the$P_t$ share structure, for example a common feature representation orparameter vectors that are close to one another, and they exploit thatshared structure during fitting.Two distinctions organize the field.The first is the direction of reuse. *Transfer learning* is asymmetric: asource model is trained first, then adapted to a target. *Multi-tasklearning* is symmetric: all tasks are trained together and each is allowedto help the others.The second is what gets shared. In deep models the dominant transfer patternis to share a learned representation (the lower layers of a neural network, see@sec-neural-networks) and keeptask-specific heads on top.^[A *head* is the small task-specific output layerthat sits on top of a shared *backbone* or *body*. The backbone turns rawinputs into features; the head maps features to that task's predictions.] Inclassical models the shared object is more often the coefficient vector or alow-dimensional subspace that the coefficients live in.::: {.callout-important title="Key idea"}Transfer learning and multi-task learning are not separatealgorithms so much as a stance: assume related problems share structure,then design the model so that structure can be reused. The rest of thischapter is about *what* to share and *how strongly*.:::In a modern ML/AI workflow these methods show up at predictable points. Whenyou pull a pretrained image or text encoder and adapt it, that is transferlearning. When you train one model to predict several related businessoutcomes from the same features, that is multi-task learning. 
When yourtraining distribution differs from your serving distribution, the repair isdomain adaptation, a special case of transfer where the label space is fixedbut the input distribution shifts.## Feature extraction versus fine-tuningThe previous section set up the two big choices, direction of reuse and whatgets shared. We now zoom in on the most common modern case: the shared objectis a pretrained representation, and you want to adapt it to your target task.There are two ways to do that.*Feature extraction* freezes the pretrained network and uses its output (oran intermediate layer) as a fixed feature vector. You train only a small newhead, for example a logistic regression or a shallow dense layer, on thosefeatures. This is cheap, needs little target data, and cannot overfit thebackbone because the backbone does not move.*Fine-tuning* unfreezes some or all of the pretrained weights and continuestraining them on the target task, usually with a smaller learning rate so thepretrained structure is not destroyed. This is more expensive and needs moretarget data, but it lets the representation itself adapt to the target.The choice is governed by how much target data you have and how similar thesource and target are. @tbl-transfer-multitask-learning-adapt-guidancesummarizes the standard guidance, following @Yosinski_2014 and the practicaladvice in @Howard_2018.| Target data | Source vs target similarity | Recommended approach ||---|---|---|| Small | Similar | Feature extraction, train head only || Small | Different | Feature extraction from earlier layers || Large | Similar | Fine-tune the whole network || Large | Different | Fine-tune, or train from scratch |: Recommended adaptation strategy as a function of target data size and source-target similarity. 
{#tbl-transfer-multitask-learning-adapt-guidance}::: {.callout-tip title="When to use this"}With only a few hundred labeled target examples,default to feature extraction: it has few parameters to estimate and cannotdamage the backbone. Reach for fine-tuning only when target data areplentiful enough to move the backbone safely without overfitting.:::A useful refinement is *gradual unfreezing*: start by training the head withthe backbone frozen, then unfreeze the top blocks, then deeper blocks, eachtime with a lower learning rate. This protects the most general (lowest)layers, which tend to transfer across tasks, while letting the most specific(highest) layers adapt.The following Keras sketch shows both patterns on an image backbone. It isnot run here because it needs a GPU and downloaded weights, but it is correctidiomatic code.```{r keras-finetune, eval=FALSE}library(keras)# Pretrained backbone without its classification headbase <-application_efficientnet_b0(include_top =FALSE,weights ="imagenet",input_shape =c(224, 224, 3),pooling ="avg")# --- Pattern 1: feature extraction (backbone frozen) ---freeze_weights(base)model <-keras_model_sequential() %>%base() %>%layer_dropout(0.2) %>%layer_dense(units =10, activation ="softmax")model %>%compile(optimizer =optimizer_adam(learning_rate =1e-3),loss ="categorical_crossentropy",metrics ="accuracy")# model %>% fit(train_ds, epochs = 5, validation_data = val_ds)# --- Pattern 2: fine-tuning (unfreeze top of backbone) ---unfreeze_weights(base, from ="block6a_expand_conv")model %>%compile(optimizer =optimizer_adam(learning_rate =1e-5), # small LRloss ="categorical_crossentropy",metrics ="accuracy")# model %>% fit(train_ds, epochs = 10, validation_data = val_ds)```## Domain adaptationFeature extraction and fine-tuning assume the source and target solvedifferent tasks. The next pattern is the opposite: the task stays the same butthe inputs come from a different distribution. 
Domain adaptation handles thecase where the task is fixed but the input distribution moves. Let the source domain be $P_S(x, y)$ and the targetdomain $P_T(x, y)$. *Covariate shift* is the assumption that the conditional$P(y \mid x)$ is the same in both domains while the marginal $P(x)$ differs,so $P_S(x) \neq P_T(x)$ but $P_S(y \mid x) = P_T(y \mid x)$.Under covariate shift the right correction is importance weighting. Thetarget risk can be written as an expectation over the source distributionreweighted by the density ratio $w(x) = P_T(x) / P_S(x)$:$$\mathbb{E}_{P_T}[\ell(f(x), y)]= \mathbb{E}_{P_S}\!\left[ \frac{P_T(x)}{P_S(x)}\, \ell(f(x), y) \right].$$::: {.callout-tip title="Intuition"}Importance weighting says: trust source points that looklike target points, and discount source points that do not. A sourceexample sitting in a region the target rarely visits gets a small weight, sothe fitted model concentrates on the part of input space you actually careabout at serving time.:::So you fit the model on source data but weight each source point by howlikely it is under the target marginal. Estimating $w(x)$ directly bydensity estimation is hard, so in practice you train a classifier todistinguish source from target inputs and turn its probability into theweight, as in @Sugiyama_2007.### Why the identity holds, and what it costsThe reweighting identity is a one-line change of measure, valid whenever$P_S(x) > 0$ wherever $P_T(x) > 0$ (the support condition, without which theratio is undefined and importance weighting cannot work):$$\mathbb{E}_{P_T}[\ell(f(x), y)]= \int \ell(f(x), y)\, P_T(x)\, P_T(y \mid x)\, dx\, dy= \int \ell(f(x), y)\, \frac{P_T(x)}{P_S(x)}\, P_S(x)\, P_S(y \mid x)\, dx\, dy ,$$where the second equality multiplies and divides by $P_S(x)$ and uses$P_T(y\mid x) = P_S(y\mid x)$, the covariate-shift assumption. 
The right-handside is $\mathbb{E}_{P_S}[w(x)\,\ell(f(x),y)]$, so the weighted empirical risk$\hat R_w(f) = \frac{1}{n_S}\sum_{i=1}^{n_S} w(x_i)\,\ell(f(x_i), y_i)$ is anunbiased and (under standard conditions) consistent estimator of the targetrisk. The crucial caveat is that the conditional $P(y\mid x)$ must genuinely beinvariant. If the label rule itself shifts (concept drift), reweighting theinputs corrects the wrong thing and can make matters worse.Unbiasedness is not free. The weighted estimator has variance inflated by thespread of the weights. Its effective sample size is$$n_{\text{eff}}= \frac{\left(\sum_i w(x_i)\right)^2}{\sum_i w(x_i)^2}= \frac{n_S}{1 + \widehat{\operatorname{CV}}^2(w)} ,$$ {#eq-transfer-multitask-learning-neff}where $\widehat{\operatorname{CV}}^2(w)$ is the squared coefficient of variationof the weights. When source and target overlap poorly, a few points carryalmost all the weight, $\operatorname{CV}^2(w)$ explodes, and $n_{\text{eff}}$collapses far below $n_S$: the correction is unbiased but useless. This is whypractitioners clip or self-normalize the weights, trading a little bias for alarge reduction in variance, and why @eq-transfer-multitask-learning-neff isthe right diagnostic to monitor before trusting an importance-weighted fit. When representations are learned, analternative is to align the source and target feature distributions duringtraining, for example by matching their means and covariances or by anadversarial domain classifier, as in @Ganin_2016.## Parameter sharing: hard and softDomain adaptation reused a representation across input distributions. We nowturn to the symmetric, multi-task side: training several tasks together. Whentasks are trained together, the architecture decides how parameters areshared. There are two canonical schemes, described in @Ruder_2017.*Hard parameter sharing* uses one shared body for all tasks and a smalltask-specific head per task. 
The shared body is forced to learn arepresentation that serves every task. This is the most common multi-tasksetup, it is parameter-efficient, and the shared body acts as a strongregularizer because it must fit all tasks at once.*Soft parameter sharing* gives each task its own full set of parameters butadds a penalty that keeps the parameters of different tasks close. Nothing isliterally shared; the coupling is through the penalty. This is more flexiblewhen tasks are only loosely related, at the cost of more parameters.::: {.callout-note}Hard sharing forces tasks to use exactly the same body, so thecoupling is total and not tunable. Soft sharing keeps separate parametersand tunes the coupling through a penalty, so you can dial sharing from noneto near-total. That tunability is what makes the soft case the right one tostudy in detail.:::The base-R demonstration below is an instance of soft sharing in its purestlinear form. Each task has its own coefficient vector, and a penalty pullsthe vectors toward a common center.## Multi-task linear model: derivationConsider $T$ linear regression tasks. Task $t$ has design matrix$X_t \in \mathbb{R}^{n_t \times p}$, response $y_t \in \mathbb{R}^{n_t}$, andcoefficient vector $\beta_t \in \mathbb{R}^p$. Independent ordinary leastsquares solves $T$ separate problems:$$\hat\beta_t^{\,\text{ind}}= \arg\min_{\beta_t} \; \lVert y_t - X_t \beta_t \rVert_2^2,\qquad t = 1, \dots, T.$$When the tasks are related, their true coefficient vectors are close to ashared center. We model this with a decomposition $\beta_t = \beta_0 + v_t$,where $\beta_0$ is a common vector shared by all tasks and $v_t$ is a smalltask-specific deviation. We then penalize the deviations.::: {.callout-tip title="Intuition"}The split $\beta_t = \beta_0 + v_t$ says each task is theshared answer plus a small personal correction. 
Penalizing only $v_t$ keepsthe corrections small unless the data really demand them, which is exactlythe "borrow by default, deviate when justified" behavior we want.:::The joint objective is$$\min_{\beta_0,\, v_1, \dots, v_T}\; \sum_{t=1}^{T} \lVert y_t - X_t (\beta_0 + v_t) \rVert_2^2\;+\; \lambda \sum_{t=1}^{T} \lVert v_t \rVert_2^2 .$$The penalty $\lambda$ controls how much the tasks are tied together. As$\lambda \to 0$ each task is free and we recover independent fits (each taskabsorbs everything into its own $v_t$). As $\lambda \to \infty$ alldeviations are forced to zero, $\beta_t = \beta_0$ for every task, and werecover a single pooled model fit on the stacked data. Intermediate$\lambda$ interpolates between these two extremes, which is exactly themulti-task regime: borrow strength across tasks without forcing them to beidentical. This is the regularized multi-task formulation of@Evgeniou_2004.The objective is a single quadratic in the stacked parameter vector$\theta = (\beta_0, v_1, \dots, v_T)$, so it has a closed form. Build a blockdesign matrix $Z$ where the columns for $\beta_0$ repeat $X_t$ across alltask rows and the columns for $v_t$ contain $X_t$ only on task $t$'s rows.Stack all responses into $y$. With a diagonal penalty matrix $P$ thatpenalizes only the $v_t$ blocks (entries $\lambda$) and not $\beta_0$(entries $0$), the solution is ridge-like:$$\hat\theta = (Z^\top Z + P)^{-1} Z^\top y .$$::: {.callout-note}This is just ridge regression on a cleverly laid-out designmatrix. The only twist is that the penalty matrix $P$ leaves the sharedblock $\beta_0$ unpenalized and applies $\lambda$ only to the deviationblocks $v_t$. Once you see that, the multi-task estimator is no harder tocompute than any other ridge fit.:::Each task estimate is then $\hat\beta_t = \hat\beta_0 + \hat v_t$. The nextsection builds exactly this $Z$ and $P$ in base R.### Deriving the closed form from the normal equationsThe block formulation above hides the structure. 
It is worth deriving thesolution directly, because the stationarity conditions expose exactly how theshared center and the deviations are coupled. Write the objective as$$F(\beta_0, v_1, \dots, v_T)= \sum_{t=1}^{T} \lVert y_t - X_t \beta_0 - X_t v_t \rVert_2^2\;+\; \lambda \sum_{t=1}^{T} \lVert v_t \rVert_2^2 .$$Differentiate with respect to each block and set the gradient to zero. For afixed task $t$,$$\frac{\partial F}{\partial v_t}= -2 X_t^\top (y_t - X_t \beta_0 - X_t v_t) + 2 \lambda v_t = 0 ,$$which rearranges to the per-task stationarity condition$$(X_t^\top X_t + \lambda I)\, v_t = X_t^\top (y_t - X_t \beta_0).$$ {#eq-transfer-multitask-learning-vt-stationary}So given the shared center, each deviation is a ridge fit to that task'sresidual,$$\hat v_t(\beta_0) = (X_t^\top X_t + \lambda I)^{-1} X_t^\top (y_t - X_t \beta_0)\equiv A_t (y_t - X_t \beta_0),\qquad A_t := (X_t^\top X_t + \lambda I)^{-1} X_t^\top .$$Differentiating with respect to the shared block gives$$\frac{\partial F}{\partial \beta_0}= -2 \sum_{t=1}^{T} X_t^\top (y_t - X_t \beta_0 - X_t v_t) = 0 ,$$so $\sum_t X_t^\top (y_t - X_t \beta_0 - X_t \hat v_t) = 0$. Substituting@eq-transfer-multitask-learning-vt-stationary, namely$X_t^\top X_t \hat v_t = X_t^\top(y_t - X_t\beta_0) - \lambda \hat v_t$, into$X_t^\top(y_t - X_t\beta_0) - X_t^\top X_t \hat v_t = \lambda \hat v_t$ shows the$\beta_0$ condition collapses to$$\sum_{t=1}^{T} \lambda \, \hat v_t = 0\quad\Longleftrightarrow\quad\sum_{t=1}^{T} \hat v_t = 0 .$$ {#eq-transfer-multitask-learning-deviation-sum}The deviations sum to zero at the optimum: the shared center is the point fromwhich the task-specific corrections balance out, which is the precise sense inwhich $\beta_0$ is a "center." 
Substituting$\hat v_t = A_t(y_t - X_t\beta_0)$ into@eq-transfer-multitask-learning-deviation-sum yields a single linear systemfor $\beta_0$,$$\left( \sum_{t=1}^{T} A_t X_t \right) \beta_0 = \sum_{t=1}^{T} A_t y_t ,$$after which each $\hat v_t$ follows in closed form. This is algebraicallyidentical to the block solution$\hat\theta = (Z^\top Z + P)^{-1} Z^\top y$, but it makes the mechanismtransparent: profiling out the deviations leaves a $p \times p$ system in theshared center, so the multi-task fit costs essentially one ridge solve plus $T$small ridge solves, not one dense $(T+1)p$ solve, when implemented carefully.### The ridge-on-deviations as a Gaussian priorThe penalty $\lambda \sum_t \lVert v_t \rVert_2^2$ is not arbitrary. It is thenegative log of a hierarchical Gaussian prior. Take$$\beta_t = \beta_0 + v_t, \qquadv_t \sim \mathcal{N}(0, \tau^2 I), \qquady_t \mid X_t, \beta_t \sim \mathcal{N}(X_t \beta_t, \sigma^2 I),$$with a flat prior on $\beta_0$. The negative log posterior is, up to constants,$$\frac{1}{2\sigma^2} \sum_t \lVert y_t - X_t(\beta_0 + v_t) \rVert_2^2+ \frac{1}{2\tau^2} \sum_t \lVert v_t \rVert_2^2 ,$$which is exactly the joint objective with $\lambda = \sigma^2 / \tau^2$. Thesharing strength is therefore an estimate of the noise-to-heterogeneity ratio.Small $\tau^2$ (tasks tightly clustered around the center) sends$\lambda \to \infty$ and recovers pooling; large $\tau^2$ (tasks free to roam)sends $\lambda \to 0$ and recovers independent fits. This is the formal contentof the empirical observation that the best $\lambda$ is interior whenever thetasks are related but not identical: $\tau^2$ is finite and positive.::: {.callout-note}The hierarchical-prior reading turns "tune $\lambda$ by cross-validation" into"estimate the heterogeneity $\tau^2$." 
With many tasks one can estimate$\sigma^2$ and $\tau^2$ directly by marginal (empirical Bayes) likelihood andset $\lambda = \hat\sigma^2 / \hat\tau^2$, which is the multi-task analogue ofJames-Stein shrinkage and avoids a cross-validation grid entirely.:::### Shrinkage, bias, and variance in the orthonormal caseTo see the bias-variance trade-off in closed form, specialize to balancedorthonormal designs: $n_t = n$ and $X_t^\top X_t = n I$ for every task. Then$A_t = (n I + \lambda I)^{-1} X_t^\top$, and by symmetry the shared center is theaverage of the per-task OLS solutions. Writing$\hat\beta_t^{\text{ols}} = (X_t^\top X_t)^{-1} X_t^\top y_t$ and$\bar\beta^{\text{ols}} = \tfrac{1}{T}\sum_t \hat\beta_t^{\text{ols}}$, themulti-task estimate becomes a convex combination$$\hat\beta_t= (1 - \alpha)\, \hat\beta_t^{\text{ols}} + \alpha\, \bar\beta^{\text{ols}},\qquad\alpha = \frac{\lambda}{\,n + \lambda\,} \in [0, 1].$$ {#eq-transfer-multitask-learning-shrinkage}The estimator shrinks each task's OLS solution toward the cross-task mean by afraction $\alpha$ that grows with $\lambda$ and shrinks with sample size $n$.This is the exact analogue of ridge shrinkage and of James-Stein toward acommon mean. With $\beta_t = \beta_0 + \delta_t$ for true deviations $\delta_t$and per-coordinate OLS variance $\sigma^2/n$, the per-task mean squared error of@eq-transfer-multitask-learning-shrinkage decomposes (treating the mean asapproximately unbiased for $\beta_0$ when $T$ is large) as$$\mathbb{E}\,\lVert \hat\beta_t - \beta_t \rVert_2^2\approx\underbrace{\alpha^2 \lVert \delta_t \rVert_2^2}_{\text{bias}^2}+ \underbrace{(1-\alpha)^2 \frac{p\,\sigma^2}{n}}_{\text{variance}} ,$$ignoring the $O(1/T)$ variance of the mean. 
Minimizing over $\alpha$ gives theoptimal shrinkage$$\alpha^\star= \frac{p\,\sigma^2 / n}{\,p\,\sigma^2/n + \lVert \delta_t \rVert_2^2\,},\qquad\text{equivalently}\qquad\lambda^\star = \frac{p\,\sigma^2}{\lVert \delta_t \rVert_2^2}\;\;\text{(per coordinate, } \tau^2 = \lVert\delta_t\rVert_2^2/p\text{)}.$$Three readings of this formula are worth stating. First, more noise or smallersamples ($\sigma^2/n$ large) push $\alpha^\star$ toward 1: borrow more. Second,more genuine heterogeneity ($\lVert\delta_t\rVert_2^2$ large) pushes$\alpha^\star$ toward 0: borrow less. Third, $\alpha^\star$ is strictly interiorwhenever both terms are positive and finite, which is the algebraic reason theU-shaped curve in the demonstration has an interior minimum. The formula alsorecovers the failure modes: when the tasks are unrelated,$\lVert\delta_t\rVert_2^2 \to \infty$ forces $\alpha^\star \to 0$ and anypositive $\lambda$ is harmful, which is negative transfer in a single equation.The convex-combination identity@eq-transfer-multitask-learning-shrinkage is easy to confirm numerically.The check below builds orthonormal designs so that $X_t^\top X_t = n I$ exactly,solves the full block system, and compares the result to the shrinkage formula.```{r mtl-shrinkage-check}set.seed(7)T_chk <-4; p_chk <-3; n_chk <-20; lam <-5# Orthonormal designs: X_t^T X_t = n I via scaled orthonormal columns.mk_orth <-function() { Q <-qr.Q(qr(matrix(rnorm(n_chk * p_chk), n_chk, p_chk))) Q *sqrt(n_chk) # columns now have squared norm n}Xs <-replicate(T_chk, mk_orth(), simplify =FALSE)bt <-replicate(T_chk, rnorm(p_chk), simplify =FALSE)ys <-Map(function(X, b) as.numeric(X %*% b) +rnorm(n_chk), Xs, bt)# Block solve (same construction as fit_mtl).N <- T_chk * n_chkZ <-matrix(0, N, p_chk * (T_chk +1)); yv <-numeric(N); r0 <-0for (t in1:T_chk) { rows <- (r0 +1):(r0 + n_chk) Z[rows, 1:p_chk] <- Xs[[t]] Z[rows, (t * p_chk +1):((t +1) * p_chk)] <- Xs[[t]] yv[rows] <- ys[[t]]; r0 <- r0 + n_chk}pen <-c(rep(0, p_chk), 
rep(lam, p_chk * T_chk))theta <-solve(crossprod(Z) +diag(pen), crossprod(Z, yv))b0 <- theta[1:p_chk]beta_block <-lapply(1:T_chk, function(t) b0 + theta[(t * p_chk +1):((t +1) * p_chk)])# Shrinkage formula: convex combo of per-task OLS and their mean.ols <-Map(function(X, y) solve(crossprod(X), crossprod(X, y)), Xs, ys)obar <-Reduce(`+`, ols) / T_chkalpha <- lam / (n_chk + lam)beta_form <-lapply(ols, function(b) (1- alpha) * b + alpha * obar)max_abs_diff <-max(abs(unlist(beta_block) -unlist(beta_form)))cat("max |block - shrinkage formula| =", signif(max_abs_diff, 3), "\n")```The discrepancy is at the level of floating-point error, confirming that in theorthonormal case the block estimator is exactly the shrinkage of each task's OLStoward the cross-task mean with weight $\alpha = \lambda / (n + \lambda)$.## Runnable demonstrationWe simulate $T$ related regression tasks. Each task's true coefficientsequal a shared vector plus a small random deviation, so the tasks are relatedbut not identical. Each task gets only a modest sample, the situation whereborrowing strength should help. We compare three estimators: independent OLSper task, a single pooled model, and the joint multi-task estimator across agrid of $\lambda$. 
We evaluate by out-of-sample mean squared error againstthe known truth.```{r mtl-simulate}set.seed(2026)T_tasks <-6# number of tasksp <-5# number of predictors per taskn_train <-25# small training sample per taskn_test <-500# large test sample per task# Shared coefficient center plus small task-specific deviations.beta0_true <-c(1.5, -2.0, 0.0, 1.0, -0.5)sigma_dev <-0.4# how far tasks drift from the shared centersigma_eps <-1.0# noise standard deviation# True per-task coefficients: beta_t = beta0 + deviation_tbeta_true <-lapply(1:T_tasks, function(t) { beta0_true +rnorm(p, mean =0, sd = sigma_dev)})make_task <-function(t, n) { X <-matrix(rnorm(n * p), nrow = n, ncol = p) y <-as.numeric(X %*% beta_true[[t]]) +rnorm(n, sd = sigma_eps)list(X = X, y = y)}train <-lapply(1:T_tasks, function(t) make_task(t, n_train))test <-lapply(1:T_tasks, function(t) make_task(t, n_test))```Independent OLS, fit one task at a time.```{r mtl-independent}fit_independent <-function(task) {# No intercept: data are centered around zero by construction.solve(crossprod(task$X), crossprod(task$X, task$y))}beta_ind <-lapply(train, fit_independent)```The pooled model stacks all tasks and fits a single coefficient vector,the $\lambda \to \infty$ limit of the joint objective.```{r mtl-pooled}X_all <-do.call(rbind, lapply(train, `[[`, "X"))y_all <-do.call(c, lapply(train, `[[`, "y"))beta_pool <-solve(crossprod(X_all), crossprod(X_all, y_all))```Now the joint multi-task estimator. We build the block matrix $Z$ and thepenalty $P$ described in the derivation, then solve the ridge-like system.```{r mtl-joint}fit_mtl <-function(train, lambda, p, T_tasks) { n_t <-sapply(train, function(z) length(z$y)) N <-sum(n_t)# Z has p shared columns followed by p columns per task. 
Z <-matrix(0, nrow = N, ncol = p * (T_tasks +1)) y <-numeric(N) row0 <-0for (t in1:T_tasks) { rows <- (row0 +1):(row0 + n_t[t]) Z[rows, 1:p] <- train[[t]]$X # shared block beta0 cols <- (t * p +1):((t +1) * p) # deviation block v_t Z[rows, cols] <- train[[t]]$X y[rows] <- train[[t]]$y row0 <- row0 + n_t[t] }# Penalty: 0 on the shared block, lambda on every deviation block. pen <-c(rep(0, p), rep(lambda, p * T_tasks)) P <-diag(pen) theta <-solve(crossprod(Z) + P, crossprod(Z, y)) beta0 <- theta[1:p] betas <-lapply(1:T_tasks, function(t) { beta0 + theta[(t * p +1):((t +1) * p)] })list(beta0 = beta0, betas = betas)}```A common test-MSE helper, evaluated against held-out data per task.```{r mtl-eval}test_mse <-function(beta_list) { errs <-sapply(1:T_tasks, function(t) { pred <-as.numeric(test[[t]]$X %*% beta_list[[t]])mean((test[[t]]$y - pred)^2) })mean(errs)}mse_ind <-test_mse(beta_ind)mse_pool <-test_mse(rep(list(beta_pool), T_tasks))lambda_grid <-10^seq(-2, 3, length.out =30)mse_mtl <-sapply(lambda_grid, function(lam) { fit <-fit_mtl(train, lam, p, T_tasks)test_mse(fit$betas)})best_idx <-which.min(mse_mtl)best_lambda <- lambda_grid[best_idx]best_mse <- mse_mtl[best_idx]results <-data.frame(method =c("Independent OLS", "Pooled", "Multi-task (best lambda)"),test_mse =round(c(mse_ind, mse_pool, best_mse), 4))print(results)cat("Best lambda:", round(best_lambda, 3), "\n")```Reading the printed table: the multi-task estimator at its best $\lambda$ hasthe lowest mean test MSE, beating both independent OLS (too noisy, becauseeach task fits only 25 points) and pooling (too biased, because the tasks arenot identical). The reported best $\lambda$ is an interior value, neither nearzero nor near infinity, which is the signature of genuine borrowing.@fig-transfer-multitask-learning-mse-curve shows how multi-tasktest error varies with $\lambda$, with theindependent and pooled baselines as horizontal references. 
The U shape is thebias-variance trade-off across the sharing strength: too little sharingbehaves like the noisy independent fits, too much sharing behaves like thebiased pooled fit, and the minimum sits in between.```{r fig-transfer-multitask-learning-mse-curve, fig.width=7, fig.height=4.5, fig.cap="Multi-task test MSE versus sharing strength, against independent and pooled baselines."}library(ggplot2)df <-data.frame(lambda = lambda_grid, mse = mse_mtl)ggplot(df, aes(x = lambda, y = mse)) +geom_line(color ="steelblue", linewidth =1) +geom_point(color ="steelblue") +geom_hline(yintercept = mse_ind, linetype ="dashed", color ="firebrick") +geom_hline(yintercept = mse_pool, linetype ="dotted", color ="darkgreen") +annotate("point", x = best_lambda, y = best_mse, color ="black", size =3) +scale_x_log10() +annotate("text", x =min(lambda_grid), y = mse_ind,label ="Independent OLS", color ="firebrick",hjust =0, vjust =-0.6, size =3.3) +annotate("text", x =min(lambda_grid), y = mse_pool,label ="Pooled", color ="darkgreen",hjust =0, vjust =-0.6, size =3.3) +labs(x =expression(lambda~"(log scale)"),y ="Mean test MSE across tasks",title ="Borrowing strength across related tasks") +theme_minimal(base_size =12)```With small per-task samples and genuinely related tasks, the multi-taskestimator at its best $\lambda$ should beat both baselines: it avoids thehigh variance of independent OLS and the bias of pooling. If you increase`n_train` substantially, the independent fits improve and the gap shrinks,because each task no longer needs to borrow. If you increase `sigma_dev` sothe tasks drift far apart, pooling and small-$\lambda$ multi-task degrade,which is negative transfer, discussed next.::: {.callout-tip}Try editing `n_train`, `sigma_dev`, and `sigma_eps` and rerunning.Watching the U-shaped curve flatten, deepen, or invert builds far moreintuition than any single number in the table. 
The whole point of a runnabledemonstration is that the trade-off is yours to probe.:::## Negative transferSharing helps only when tasks are actually related. When they are not, forcingthem to share hurts, and the shared model does worse than independent models.This failure is called negative transfer, surveyed in @Pan_2010 and@Zhang_2021.::: {.callout-warning}Negative transfer is not a rare edge case; it is the defaultoutcome when you assume relatedness without checking it. The discipline thatprotects you is simple: always keep an independent-model baseline and refuseto ship a shared model that loses to it on a clean target holdout.:::You can see it directly in the demonstration. The pooled model is the$\lambda \to \infty$ limit, and when `sigma_dev` is large the pooled MSE risesabove the independent MSE: the tasks are too different to share a singlecoefficient vector. The remedy is to let $\lambda$ be chosen by validationrather than fixed, so the data decide how much sharing is appropriate, and tolet it go small when sharing does not pay.Practical signals of negative transfer: a multi-task model whose per-taskmetrics are worse than separately trained models; one dominant task whosegradients swamp the others; or source and target distributions that lookunrelated under a two-sample test. The defenses are task weighting (nextsection), grouping only tasks that are similar, and architectures that shareless (soft sharing, or task-specific layers). Closely related is ensemblelearning (@sec-ensemble-learning), where combining diverse models is preferredto forcing one model to serve mismatched objectives.## Task weightingWhen tasks are trained jointly the total loss is a weighted sum,$$\mathcal{L} = \sum_{t=1}^{T} w_t \, \mathcal{L}_t ,$$and the weights $w_t$ matter.^[Setting every $w_t = 1$ looks neutral but isnot: it implicitly weights each task by its raw loss scale, so a task measuredin larger units silently dominates.] 
If one task has a much larger loss scale
or many more examples, it dominates the gradient and the others are under-fit.
The simplest fix is to normalize each task's loss to a comparable scale.
Beyond that, two learned schemes are common.

*Uncertainty weighting* (@Kendall_2018) treats each task's noise level as a
parameter $\sigma_t$ and derives weights from a Gaussian likelihood. For
regression tasks the objective becomes

$$
\mathcal{L}
= \sum_{t=1}^{T}
\left( \frac{1}{2 \sigma_t^2}\, \mathcal{L}_t + \log \sigma_t \right),
$$

so a task with high uncertainty gets a small weight $1 / (2\sigma_t^2)$, and
the $\log \sigma_t$ term prevents the weights from collapsing to zero. The
$\sigma_t$ are learned by gradient descent alongside the model.

To see where this objective comes from, model each regression task as Gaussian
with its own observation noise, $y_t \sim \mathcal{N}(f_t(x), \sigma_t^2)$. The
negative log likelihood of one task, dropping the constant $\frac12\log 2\pi$,
is

$$
-\log p(y_t \mid f_t(x))
= \frac{1}{2\sigma_t^2} \lVert y_t - f_t(x) \rVert_2^2 + \log \sigma_t ,
$$

and summing over tasks (which assumes conditional independence across tasks
given the shared representation) gives exactly the displayed objective with
$\mathcal{L}_t = \lVert y_t - f_t(x)\rVert_2^2$. The weighting is therefore not a
heuristic; it is maximum likelihood for a heteroscedastic multi-task Gaussian
model in which each task is allowed its own noise scale. Profiling out
$\sigma_t$ makes the mechanism explicit: setting
$\partial \mathcal{L} / \partial \sigma_t = 0$ gives

$$
-\frac{\mathcal{L}_t}{\sigma_t^3} + \frac{1}{\sigma_t} = 0
\quad\Longrightarrow\quad
\hat\sigma_t^2 = \mathcal{L}_t ,
$$

so at the optimum the learned noise variance equals the task's own residual
loss, and back-substituting yields a profiled objective
$\sum_t \big(\tfrac12 + \tfrac12\log \mathcal{L}_t\big)$, that is, a sum of log
losses.
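
The profiling result can be checked numerically. The sketch below is a minimal
illustration, not part of the chapter's demonstration: the per-task losses in
`L_t` are made-up values on deliberately different scales, `uw_objective` is a
hypothetical helper name, and each task's term is minimized over
$s = \log \sigma^2$ (an assumed reparameterization that keeps the variance
positive) with base R's `optimize()`.

```{r}
# Hypothetical per-task losses on very different scales (illustrative only).
L_t <- c(task_a = 0.04, task_b = 2.5, task_c = 130)

# One task's uncertainty-weighted term, written in s = log(sigma^2):
#   loss / (2 * sigma^2) + log(sigma) = 0.5 * exp(-s) * loss + 0.5 * s
uw_objective <- function(s, loss) 0.5 * exp(-s) * loss + 0.5 * s

# The total objective is a sum over tasks, so each s can be minimized separately.
s_hat <- sapply(L_t, function(loss) {
  optimize(uw_objective, interval = c(-20, 20), loss = loss)$minimum
})

sigma2_hat <- exp(s_hat)              # fitted noise variances
weights    <- 1 / (2 * sigma2_hat)    # implied task weights 1 / (2 sigma^2)
profiled   <- mapply(uw_objective, s_hat, L_t)

# Compare the numerical optimum with the closed-form expressions.
round(cbind(loss          = L_t,
            sigma2_hat    = sigma2_hat,
            weight        = weights,
            weighted_loss = weights * L_t,
            profiled      = profiled,
            closed_form   = 0.5 + 0.5 * log(L_t)), 4)
```

If the derivation holds, `sigma2_hat` should match `loss` up to the optimizer's
tolerance, every `weighted_loss` should be close to 0.5 regardless of the raw
scale, and `profiled` should agree with `closed_form`.
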
The practical effect is automatic, scale-free balancing: a task with a
large irreducible loss is downweighted in proportion to that loss, so no task
dominates merely because it is measured in larger units or is intrinsically
noisier. (In practice one optimizes $s_t = \log \sigma_t^2$ rather than
$\sigma_t$ for numerical stability and to keep the variance positive.)

*Gradient normalization* (@Chen_2018) instead adjusts the weights so that the
tasks train at similar rates, by balancing the magnitudes of their gradients
on the shared parameters. Both methods aim at the same goal: keep any single
task from dominating, which is one of the main causes of negative transfer.

## Practical guidance and pitfalls

When to reach for these methods:

- The target task has little labeled data but a related, data-rich source
  exists. Transfer learning is the default starting point.
- You must predict several related outcomes from the same inputs. A multi-task
  model can beat separate models and is cheaper to serve.
- Your training and serving input distributions differ. Domain adaptation,
  through importance weighting or representation alignment, is the repair.

Pitfalls to watch:

- *Assuming relatedness.* Verify it. Check whether sharing actually improves
  held-out per-task metrics versus independent baselines. If it does not, you
  have negative transfer.
- *Data leakage during transfer.* If you select the source model or tune
  $\lambda$ using the target test set, your reported gains are optimistic.
  Keep a clean target holdout.
- *Distribution shift in the source.* A pretrained backbone encodes its source
  distribution. If the target inputs differ sharply (different sensors,
  languages, time periods), early-layer features may not transfer.
- *Catastrophic forgetting during fine-tuning.* Too high a learning rate
  erases the pretrained structure.
  Use small learning rates and gradual unfreezing.
- *Imbalanced tasks.* Normalize loss scales and consider learned task
  weighting so that large or noisy tasks do not dominate.
- *Tuning the sharing strength on the wrong signal.* In the linear model, pick
  $\lambda$ by cross-validation on the target, not by training error, which
  always prefers small $\lambda$.

A reasonable default recipe: start with feature extraction or a moderately
penalized multi-task model, establish an independent-model baseline, tune the
sharing strength on validation data, and only move to full fine-tuning or
strong sharing if the data are abundant and the relatedness is confirmed.

::: {.callout-important title="Key idea"}
Every method in this chapter is a knob on a single dial, from
fully independent models to a single shared model. Transfer, multi-task,
domain adaptation, and task weighting are all ways of choosing, and
defending, where to set that dial. Let validation data, not optimism, make
the choice.
:::

## Further reading

- Caruana (1997), the foundational treatment of multi-task learning as
  inductive transfer.
- Evgeniou and Pontil (2004), regularized multi-task learning, the formulation
  used in this chapter's demonstration.
- Pan and Yang (2010), a broad survey of transfer learning.
- Yosinski, Clune, Bengio, and Lipson (2014), on how transferable features are
  across layers of deep networks.
- Ganin et al. (2016), domain-adversarial training for domain adaptation.
- Ruder (2017), an overview of multi-task learning in deep networks, including
  hard and soft parameter sharing.
- Kendall, Gal, and Cipolla (2018), uncertainty-based task weighting.
- Chen et al. (2018), GradNorm for gradient-based task balancing.
- Howard and Ruder (2018), discriminative fine-tuning and gradual unfreezing.
- Zhang and Yang (2021), a recent survey of multi-task learning methods.