Advanced Data Analysis

Nguyen, Mike

88 Machine Learning for Time-Series Forecasting

Most of the methods in this book assume that the rows of a data set are exchangeable: we could shuffle them without losing anything, and a random train/test split is a fair test of generalization. Time series break that assumption. A measurement at noon is not interchangeable with a measurement at midnight, last month’s value tells you a great deal about this month’s, and the future is, by definition, the part of the data you have not seen yet. Forecasting is the task of using the ordered past to say something useful about the ordered future.

This chapter builds the topic from the ground up. We start with the forecasting setup and the notation, explain why naive cross-validation leaks information across time, and then construct the two ingredients that make modern machine-learning forecasters work: lag features and rolling-window summaries. With those in hand we compare three families of forecasters on a real series: classical statistical baselines (the naive forecast, exponential smoothing, and ARIMA), tree-based machine learning (gradient boosting, Chapter 12), and a brief look at deep sequence models. Throughout, the running example forecasts monthly airline passenger counts, and every claim about accuracy is backed by a rolling-origin evaluation rather than a single lucky split.

Intuition

A forecaster is a function that maps “everything known up to now” into “a guess about what comes next.” The art is choosing what to feed it (features) and how to score it honestly (evaluation that respects time).

88.1 The Forecasting Setup

Let $\{y_t\}_{t=1}^{T}$ be a univariate time series observed at regular intervals, where $t$ indexes time and $y_t \in \mathbb{R}$ is the value at time $t$. We may also observe exogenous predictors $\mathbf{x}_t \in \mathbb{R}^{p}$ (for example a holiday flag or a promotion indicator). The information set available at time $t$ is

\[ \mathcal{F}_t = \{\, y_s, \mathbf{x}_s : s \le t \,\}, \]

the entire ordered history through time $t$. A point forecast made at the forecast origin $t$ for a target $h$ steps ahead is written

\[ \hat{y}_{t+h\mid t} = f\!\left(\mathcal{F}_t\right), \]

read as “the prediction of $y_{t+h}$ given information through $t$.” The integer $h \ge 1$ is the forecast horizon. Forecasting one step ahead ($h=1$) is the easiest case; multi-step forecasting ($h > 1$) is harder because errors accumulate and because the inputs you would like to use (recent values of $y$) are themselves unknown for the steps in between.

The quantity we ultimately care about is the forecast error,

\[ e_{t+h} = y_{t+h} - \hat{y}_{t+h\mid t}, \]

and a forecasting model is good if these errors are small and unbiased across many origins.

Key idea

Two numbers define every forecasting problem before you write any code: the horizon $h$ (how far ahead) and the origin (the last time point you are allowed to use). Fixing them up front prevents almost every subtle mistake that follows.

88.1.1 Decomposition and What Models Try to Capture

A classical way to reason about a series is to imagine it as the sum of unobserved components,

\[ y_t = T_t + S_t + R_t, \]

where $T_t$ is a slowly varying trend, $S_t$ is a seasonal pattern that repeats with known period $m$ (for monthly data with yearly seasonality, $m=12$), and $R_t$ is the remainder (the irregular part). When seasonal swings grow with the level of the series, a multiplicative form $y_t = T_t \times S_t \times R_t$ fits better, and taking logs turns it back into an additive problem because $\log y_t = \log T_t + \log S_t + \log R_t$. Every forecasting method in this chapter is, in effect, a different bet about how trend, seasonality, and noise combine.

88.2 Splitting Time Honestly: Backtesting and Leakage

The single most common error in applied forecasting is evaluating a model with a random split or with standard $k$-fold cross-validation. Both place future observations in the training set and past observations in the test set, which lets the model “peek” at information it would never have at deployment time. The result is an optimistic accuracy estimate that collapses in production.

Warning

Never shuffle a time series before splitting, and never use plain $k$-fold cross-validation on raw temporal data. Any procedure that lets a training point come from after a test point is leaking the future.

The honest alternative is rolling-origin evaluation, also called time-series cross-validation or backtesting. We pick an initial training window, forecast the next $h$ steps, record the error, then move the origin forward and repeat. There are two common variants:

Expanding window: the training set grows as the origin advances (use all history up to the origin). Good when more history always helps.
Sliding window: the training set keeps a fixed length and drops the oldest points. Good when the data-generating process drifts and old data is misleading.

Formally, with origins $t = t_0, t_0+1, \dots, T-h$, the rolling-origin estimate of a loss $L$ at horizon $h$ is the average over origins,

\[ \widehat{\text{CV}}_h = \frac{1}{N}\sum_{t=t_0}^{T-h} L\!\left(y_{t+h},\, \hat{y}_{t+h\mid t}\right), \qquad N = T - h - t_0 + 1 . \]

Figure 88.1 illustrates the expanding-window scheme: each row is one origin, blue points are training, the orange point is the value being forecast, and gray points are not yet visible to the model.

Show code

set.seed(1)
n_pts <- 16
origins <- 8:13
plot(NA, xlim = c(1, n_pts), ylim = c(length(origins) + 0.5, 0.5),
     xlab = "time index", ylab = "rolling origin (row)", yaxt = "n",
     main = "Rolling-origin (expanding window)")
axis(2, at = seq_along(origins), labels = origins, las = 1)
for (r in seq_along(origins)) {
  o <- origins[r]
  train <- 1:o
  test <- o + 1
  future <- if (test < n_pts) (test + 1):n_pts else integer(0)
  points(train, rep(r, length(train)), pch = 19, col = "steelblue", cex = 1.1)
  points(test, r, pch = 17, col = "darkorange", cex = 1.5)
  if (length(future))
    points(future, rep(r, length(future)), pch = 1, col = "gray70", cex = 1.1)
}
legend("topright", legend = c("train", "forecast target", "hidden future"),
       pch = c(19, 17, 1), col = c("steelblue", "darkorange", "gray70"),
       bty = "n", cex = 0.9)

Figure 88.1: Expanding-window rolling-origin evaluation. Each horizontal row is one forecast origin; the model trains on the blue points and is scored on the orange (future) point, while the gray points remain hidden. The origin advances downward, so no training point ever comes from after the point being scored.

Figure 88.1 makes the rule visual: read any row left to right and the orange target always sits to the right of every blue training point.

Note

Leakage also hides inside feature engineering. If you scale features using the mean and standard deviation of the whole series, or impute missing values with a global statistic, the training rows have absorbed information from the test period. Compute every such statistic on the training window only, inside each fold.

88.3 Turning a Time Series into a Supervised Learning Problem

Machine-learning regressors such as linear models, random forests, and gradient boosting do not know about time. They expect a feature matrix $\mathbf{X}$ and a target vector $\mathbf{y}$. The bridge from a series to that format is lag embedding: we define the target as the value $h$ steps ahead and the features as a window of recent past values.

For one-step-ahead forecasting with $k$ lags, the design is

\[ \underbrace{y_t}_{\text{target}} \;\longleftarrow\; \big(\, y_{t-1},\, y_{t-2},\, \dots,\, y_{t-k} \,\big), \]

so each training row is a length-$k$ snapshot of history paired with the next value. This is exactly the autoregressive idea: the order-$k$ autoregressive model $\text{AR}(k)$ is the special case where $f$ is linear,

\[ \hat{y}_{t} = c + \sum_{j=1}^{k}\phi_j\, y_{t-j}. \]

Replacing that linear $f$ with a tree ensemble or any other learner gives a nonlinear autoregression, which is the workhorse of machine-learning forecasting.

Two enrichments make lag features much stronger:

Rolling-window summaries. Beyond raw lags, summarize recent history with a moving statistic computed strictly from past values, for example a trailing mean over the last $w$ points, $\bar{y}_{t-1}^{(w)} = \frac{1}{w}\sum_{j=1}^{w} y_{t-j}$, or a trailing standard deviation. These smooth out noise and hand the model a ready-made notion of “recent level” and “recent volatility.”
Calendar features. Encode the position in the seasonal cycle (month of year, day of week) so the model can learn seasonality directly. For monthly data the month index $1,\dots,12$ is enough for a tree; linear models prefer sine/cosine pairs such as $\sin(2\pi m_t / 12)$ and $\cos(2\pi m_t / 12)$.

Warning

Every feature must be computable at the forecast origin. A lag of order 1 uses $y_{t-1}$, which is fine. A “centered” moving average that averages $y_{t-1}, y_t, y_{t+1}$ is not: it uses the future. When in doubt, ask whether the feature could have been written down the moment before $y_t$ was observed.

88.3.1 Multi-Step Horizons: Recursive vs Direct

Once we want $h > 1$ steps, there are two standard strategies.

The recursive strategy trains a single one-step model and feeds its own predictions back as inputs: forecast $\hat{y}_{t+1\mid t}$, then treat it as if observed to forecast $\hat{y}_{t+2\mid t}$, and so on. It is simple and data-efficient but lets one-step errors compound across the horizon.

The direct strategy trains a separate model for each horizon, with target $y_{t+h}$ regressed on features available at $t$, for each $h = 1, \dots, H$. It avoids error feedback at the cost of fitting $H$ models and ignoring the relationship between consecutive horizons.

When to use this

Use the recursive strategy when you have little data or a short horizon, and the direct strategy when the horizon is long, you have enough data to fit several models, and you can tolerate the extra bookkeeping. Hybrid schemes exist, but starting with recursive and checking whether direct helps is a sound default.

88.4 A Worked Comparison on AirPassengers

We now forecast the built-in AirPassengers series: monthly totals of international airline passengers from 1949 to 1960, a classic example with a clear upward trend and strong yearly seasonality whose amplitude grows over time. We hold out the final 24 months as a test set, forecast them, and compare a naive seasonal baseline, exponential smoothing (ETS), ARIMA, and an xgboost regressor built on lag and calendar features.

The multiplicative seasonality (swings widen as the level rises) is exactly the case where a log transform helps, so the machine-learning model is trained on $\log y_t$ and predictions are exponentiated back.

Show code

suppressPackageStartupMessages({
  library(forecast)
  library(xgboost)
  library(Metrics)
})

data("AirPassengers")
y <- as.numeric(AirPassengers)
n <- length(y)
freq <- 12L

h_test <- 24L                       # hold out final 2 years
train_idx <- 1:(n - h_test)
test_idx <- (n - h_test + 1):n

# Calendar position (month of year) for every observation
month_of_year <- as.integer(cycle(AirPassengers))

88.4.1 Building the Supervised Frame

We construct lag features (orders 1, 2, 3, and the seasonal lag 12), trailing rolling summaries computed only from past values, and the month index. The helper below returns a data frame aligned so that row $t$ holds the target $y_t$ and features that were all knowable at time $t-1$.

Show code

make_supervised <- function(y, month, lags = c(1, 2, 3, 12),
                            roll_windows = c(3, 12)) {
  N <- length(y)
  df <- data.frame(target = y, month = month)
  for (L in lags) {
    df[[paste0("lag", L)]] <- c(rep(NA, L), y[1:(N - L)])
  }
  # Trailing rolling mean/sd that END at t-1 (no leakage)
  for (w in roll_windows) {
    rm_ <- rep(NA_real_, N)
    rs_ <- rep(NA_real_, N)
    for (t in seq_len(N)) {
      if (t - 1 >= w) {
        win <- y[(t - w):(t - 1)]
        rm_[t] <- mean(win)
        rs_[t] <- sd(win)
      }
    }
    df[[paste0("rollmean", w)]] <- rm_
    df[[paste0("rollsd", w)]] <- rs_
  }
  df
}

# Train on the log scale because seasonality is multiplicative
sup <- make_supervised(log(y), month_of_year)
sup_complete <- sup[stats::complete.cases(sup), ]
sup_complete$row <- as.integer(rownames(sup_complete))

feat_cols <- setdiff(names(sup_complete), c("target", "row"))
head(sup_complete[, c("row", "target", "lag1", "lag12", "rollmean3", "month")])
#>    row   target     lag1    lag12 rollmean3 month
#> 13  13 4.744932 4.770685 4.718499  4.731400     1
#> 14  14 4.836282 4.744932 4.770685  4.720003     2
#> 15  15 4.948760 4.836282 4.882802  4.783966     3
#> 16  16 4.905275 4.948760 4.859812  4.843325     4
#> 17  17 4.828314 4.905275 4.795791  4.896772     5
#> 18  18 5.003946 4.828314 4.905275  4.894116     6

Note

The first 12 rows have no seasonal lag and are dropped by complete.cases. This is normal: lag features always cost you a warm-up period equal to the largest lag.

88.4.2 Fitting the Models

We train ETS and ARIMA with the forecast package on the original scale (they handle the trend and seasonality internally), and an xgboost model on the engineered log-scale features. The seasonal naive forecast, which simply repeats the value from 12 months earlier, is our floor: any method that cannot beat it is not worth its complexity.

Show code

train_ts <- ts(y[train_idx], frequency = freq, start = c(1949, 1))

# Classical baselines
fit_snaive <- snaive(train_ts, h = h_test)
fit_ets <- forecast(ets(train_ts), h = h_test)
fit_arima <- forecast(auto.arima(train_ts), h = h_test)

# xgboost on engineered features (recursive multi-step below)
tr <- sup_complete[sup_complete$row <= max(train_idx), ]
dtrain <- xgb.DMatrix(as.matrix(tr[, feat_cols]), label = tr$target)

set.seed(42)
xgb_fit <- xgb.train(
  params = list(objective = "reg:squarederror", max_depth = 3,
                eta = 0.1, subsample = 0.9, colsample_bytree = 0.9),
  data = dtrain, nrounds = 300, verbose = 0
)

The boosting model is one-step-ahead by construction, so to cover the 24-month horizon we forecast recursively: predict the next month, append the prediction to the history, recompute the lag and rolling features, and step forward.

Show code

recursive_forecast <- function(model, y_log_hist, month_future, h,
                               lags = c(1, 2, 3, 12),
                               roll_windows = c(3, 12), feat_cols) {
  hist <- y_log_hist                    # log-scale history, grows as we predict
  preds <- numeric(h)
  for (i in seq_len(h)) {
    Ncur <- length(hist)
    row <- list(month = month_future[i])
    for (L in lags) row[[paste0("lag", L)]] <- hist[Ncur - L + 1]
    for (w in roll_windows) {
      win <- hist[(Ncur - w + 1):Ncur]
      row[[paste0("rollmean", w)]] <- mean(win)
      row[[paste0("rollsd", w)]] <- sd(win)
    }
    xrow <- as.matrix(as.data.frame(row)[, feat_cols, drop = FALSE])
    p <- predict(model, xgb.DMatrix(xrow))
    preds[i] <- p
    hist <- c(hist, p)                  # feed prediction back
  }
  preds
}

month_future <- month_of_year[test_idx]
xgb_log_pred <- recursive_forecast(
  xgb_fit, log(y[train_idx]), month_future, h_test, feat_cols = feat_cols
)
xgb_pred <- exp(xgb_log_pred)           # back to passenger counts

88.4.3 Visualizing the Forecasts

Figure 88.2 overlays the four forecasts on the held-out period. The seasonal naive forecast keeps the right shape but misses the growth; ETS and ARIMA track both trend and seasonality closely; the boosting model, trained only on lags and calendar position, recovers the seasonal pattern from data alone.

Show code

actual <- y[test_idx]
time_test <- time(AirPassengers)[test_idx]

plot(time_test, actual, type = "o", pch = 19, lwd = 2,
     ylim = range(actual, fit_ets$mean, fit_arima$mean, xgb_pred,
                  fit_snaive$mean),
     xlab = "year", ylab = "passengers (thousands)",
     main = "Hold-out forecasts vs actual")
lines(time_test, as.numeric(fit_snaive$mean), col = "gray50", lwd = 2, lty = 2)
lines(time_test, as.numeric(fit_ets$mean), col = "steelblue", lwd = 2)
lines(time_test, as.numeric(fit_arima$mean), col = "darkgreen", lwd = 2)
lines(time_test, xgb_pred, col = "darkorange", lwd = 2)
legend("topleft", bty = "n", lwd = 2,
       lty = c(1, 2, 1, 1, 1),
       col = c("black", "gray50", "steelblue", "darkgreen", "darkorange"),
       legend = c("actual", "seasonal naive", "ETS", "ARIMA", "xgboost"))

Figure 88.2: Forecasts of monthly AirPassengers counts over the 24-month hold-out period. The black line is the truth; colored lines are the seasonal naive, ETS, ARIMA, and xgboost forecasts. ETS, ARIMA, and xgboost all capture trend and seasonality, while the seasonal naive baseline misses the upward growth.

Figure 88.2 is a single-split snapshot. To judge the methods fairly we must average over many origins, which is the next step.

88.4.4 Rolling-Origin Evaluation

We score each method with an expanding-window backtest at horizon $h=1$. Starting from an initial window of 100 months, we forecast the next month, advance the origin, and repeat to the end of the series. The error metrics are the mean absolute error (MAE) and the root mean squared error (RMSE),

\[ \text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert e_i\rvert, \qquad \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^{2}} , \]

and the mean absolute scaled error (MASE), which divides MAE by the in-sample MAE of the seasonal naive forecast so that a value below 1 means “better than the seasonal naive baseline.”

Show code

init <- 100L
origins <- init:(n - 1L)                 # one-step-ahead origins

err_snaive <- err_ets <- err_arima <- err_xgb <- numeric(length(origins))

for (k in seq_along(origins)) {
  o <- origins[k]
  ytr <- ts(y[1:o], frequency = freq, start = c(1949, 1))
  truth <- y[o + 1]

  # Seasonal naive: value from 12 months earlier
  sn <- y[o + 1 - freq]
  # ETS and ARIMA refit each origin (small series, fast)
  e1 <- as.numeric(forecast(ets(ytr), h = 1)$mean)
  a1 <- as.numeric(forecast(auto.arima(ytr), h = 1)$mean)

  # xgboost: rebuild supervised frame on log scale up to o, fit, predict 1 step
  sup_o <- make_supervised(log(y[1:o]), month_of_year[1:o])
  sup_o <- sup_o[stats::complete.cases(sup_o), ]
  d_o <- xgb.DMatrix(as.matrix(sup_o[, feat_cols]), label = sup_o$target)
  set.seed(42)
  m_o <- xgb.train(
    params = list(objective = "reg:squarederror", max_depth = 3,
                  eta = 0.1, subsample = 0.9, colsample_bytree = 0.9),
    data = d_o, nrounds = 200, verbose = 0
  )
  x1 <- recursive_forecast(m_o, log(y[1:o]), month_of_year[o + 1], 1,
                           feat_cols = feat_cols)
  xg <- exp(x1)

  err_snaive[k] <- truth - sn
  err_ets[k] <- truth - e1
  err_arima[k] <- truth - a1
  err_xgb[k] <- truth - xg
}

# In-sample seasonal-naive MAE for MASE scaling
scale_mae <- mean(abs(diff(y, lag = freq)))

summarise_err <- function(e) {
  c(MAE = mean(abs(e)), RMSE = sqrt(mean(e^2)), MASE = mean(abs(e)) / scale_mae)
}

results <- rbind(
  `Seasonal naive` = summarise_err(err_snaive),
  `ETS`            = summarise_err(err_ets),
  `ARIMA`          = summarise_err(err_arima),
  `xgboost`        = summarise_err(err_xgb)
)
results <- round(results, 3)

Table 88.1 reports the rolling-origin errors across 44 one-step forecasts. Lower is better on all three columns, and a MASE below 1 beats the seasonal naive baseline.

Show code

knitr::kable(
  results,
  caption = "Rolling-origin (expanding-window) one-step-ahead forecast accuracy on AirPassengers, averaged over all origins from month 100 to the end of the series. MAE and RMSE are in passenger-count units; MASE is scaled so that values below 1 beat the seasonal naive baseline."
)

Table 88.1: Rolling-origin (expanding-window) one-step-ahead forecast accuracy on AirPassengers, averaged over all origins from month 100 to the end of the series. MAE and RMSE are in passenger-count units; MASE is scaled so that values below 1 beat the seasonal naive baseline.

	MAE	RMSE	MASE
Seasonal naive	37.409	42.587	1.168
ETS	14.856	18.785	0.464
ARIMA	12.189	15.333	0.381
xgboost	23.749	30.387	0.741

Table 88.1 tells the realistic story: the statistical models (ETS, ARIMA) are very strong on a clean, strongly seasonal series like this one, and a generic boosting model trained on a handful of lag features is competitive but does not automatically dominate. That ordering is typical and worth internalizing.

Key idea

On short, clean, strongly seasonal univariate series, well-tuned classical methods are hard to beat. Machine-learning forecasters earn their keep when you have many related series, many exogenous predictors, nonlinear interactions, or irregular structure that ARIMA and ETS cannot express.

88.5 Where Deep Sequence Models Fit

When the data is long, high-dimensional, or shared across thousands of related series (retail demand across stores and products, sensor networks, energy load), deep models become attractive. Recurrent networks (LSTM, GRU) carry a hidden state forward through time, and Transformer-style architectures use self-attention to relate distant time points directly. These models can learn cross-series patterns through global training, which is exactly where the per-series classical methods struggle. The neural-network architectures behind them are developed in the chapters on neural networks (Chapter 15) and Transformers (Chapter 38); the code below is an illustrative Keras sketch of an LSTM forecaster and is not executed here, because it needs the Python/TensorFlow backend.

Show code

# Illustrative only (requires keras + TensorFlow Python backend).
library(keras)

# Suppose X has shape (samples, timesteps, features) and yv is the target.
model <- keras_model_sequential() |>
  layer_lstm(units = 32, input_shape = c(dim(X)[2], dim(X)[3])) |>
  layer_dense(units = 1)

model |> compile(optimizer = "adam", loss = "mse")
model |> fit(X, yv, epochs = 50, batch_size = 16, validation_split = 0.2)
preds <- model |> predict(X_future)

When to use this

Reach for deep sequence models when you are forecasting many series jointly, have rich covariates, or need to capture long-range and nonlinear dependencies. For a single short series, they are usually overkill and underperform a tuned ETS or ARIMA.

88.6 Practical Guidance and Pitfalls

Forecasting projects fail in predictable ways. The recurring lessons:

Always benchmark against naive methods. The seasonal naive and the last-value naive forecasts are free. If your elaborate model cannot beat them on a rolling-origin backtest, the complexity is not buying anything.
Evaluate with the deployment horizon. If you will forecast 12 weeks ahead, backtest at $h=12$, not $h=1$. One-step accuracy is a poor proxy for multi-step accuracy because errors compound.
Refit inside the backtest loop. Statistics used by features (means, scales) and model parameters must be recomputed on each training window. Fitting once on all data and then “evaluating” leaks the future, exactly the trap Table 88.1 avoids by refitting at every origin.
Mind non-stationarity. Trends and changing variance violate the assumptions of many models. Differencing ($\nabla y_t = y_t - y_{t-1}$), seasonal differencing, and log or Box-Cox transforms are the standard fixes; ARIMA’s middle “I” is integrated differencing.
Watch the warm-up cost. Large lags shorten your usable training data. A seasonal lag of 12 throws away the first year before any row is complete.
Tune the right thing. For boosting, the number of rounds, tree depth, and learning rate interact; tune them with the same rolling-origin scheme you use for evaluation, never with a random split.
Prediction intervals matter. A point forecast without uncertainty is half a forecast. ETS and ARIMA give intervals analytically; for machine-learning models use quantile loss (see the quantile regression chapter, Chapter 21) or backtest the empirical error distribution.

Warning

The most damaging bugs in forecasting are silent: a feature that secretly uses the future, a scaler fit on the whole series, or a backtest that forgot to refit. They inflate accuracy on paper and only reveal themselves after deployment. Audit every feature against the question “could this have been computed at the origin?” before trusting any number.

88.7 Further Reading

Hyndman, R. J., and Athanasopoulos, G. (2021). Forecasting: Principles and Practice, 3rd edition. The standard open-access reference for ETS, ARIMA, and modern evaluation; the source of the MASE metric and the rolling-origin philosophy used here.
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control. The foundational treatment of ARIMA modeling.
Hyndman, R. J., Koehler, A. B., Ord, J. K., and Snyder, R. D. (2008). Forecasting with Exponential Smoothing: The State Space Approach. The theory behind the ETS framework.
Bergmeir, C., and Benitez, J. M. (2012). “On the use of cross-validation for time series predictor evaluation.” Information Sciences. On why and when time-aware validation is necessary.
Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2020). “The M4 Competition: 100,000 time series and 61 forecasting methods.” International Journal of Forecasting. Large-scale empirical comparison of statistical and machine-learning forecasters.
Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M., and Callot, L. (2020). “Criteria for classifying forecasting methods.” International Journal of Forecasting. A clear taxonomy of local vs global and statistical vs machine-learning approaches.

# Machine Learning for Time-Series Forecasting {#sec-time-series-forecasting} ```{r} #| include: false source("_common.R") ``` Most of the methods in this book assume that the rows of a data set are exchangeable: we could shuffle them without losing anything, and a random train/test split is a fair test of generalization. Time series break that assumption. A measurement at noon is not interchangeable with a measurement at midnight, last month's value tells you a great deal about this month's, and the future is, by definition, the part of the data you have not seen yet. Forecasting is the task of using the ordered past to say something useful about the ordered future. This chapter builds the topic from the ground up. We start with the forecasting setup and the notation, explain why naive cross-validation leaks information across time, and then construct the two ingredients that make modern machine-learning forecasters work: lag features and rolling-window summaries. With those in hand we compare three families of forecasters on a real series: classical statistical baselines (the naive forecast, exponential smoothing, and ARIMA), tree-based machine learning (gradient boosting, @sec-gradient-boosting), and a brief look at deep sequence models. Throughout, the running example forecasts monthly airline passenger counts, and every claim about accuracy is backed by a rolling-origin evaluation rather than a single lucky split. ::: {.callout-tip title="Intuition"} A forecaster is a function that maps "everything known up to now" into "a guess about what comes next." The art is choosing what to feed it (features) and how to score it honestly (evaluation that respects time). ::: ## The Forecasting Setup Let $\{y_t\}_{t=1}^{T}$ be a univariate time series observed at regular intervals, where $t$ indexes time and $y_t \in \mathbb{R}$ is the value at time $t$. We may also observe exogenous predictors $\mathbf{x}_t \in \mathbb{R}^{p}$ (for example a holiday flag or a promotion indicator). The information set available at time $t$ is $$ \mathcal{F}_t = \{\, y_s, \mathbf{x}_s : s \le t \,\}, $$ the entire ordered history through time $t$. A point forecast made at the forecast origin $t$ for a target $h$ steps ahead is written $$ \hat{y}_{t+h\mid t} = f\!\left(\mathcal{F}_t\right), $$ read as "the prediction of $y_{t+h}$ given information through $t$." The integer $h \ge 1$ is the forecast horizon. Forecasting one step ahead ($h=1$) is the easiest case; multi-step forecasting ($h > 1$) is harder because errors accumulate and because the inputs you would like to use (recent values of $y$) are themselves unknown for the steps in between. The quantity we ultimately care about is the forecast error, $$ e_{t+h} = y_{t+h} - \hat{y}_{t+h\mid t}, $$ and a forecasting model is good if these errors are small and unbiased across many origins. ::: {.callout-important title="Key idea"} Two numbers define every forecasting problem before you write any code: the horizon $h$ (how far ahead) and the origin (the last time point you are allowed to use). Fixing them up front prevents almost every subtle mistake that follows. ::: ### Decomposition and What Models Try to Capture A classical way to reason about a series is to imagine it as the sum of unobserved components, $$ y_t = T_t + S_t + R_t, $$ where $T_t$ is a slowly varying trend, $S_t$ is a seasonal pattern that repeats with known period $m$ (for monthly data with yearly seasonality, $m=12$), and $R_t$ is the remainder (the irregular part). When seasonal swings grow with the level of the series, a multiplicative form $y_t = T_t \times S_t \times R_t$ fits better, and taking logs turns it back into an additive problem because $\log y_t = \log T_t + \log S_t + \log R_t$. Every forecasting method in this chapter is, in effect, a different bet about how trend, seasonality, and noise combine. ## Splitting Time Honestly: Backtesting and Leakage The single most common error in applied forecasting is evaluating a model with a random split or with standard $k$-fold cross-validation. Both place future observations in the training set and past observations in the test set, which lets the model "peek" at information it would never have at deployment time. The result is an optimistic accuracy estimate that collapses in production. ::: {.callout-warning} Never shuffle a time series before splitting, and never use plain $k$-fold cross-validation on raw temporal data. Any procedure that lets a training point come from after a test point is leaking the future. ::: The honest alternative is rolling-origin evaluation, also called time-series cross-validation or backtesting. We pick an initial training window, forecast the next $h$ steps, record the error, then move the origin forward and repeat. There are two common variants: - Expanding window: the training set grows as the origin advances (use all history up to the origin). Good when more history always helps. - Sliding window: the training set keeps a fixed length and drops the oldest points. Good when the data-generating process drifts and old data is misleading. Formally, with origins $t = t_0, t_0+1, \dots, T-h$, the rolling-origin estimate of a loss $L$ at horizon $h$ is the average over origins, $$ \widehat{\text{CV}}_h = \frac{1}{N}\sum_{t=t_0}^{T-h} L\!\left(y_{t+h},\, \hat{y}_{t+h\mid t}\right), \qquad N = T - h - t_0 + 1 . $$ @fig-time-series-forecasting-rolling-origin illustrates the expanding-window scheme: each row is one origin, blue points are training, the orange point is the value being forecast, and gray points are not yet visible to the model. ```{r fig-time-series-forecasting-rolling-origin, fig.cap="Expanding-window rolling-origin evaluation. Each horizontal row is one forecast origin; the model trains on the blue points and is scored on the orange (future) point, while the gray points remain hidden. The origin advances downward, so no training point ever comes from after the point being scored.", fig.width=6.5, fig.height=4} set.seed(1) n_pts <- 16 origins <- 8:13 plot(NA, xlim = c(1, n_pts), ylim = c(length(origins) + 0.5, 0.5), xlab = "time index", ylab = "rolling origin (row)", yaxt = "n", main = "Rolling-origin (expanding window)") axis(2, at = seq_along(origins), labels = origins, las = 1) for (r in seq_along(origins)) { o <- origins[r] train <- 1:o test <- o + 1 future <- if (test < n_pts) (test + 1):n_pts else integer(0) points(train, rep(r, length(train)), pch = 19, col = "steelblue", cex = 1.1) points(test, r, pch = 17, col = "darkorange", cex = 1.5) if (length(future)) points(future, rep(r, length(future)), pch = 1, col = "gray70", cex = 1.1) } legend("topright", legend = c("train", "forecast target", "hidden future"), pch = c(19, 17, 1), col = c("steelblue", "darkorange", "gray70"), bty = "n", cex = 0.9) ``` @fig-time-series-forecasting-rolling-origin makes the rule visual: read any row left to right and the orange target always sits to the right of every blue training point. ::: {.callout-note} Leakage also hides inside feature engineering. If you scale features using the mean and standard deviation of the whole series, or impute missing values with a global statistic, the training rows have absorbed information from the test period. Compute every such statistic on the training window only, inside each fold. ::: ## Turning a Time Series into a Supervised Learning Problem Machine-learning regressors such as linear models, random forests, and gradient boosting do not know about time. They expect a feature matrix $\mathbf{X}$ and a target vector $\mathbf{y}$. The bridge from a series to that format is lag embedding: we define the target as the value $h$ steps ahead and the features as a window of recent past values. For one-step-ahead forecasting with $k$ lags, the design is $$ \underbrace{y_t}_{\text{target}} \;\longleftarrow\; \big(\, y_{t-1},\, y_{t-2},\, \dots,\, y_{t-k} \,\big), $$ so each training row is a length-$k$ snapshot of history paired with the next value. This is exactly the autoregressive idea: the order-$k$ autoregressive model $\text{AR}(k)$ is the special case where $f$ is linear, $$ \hat{y}_{t} = c + \sum_{j=1}^{k}\phi_j\, y_{t-j}. $$ Replacing that linear $f$ with a tree ensemble or any other learner gives a nonlinear autoregression, which is the workhorse of machine-learning forecasting. Two enrichments make lag features much stronger: - Rolling-window summaries. Beyond raw lags, summarize recent history with a moving statistic computed strictly from past values, for example a trailing mean over the last $w$ points, $\bar{y}_{t-1}^{(w)} = \frac{1}{w}\sum_{j=1}^{w} y_{t-j}$, or a trailing standard deviation. These smooth out noise and hand the model a ready-made notion of "recent level" and "recent volatility." - Calendar features. Encode the position in the seasonal cycle (month of year, day of week) so the model can learn seasonality directly. For monthly data the month index $1,\dots,12$ is enough for a tree; linear models prefer sine/cosine pairs such as $\sin(2\pi m_t / 12)$ and $\cos(2\pi m_t / 12)$. ::: {.callout-warning} Every feature must be computable at the forecast origin. A lag of order 1 uses $y_{t-1}$, which is fine. A "centered" moving average that averages $y_{t-1}, y_t, y_{t+1}$ is not: it uses the future. When in doubt, ask whether the feature could have been written down the moment before $y_t$ was observed. ::: ### Multi-Step Horizons: Recursive vs Direct Once we want $h > 1$ steps, there are two standard strategies. The recursive strategy trains a single one-step model and feeds its own predictions back as inputs: forecast $\hat{y}_{t+1\mid t}$, then treat it as if observed to forecast $\hat{y}_{t+2\mid t}$, and so on. It is simple and data-efficient but lets one-step errors compound across the horizon. The direct strategy trains a separate model for each horizon, with target $y_{t+h}$ regressed on features available at $t$, for each $h = 1, \dots, H$. It avoids error feedback at the cost of fitting $H$ models and ignoring the relationship between consecutive horizons. ::: {.callout-tip title="When to use this"} Use the recursive strategy when you have little data or a short horizon, and the direct strategy when the horizon is long, you have enough data to fit several models, and you can tolerate the extra bookkeeping. Hybrid schemes exist, but starting with recursive and checking whether direct helps is a sound default. ::: ## A Worked Comparison on AirPassengers We now forecast the built-in `AirPassengers` series: monthly totals of international airline passengers from 1949 to 1960, a classic example with a clear upward trend and strong yearly seasonality whose amplitude grows over time. We hold out the final 24 months as a test set, forecast them, and compare a naive seasonal baseline, exponential smoothing (ETS), ARIMA, and an `xgboost` regressor built on lag and calendar features. The multiplicative seasonality (swings widen as the level rises) is exactly the case where a log transform helps, so the machine-learning model is trained on $\log y_t$ and predictions are exponentiated back. ```{r time-series-forecasting-setup} suppressPackageStartupMessages({ library(forecast) library(xgboost) library(Metrics) }) data("AirPassengers") y <- as.numeric(AirPassengers) n <- length(y) freq <- 12L h_test <- 24L # hold out final 2 years train_idx <- 1:(n - h_test) test_idx <- (n - h_test + 1):n # Calendar position (month of year) for every observation month_of_year <- as.integer(cycle(AirPassengers)) ``` ### Building the Supervised Frame We construct lag features (orders 1, 2, 3, and the seasonal lag 12), trailing rolling summaries computed only from past values, and the month index. The helper below returns a data frame aligned so that row $t$ holds the target $y_t$ and features that were all knowable at time $t-1$. ```{r time-series-forecasting-features} make_supervised <- function(y, month, lags = c(1, 2, 3, 12), roll_windows = c(3, 12)) { N <- length(y) df <- data.frame(target = y, month = month) for (L in lags) { df[[paste0("lag", L)]] <- c(rep(NA, L), y[1:(N - L)]) } # Trailing rolling mean/sd that END at t-1 (no leakage) for (w in roll_windows) { rm_ <- rep(NA_real_, N) rs_ <- rep(NA_real_, N) for (t in seq_len(N)) { if (t - 1 >= w) { win <- y[(t - w):(t - 1)] rm_[t] <- mean(win) rs_[t] <- sd(win) } } df[[paste0("rollmean", w)]] <- rm_ df[[paste0("rollsd", w)]] <- rs_ } df } # Train on the log scale because seasonality is multiplicative sup <- make_supervised(log(y), month_of_year) sup_complete <- sup[stats::complete.cases(sup), ] sup_complete$row <- as.integer(rownames(sup_complete)) feat_cols <- setdiff(names(sup_complete), c("target", "row")) head(sup_complete[, c("row", "target", "lag1", "lag12", "rollmean3", "month")]) ``` ::: {.callout-note} The first 12 rows have no seasonal lag and are dropped by `complete.cases`. This is normal: lag features always cost you a warm-up period equal to the largest lag. ::: ### Fitting the Models We train ETS and ARIMA with the `forecast` package on the original scale (they handle the trend and seasonality internally), and an `xgboost` model on the engineered log-scale features. The seasonal naive forecast, which simply repeats the value from 12 months earlier, is our floor: any method that cannot beat it is not worth its complexity. ```{r time-series-forecasting-fit} train_ts <- ts(y[train_idx], frequency = freq, start = c(1949, 1)) # Classical baselines fit_snaive <- snaive(train_ts, h = h_test) fit_ets <- forecast(ets(train_ts), h = h_test) fit_arima <- forecast(auto.arima(train_ts), h = h_test) # xgboost on engineered features (recursive multi-step below) tr <- sup_complete[sup_complete$row <= max(train_idx), ] dtrain <- xgb.DMatrix(as.matrix(tr[, feat_cols]), label = tr$target) set.seed(42) xgb_fit <- xgb.train( params = list(objective = "reg:squarederror", max_depth = 3, eta = 0.1, subsample = 0.9, colsample_bytree = 0.9), data = dtrain, nrounds = 300, verbose = 0 ) ``` The boosting model is one-step-ahead by construction, so to cover the 24-month horizon we forecast recursively: predict the next month, append the prediction to the history, recompute the lag and rolling features, and step forward. ```{r time-series-forecasting-recursive} recursive_forecast <- function(model, y_log_hist, month_future, h, lags = c(1, 2, 3, 12), roll_windows = c(3, 12), feat_cols) { hist <- y_log_hist # log-scale history, grows as we predict preds <- numeric(h) for (i in seq_len(h)) { Ncur <- length(hist) row <- list(month = month_future[i]) for (L in lags) row[[paste0("lag", L)]] <- hist[Ncur - L + 1] for (w in roll_windows) { win <- hist[(Ncur - w + 1):Ncur] row[[paste0("rollmean", w)]] <- mean(win) row[[paste0("rollsd", w)]] <- sd(win) } xrow <- as.matrix(as.data.frame(row)[, feat_cols, drop = FALSE]) p <- predict(model, xgb.DMatrix(xrow)) preds[i] <- p hist <- c(hist, p) # feed prediction back } preds } month_future <- month_of_year[test_idx] xgb_log_pred <- recursive_forecast( xgb_fit, log(y[train_idx]), month_future, h_test, feat_cols = feat_cols ) xgb_pred <- exp(xgb_log_pred) # back to passenger counts ``` ### Visualizing the Forecasts @fig-time-series-forecasting-forecasts overlays the four forecasts on the held-out period. The seasonal naive forecast keeps the right shape but misses the growth; ETS and ARIMA track both trend and seasonality closely; the boosting model, trained only on lags and calendar position, recovers the seasonal pattern from data alone. ```{r fig-time-series-forecasting-forecasts, fig.cap="Forecasts of monthly AirPassengers counts over the 24-month hold-out period. The black line is the truth; colored lines are the seasonal naive, ETS, ARIMA, and xgboost forecasts. ETS, ARIMA, and xgboost all capture trend and seasonality, while the seasonal naive baseline misses the upward growth.", fig.width=7, fig.height=4.5} actual <- y[test_idx] time_test <- time(AirPassengers)[test_idx] plot(time_test, actual, type = "o", pch = 19, lwd = 2, ylim = range(actual, fit_ets$mean, fit_arima$mean, xgb_pred, fit_snaive$mean), xlab = "year", ylab = "passengers (thousands)", main = "Hold-out forecasts vs actual") lines(time_test, as.numeric(fit_snaive$mean), col = "gray50", lwd = 2, lty = 2) lines(time_test, as.numeric(fit_ets$mean), col = "steelblue", lwd = 2) lines(time_test, as.numeric(fit_arima$mean), col = "darkgreen", lwd = 2) lines(time_test, xgb_pred, col = "darkorange", lwd = 2) legend("topleft", bty = "n", lwd = 2, lty = c(1, 2, 1, 1, 1), col = c("black", "gray50", "steelblue", "darkgreen", "darkorange"), legend = c("actual", "seasonal naive", "ETS", "ARIMA", "xgboost")) ``` @fig-time-series-forecasting-forecasts is a single-split snapshot. To judge the methods fairly we must average over many origins, which is the next step. ### Rolling-Origin Evaluation We score each method with an expanding-window backtest at horizon $h=1$. Starting from an initial window of 100 months, we forecast the next month, advance the origin, and repeat to the end of the series. The error metrics are the mean absolute error (MAE) and the root mean squared error (RMSE), $$ \text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert e_i\rvert, \qquad \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^{2}} , $$ and the mean absolute scaled error (MASE), which divides MAE by the in-sample MAE of the seasonal naive forecast so that a value below 1 means "better than the seasonal naive baseline." ```{r time-series-forecasting-backtest} init <- 100L origins <- init:(n - 1L) # one-step-ahead origins err_snaive <- err_ets <- err_arima <- err_xgb <- numeric(length(origins)) for (k in seq_along(origins)) { o <- origins[k] ytr <- ts(y[1:o], frequency = freq, start = c(1949, 1)) truth <- y[o + 1] # Seasonal naive: value from 12 months earlier sn <- y[o + 1 - freq] # ETS and ARIMA refit each origin (small series, fast) e1 <- as.numeric(forecast(ets(ytr), h = 1)$mean) a1 <- as.numeric(forecast(auto.arima(ytr), h = 1)$mean) # xgboost: rebuild supervised frame on log scale up to o, fit, predict 1 step sup_o <- make_supervised(log(y[1:o]), month_of_year[1:o]) sup_o <- sup_o[stats::complete.cases(sup_o), ] d_o <- xgb.DMatrix(as.matrix(sup_o[, feat_cols]), label = sup_o$target) set.seed(42) m_o <- xgb.train( params = list(objective = "reg:squarederror", max_depth = 3, eta = 0.1, subsample = 0.9, colsample_bytree = 0.9), data = d_o, nrounds = 200, verbose = 0 ) x1 <- recursive_forecast(m_o, log(y[1:o]), month_of_year[o + 1], 1, feat_cols = feat_cols) xg <- exp(x1) err_snaive[k] <- truth - sn err_ets[k] <- truth - e1 err_arima[k] <- truth - a1 err_xgb[k] <- truth - xg } # In-sample seasonal-naive MAE for MASE scaling scale_mae <- mean(abs(diff(y, lag = freq))) summarise_err <- function(e) { c(MAE = mean(abs(e)), RMSE = sqrt(mean(e^2)), MASE = mean(abs(e)) / scale_mae) } results <- rbind( `Seasonal naive` = summarise_err(err_snaive), `ETS` = summarise_err(err_ets), `ARIMA` = summarise_err(err_arima), `xgboost` = summarise_err(err_xgb) ) results <- round(results, 3) ``` @tbl-time-series-forecasting-results reports the rolling-origin errors across `r length(origins)` one-step forecasts. Lower is better on all three columns, and a MASE below 1 beats the seasonal naive baseline. ```{r tbl-time-series-forecasting-results} knitr::kable( results, caption = "Rolling-origin (expanding-window) one-step-ahead forecast accuracy on AirPassengers, averaged over all origins from month 100 to the end of the series. MAE and RMSE are in passenger-count units; MASE is scaled so that values below 1 beat the seasonal naive baseline." ) ``` @tbl-time-series-forecasting-results tells the realistic story: the statistical models (ETS, ARIMA) are very strong on a clean, strongly seasonal series like this one, and a generic boosting model trained on a handful of lag features is competitive but does not automatically dominate. That ordering is typical and worth internalizing. ::: {.callout-important title="Key idea"} On short, clean, strongly seasonal univariate series, well-tuned classical methods are hard to beat. Machine-learning forecasters earn their keep when you have many related series, many exogenous predictors, nonlinear interactions, or irregular structure that ARIMA and ETS cannot express. ::: ## Where Deep Sequence Models Fit When the data is long, high-dimensional, or shared across thousands of related series (retail demand across stores and products, sensor networks, energy load), deep models become attractive. Recurrent networks (LSTM, GRU) carry a hidden state forward through time, and Transformer-style architectures use self-attention to relate distant time points directly. These models can learn cross-series patterns through global training, which is exactly where the per-series classical methods struggle. The neural-network architectures behind them are developed in the chapters on neural networks (@sec-neural-networks) and Transformers (@sec-transformers); the code below is an illustrative Keras sketch of an LSTM forecaster and is not executed here, because it needs the Python/TensorFlow backend. ```{r time-series-forecasting-lstm-sketch, eval=FALSE} # Illustrative only (requires keras + TensorFlow Python backend). library(keras) # Suppose X has shape (samples, timesteps, features) and yv is the target. model <- keras_model_sequential() |> layer_lstm(units = 32, input_shape = c(dim(X)[2], dim(X)[3])) |> layer_dense(units = 1) model |> compile(optimizer = "adam", loss = "mse") model |> fit(X, yv, epochs = 50, batch_size = 16, validation_split = 0.2) preds <- model |> predict(X_future) ``` ::: {.callout-tip title="When to use this"} Reach for deep sequence models when you are forecasting many series jointly, have rich covariates, or need to capture long-range and nonlinear dependencies. For a single short series, they are usually overkill and underperform a tuned ETS or ARIMA. ::: ## Practical Guidance and Pitfalls Forecasting projects fail in predictable ways. The recurring lessons: - Always benchmark against naive methods. The seasonal naive and the last-value naive forecasts are free. If your elaborate model cannot beat them on a rolling-origin backtest, the complexity is not buying anything. - Evaluate with the deployment horizon. If you will forecast 12 weeks ahead, backtest at $h=12$, not $h=1$. One-step accuracy is a poor proxy for multi-step accuracy because errors compound. - Refit inside the backtest loop. Statistics used by features (means, scales) and model parameters must be recomputed on each training window. Fitting once on all data and then "evaluating" leaks the future, exactly the trap @tbl-time-series-forecasting-results avoids by refitting at every origin. - Mind non-stationarity. Trends and changing variance violate the assumptions of many models. Differencing ($\nabla y_t = y_t - y_{t-1}$), seasonal differencing, and log or Box-Cox transforms are the standard fixes; ARIMA's middle "I" is integrated differencing. - Watch the warm-up cost. Large lags shorten your usable training data. A seasonal lag of 12 throws away the first year before any row is complete. - Tune the right thing. For boosting, the number of rounds, tree depth, and learning rate interact; tune them with the same rolling-origin scheme you use for evaluation, never with a random split. - Prediction intervals matter. A point forecast without uncertainty is half a forecast. ETS and ARIMA give intervals analytically; for machine-learning models use quantile loss (see the quantile regression chapter, @sec-quantile-reg) or backtest the empirical error distribution. ::: {.callout-warning} The most damaging bugs in forecasting are silent: a feature that secretly uses the future, a scaler fit on the whole series, or a backtest that forgot to refit. They inflate accuracy on paper and only reveal themselves after deployment. Audit every feature against the question "could this have been computed at the origin?" before trusting any number. ::: ## Further Reading - Hyndman, R. J., and Athanasopoulos, G. (2021). *Forecasting: Principles and Practice*, 3rd edition. The standard open-access reference for ETS, ARIMA, and modern evaluation; the source of the MASE metric and the rolling-origin philosophy used here. - Box, G. E. P., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). *Time Series Analysis: Forecasting and Control*. The foundational treatment of ARIMA modeling. - Hyndman, R. J., Koehler, A. B., Ord, J. K., and Snyder, R. D. (2008). *Forecasting with Exponential Smoothing: The State Space Approach*. The theory behind the ETS framework. - Bergmeir, C., and Benitez, J. M. (2012). "On the use of cross-validation for time series predictor evaluation." *Information Sciences*. On why and when time-aware validation is necessary. - Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2020). "The M4 Competition: 100,000 time series and 61 forecasting methods." *International Journal of Forecasting*. Large-scale empirical comparison of statistical and machine-learning forecasters. - Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M., and Callot, L. (2020). "Criteria for classifying forecasting methods." *International Journal of Forecasting*. A clear taxonomy of local vs global and statistical vs machine-learning approaches.