Prediction Intervals

Summary

A prediction interval provides a range of values that a future individual observation will lie between, with a certain probability.

Prediction intervals are ranges that estimate the future individual observations to lie with a certain probability. These intervals are used in statistical modelling and machine learning to represent the uncertainty in predictions. This range differs from a confidence interval, which estimates the range of a population parameter rather than future observations.

Probability Distribution

Derivation

Examples

Normal Distribution

Problem

The list below shows aresenic concentrations (ppb) collected every two months at two groundwater monitoring wells:

Well	Year	Arsenic (ppb)
Background	1	`[12.6, 30.8, 52.0, 28.1]`
Background	2	`[33.3, 44.0, 3.0, 12.8]`
Background	3	`[58.1, 12.6, 17.6, 25.3]`
Compliance	4	`[48.0, 30.3, 42.5, 15.0]`
Compliance	5	`[47.6, 3.8, 2.6, 51.9]`

Summary statistics show year 4 has a higher sample mean and lower variance:

apply(df, 2, \(x) c("mean" = mean(x), "sd" = sd(x)))

	y1	y2	y3	y4	y5
mean	30.87	23.27	28.40	33.95	26.47
sd	16.20	18.71	20.47	14.63	26.93

Is there evidence of contamination for the compliance well in year 4?

Theoretical

Note

This section is adapted from ¹

Given a sample from a normal distribution with a known variance

σ^{2}

, we know:

$\overset{x}{ˉ} \sim N (μ, \frac{1}{n})$
$x \sim N (μ, 1 + \frac{1}{n})$

$y_{n + 1} \overset{y}{ˉ} y_{n + 1} - \overset{y}{ˉ} \frac{y _{n + 1} - y ˉ}{1 - \frac{1}{n}} ⟹ y_{n + 1} \sim N (μ, σ) \sim N (μ, \frac{σ}{n}) \sim N (mean (μ - μ), σ^{2} + \frac{σ ^{2}}{n}) Assume the mean value and next value are independent \sim N (0, var (σ 1 + \frac{1}{n})) \sim N (0, σ) \in \overset{y}{ˉ} \pm z_{α} 1 + n$

If $σ$ is not known (which is likely), it can be shown ²:

$\frac{y _{n + 1}}{s} \sim t_{α, n - 1}$

This gives us:

$\overset{y}{ˉ} \pm s_{y} \cdot t_{α} 1 + \frac{1}{n}$

This, however, is only a one step prediction. To take multiple predictions forward is yet more work. The 95% probability would correspond to the product of both observations and the interval would have to be adjusted accordingly. This is complicated to implement, so instead we use the Bonferroni Correction. Broadly speaking, the Bonferroni Correction offsets the fallacy of multiple comparisons by dividing the quantile by the number of comparisons, so instead the interval would be given by ³:

Bonferroni Prediction Interval of Normal Distribution

$\overset{y}{ˉ} \pm s_{y} \cdot t_{α / k} 1 + \frac{1}{n}$

The help for EnvStats suggests this method will be more conservative:

Approximate Method Based on the Bonferroni Inequality (method="Bonferroni")

As shown above, when k=1k=1, the value of KK is given by Equation (8) or Equation (9) for two-sided or one-sided prediction intervals, respectively. When k>1k>1, a conservative way to construct a (1−α∗)100%(1−α∗)100% prediction interval for the next kk observations or averages is to use a Bonferroni correction (Miller, 1981a, p.8) and set α=α∗/kα=α∗/k in Equation (8) or (9) (Chew, 1968). This value of KK will be conservative in that the computed prediction intervals will be wider than the exact predictions intervals. Hahn (1969, 1970a) compared the exact values of KK with those based on the Bonferroni inequality for the case of m=1m=1 and found the approximation to be quite satisfactory except when nn is small, kk is large, and αα is large. For example, Gibbons (1987a) notes that for a 99% prediction interval (i.e., α=0.01α=0.01) for the next kk observations, if n>4n>4, the bias of KK is never greater than 1% no matter what the value of kk.

However, prediction intervals will always be too narrow ⁴, so using the Bonferroni correction is no less accurate and it is simpler to implement (even if libraries do it, we must verify our results). The EnvStats package includes both the bonferroni method and the exect method, for more information on the exact method see ⁵. The formula for the exact method is included below, but is not needed, this formula is adapted from ⁶:

Unnecessary math

$1 - α = \int_{0}^{\infty} (F (d s, k, ρ) h (s n - 1, n - 1) n - 1) d s$

Where:

$F (x, k, ρ) = \int_{- \infty}^{\infty} ([Φ (\frac{x + y ρ}{1 - ρ})]^{k} ϕ (y)) d y$
$ρ = \frac{1}{\frac{n}{m} + 1}$
$h (x, v) = \frac{x ^{v - 1} e x p ( \frac{- x ^{2}}{2} )}{2 ^{(ν /2) - 1} Γ ( \frac{ν}{2} )}$
$ϕ$ is the pdf of the standard normal distribution
$Φ$ is the cumulative distribution function of the standard normal distribution.

Putting this all together, in R:

df <- data.frame(
    y1 = c(12.6, 30.8, 52.0, 28.1),
    y2 = c(33.3, 44.0, 3.0,  12.8),
    y3 = c(58.1, 12.6, 17.6, 25.3),
    y4 = c(48.0, 30.3, 42.5, 15.0),
    y5 = c(47.6,  3.8,  2.6, 51.9))

y <- background <- c(df$y1, df$y2, df$y3)
compliance <- c(df$y3, df$y4)

pred_int_normal_population <- function(y, k, conf_level) {
    n <- length(y)
    a <- (1-conf_level) / 2
    mean(y) + c(1, -1) * sd(y) * qt(p = a / k, df = n - 1) * sqrt(1 + 1/n)
}

pred_int_normal_population(y,            k = 4, conf_level = 0.95)
## EnvStats::predIntNorm(x = y, n.mean = 1, k = 4, conf.level = 0.95)

[1] -25.54131  80.57465

Prediction Intervals of Other Models

Note

Adapted from §5.5 of ⁷

Generally speaking, a prediction interval is given by:

$\overset{y}{^}_{T + h ∣ T} \pm t_{α / k} \overset{σ}{^}_{h},$

where:

$\overset{σ}{^}_{h} = K \overset{σ}{^}$
$\overset{σ}{^} = \frac{1}{T - K - M} \sum_{t = 1}^{T} e_{t}^{2}$
- $\approx$ sd(y-yhat)

Different K values can be solved:

Method	$h$ -step forecast standard deviation
Mean	$\overset{σ}{^}_{h} = \overset{σ}{^} 1 + 1/ T$
Naive	$\overset{σ}{^}_{h} = \overset{σ}{^} h$
Seasonal Naive	$\overset{σ}{^}_{h} = \overset{σ}{^} k + 1$
Drift	$\overset{σ}{^}_{h} = \overset{σ}{^} h (1 + h / (T - 1))$

This will be handled for us by the forecast or EnvStats package.

Using `predict`

It's also possible to fit a stationary linear model, which is a mean-value time series, and call the predict function:

## y is the variable of concern
y <- rnorm(10)

## Generate some indices
x <- 1:10
 
## Fit a line
## The slope should be zero, so the model will fit only intercept
model <- lm(y ~ 0)

# Now create a new data frame with the new x values
new_x <- data.frame(x = 11:14)

# Use the 'predict' function to make prediction
predictions <- predict(model, new_x, interval = "prediction")

print(predictions)

EnvStats Package

Note

This is the approach to use, because it is the simplest.


df <- data.frame(
    y1 = c(12.6, 30.8, 52.0, 28.1),
    y2 = c(33.3, 44.0, 3.0,  12.8),
    y3 = c(58.1, 12.6, 17.6, 25.3),
    y4 = c(48.0, 30.3, 42.5, 15.0),
    y5 = c(47.6,  3.8,  2.6, 51.9))


library(EnvStats)
EnvStats::predIntNorm(
            c(df$y1, df$y2, df$y3), # Flatten the values
            n.mean = 1,             # 1 for 1 observation
            k = 4,                  # Make 4 Future Predictions
            conf.level = 0.95       # The probability for the interval
            )

Results of Distribution Parameter Estimation
--------------------------------------------

Assumed Distribution:            Normal

Estimated Parameter(s):          mean = 27.51667
                                 sd   = 17.10119

Estimation Method:               mvue

Data:                            c(df$y1, df$y2, df$y3)

Sample Size:                     12

Prediction Interval Method:      Bonferroni

Prediction Interval Type:        two-sided

Confidence Level:                95%

Number of Future Observations:   4

Prediction Interval:             LPL = -25.54131
                                 UPL =  80.57465

Each of the values [48.0, 30.3, 42.5, 15.0] is within $(- 26, 81)$ so there is insufficient evidence to conlcude there was contamination in year 4.

If we wanted to look over the next 8 years, the prediction interval would get wider:

library(EnvStats)
x = EnvStats::predIntNorm(
            c(df$y1, df$y2, df$y3), # Flatten the values
            n.mean = 1,             # 1 for 1 observation
            k = 8,                  # Make 4 Future Predictions
            )


lpl <- x$interval$limits[1]
upl <- x$interval$limits[2]
obs <- c(df$y4, df$y5)
if (sum(lpl < obs & obs < upl)) {
  print("No Evidence")
} else {
  cat("Evidence to Reject H0")
}

Results of Distribution Parameter Estimation
--------------------------------------------

Assumed Distribution:            Normal

Estimated Parameter(s):          mean = 27.51667
                                 sd   = 17.10119

Estimation Method:               mvue

Data:                            c(df$y1, df$y2, df$y3)

Sample Size:                     12

Prediction Interval Method:      Bonferroni

Prediction Interval Type:        two-sided

Confidence Level:                95%

Number of Future Observations:   8

Prediction Interval:             LPL = -32.47117
                                 UPL =  87.50450

Log-Normal Distribution

All values are strictly positive, this may suggest a log-normal distribution is more appropriate.

Theoretical

This can be solved by using a log transform and then exponentiating the intervals:

pred_int_log_normal_population <- function(y, k, conf_level) {
    n <- length(y)
    a <- (1-conf_level) / 2

    y  <- log(y)
    SE <- sd(y) * sqrt(1 + 1/n)
    t  <- qt(p = a / k, df = n - 1)
    intervals <- mean(y) + c(1, -1) *  t * SE
    exp(intervals)
}

1.679715  278.144795

EnvStats

EnvStats::predIntLnorm(x = y, k = 4, conf.level = 0.95)

Results of Distribution Parameter Estimation
--------------------------------------------

Assumed Distribution:            Lognormal

Estimated Parameter(s):          meanlog = 3.0733829
                                 sdlog   = 0.8234277

Estimation Method:               mvue

Data:                            y

Sample Size:                     12

Prediction Interval Method:      Bonferroni

Prediction Interval Type:        two-sided

Confidence Level:                95%

Number of Future Observations:   4

Prediction Interval:             LPL =   1.679715
                                 UPL = 278.144795

>

Built-in Function: Population Value From Sample Value

Linear Regression

Prediction Intervals of Linear Models

Prediction Interval of Linear Model

$Y_{i} \in B \pm t_{α, (n - p)} var (Y - \hat{Y})$

Examples

Theoretical: Next Value from a Model

The prediction interval for a linear value is given by ⁸:

$\overset{y}{^}_{h} \pm t_{(1 - α /2, n - 2)} \times MSE \times (1 + \frac{1}{n} + \frac{( x _{i} - x ˉ ) ^{2}}{\sum _{j = 1}^{n} ( x _{j} - x ˉ ) ^{2}})$


set.seed(123) # Ensuring reproducibility
n = 100 # Defining n
x = 1:n
m = 3
err = rnorm(n) # Creating error term
y = m * x + err # Generating y
data = data.frame(x = x, y = y) # Creating a data frame

x_i = 7 # This the point we are interested in

# Calculate prediction interval
fit = lm(y ~ x, data) # Fitting the model
predict(fit, newdata = list(x = x_i), interval = "predict", level = 0.95)

# Calculate prediction interval by hand
SE = summary(fit)$sigma # Extract the standard error from the model fit
hat_y_i = coef(fit)["(Intercept)"] + coef(fit)["x"] * x_i # Manual prediction using the model coefficients
t_val = qt(p = 0.975, df = n - 2) # t-value for 95% confidence level with n - 2 df [fn_df]
pred_int = hat_y_i + c(fit = 0, lwr = -1,  upr = 1) * SE * sqrt(1 + 1/n + (x_i - mean(x))^2 / (n-1)*sd(x)) * t_val # Prediction Interval

# Results
pred_int


## footnotes
## [fn_df] Each parameter in the model consumes a degree of freedom
## 1 for intercept + 1 for slope
## See generally ch. 2 and 3 of
## Bingham, N. H., and John M. Fry. Regression: Linear Models in Statistics. Springer Undergraduate Mathematics Series. London ; New York: Springer, 2010.

     fit      lwr      upr
20.98117 19.13687 22.82548
     fit      lwr      upr 
20.98117 19.13687 22.82548

In the interest of completeness, the SE could also be calculated from first principles:

## Define a function to get Standard Error from linear regression
linear_SE <- function(model) {
    ## Residual Sum of Squares (RSS)
    ## Could also do
    ## sum((y - predict(model))^2)
    rss = sum(model$residuals^2)

    # Degrees of Freedom
    df = n - length(coef(model))

    # Compute Standard Error (SE)
    SE = sqrt(rss/df)

    ## Should be equal to
    ## Compare with SE from summary
    ## SE = summary(model)$sigma
}

Built-in Function: Next Value from a Model

Base Plots

# Build the linear mod
m <- lm(dist ~ speed, data = cars)

# Create a Prediction and Confidence interval for new data on the mod
alpha <- 0.95
newdata <- list(speed = cars$speed)
make_pred <- function(type) {
    predictions = predict(m, newdata = newdata, interval = type, level = alpha)
    data.frame(speed = cars$speed, predictions)
}

# Create datafra
pred_df <- make_pred("prediction")
conf_df <- make_pred("confidence")

# Plot the Da
plot(dist ~ speed,
    data = cars,
    xlim = range(speed),
    ylim = range(dist),
    main = "Prediction and Confidence Intervals",
    xlab = "Speed",
    ylab = "Distance"
)

# Add the prediction interv
cols = c(pred = "red", conf = "blue")
with(pred_df, {
    lines(speed, fit, lwd = 2)
    lines(speed, lwr, lty = 2, col = cols["pred"])
    lines(speed, upr, lty = 2, col = cols["pred"])
})
# Add the Confidence Interv
with(conf_df, {
    lines(speed, fit, lwd = 2)
    lines(speed, lwr, lty = 2, col = cols["conf"])
    lines(speed, upr, lty = 2, col = cols["conf"])
})

## Add a lege
legend("topright",
    legend = c("Prediction", "Confidence"),
    lty = c(2, 2), col = cols, lwd = 2
)

Tidyverse and GGPlot

# Load necessary librari
library(ggplot2)
library(tidyverse)
library(broom)

# Fit a linear model (Predict mpg using a single predictor, e.g., w
fit <- lm(mpg ~ wt, data=mtcars)

newdata <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), len=100))

conf_int <- predict(fit, newdata = newdata, interval = 'confidence') %>%
  as_tibble() %>%
  mutate(wt = newdata$wt)

pred_int <- predict(fit, newdata = newdata, interval = 'prediction') %>%
  as_tibble() %>%
  mutate(wt = newdata$wt)

# Making ggplot including confidence interv
ggplot() +
  geom_point(data = mtcars, aes(x=wt, y=mpg)) +
  geom_line(data=conf_int, aes(x=wt, y=fit)) +
  geom_ribbon(data=pred_int, aes(x = wt, y = fit, ymin=lwr, ymax=upr), alpha=0.3, fill = "royalblue") +
  geom_ribbon(data=conf_int, aes(x = wt, y = fit, ymin=lwr, ymax=upr), alpha=0.3, fill = "indianred") +
  labs(title="Confidence Interval of MPG for Weight", x="Weight", y="Miles Per Gallon") +
  theme_classic()

ggsave("../assets/images/confidence_and_prediction_interval.png")

Bootstrap

A simple bootstrap for the prediction interval of a linear model is included below. Note this assumes constant variance across the model, see ⁷ for a more detailed walkthrough of bootstrapping prediction intervals.

pred_vals <- replicate(10 ^ 4, {
  n = nrow(my_cars)
  form <- dist ~ speed
  ## Re-Sample the cars with replacement
  s = sample(1:n, replace = TRUE)
  cars_boot <- my_cars[s,]

  ## Fit a Linear Model
  mod <- lm(form, cars_boot)

  inter <- coef(mod)[1]
  slope <- coef(mod)[2]

  ## Get the predicted value at x=10
  pred_val <- 10 * slope + inter
  
  ## Find the residuals of the bootstrap model
  residuals <- resid(mod)
  
  ## Randomly select one of the residuals
  selected_resid <- sample(residuals, 1)
  
  ## Add the selected residual to the predicted value
  pred_val_selected_resid <- pred_val + selected_resid

  pred_val_selected_resid

})

hist(pred_vals)

quantile(pred_vals, c(0.025, 0.975))

## Confidence Interval
form <- dist ~ speed
fit <- lm(form, my_cars)
stats::predict(
  fit,
  newdata = list(speed = 10),
  interval = "prediction",
  levels = c(0.025, 0.975)
)

Derivation

Note

Adapted from ⁹.

We define Linear regression for matrices $X$ and $Y$ as:

$Y = BX$

The closed form solution for $X$ is:

$B = (X^{T} X)^{- 1} X^{T} Y$

We define:

Population Slope
- $β$
Sample Slope
- $B$

In the context of linear regression both $β$ and $B$ represent matrices of coefficients. If $size (Y) = n \times 1$ then the coefficient matrices will simply be vectors.

As a first step, we must solve the covariance of the weights matrix, before reading further make sure to review Covariance.

To begin, recall that the sum of variance in two vectors is the sum of the variance in each plus double the covariance:

$var (X - Y) = i = 1 \sum n [(x - y - (\overset{x}{ˉ} - \overset{y}{ˉ}))^{2}] = i = 1 \sum n [x^{2} - 2 x y + y^{2} - 2 x \overset{x}{ˉ} + 2 x \overset{y}{ˉ} + 2 \overset{x}{ˉ} y + 2 y \overset{y}{ˉ} + \overset{x}{ˉ}^{2} - 2 \overset{x}{ˉ} y + \overset{y}{ˉ}^{2}] = i = 1 \sum n [x^{2} - 2 x \overset{x}{ˉ} + \overset{x}{ˉ}^{2}] - 2 i = 1 \sum n [(x - \overset{x}{ˉ}) (y - \overset{y}{ˉ})] + i = 1 \sum n [y^{2} - 2 y \overset{y}{ˉ} + \overset{y}{ˉ}^{2}] = var (x) - 2 cov (x, y) + var (y)$

We can decompose the weights matrix to include the population weights and the noise in the observation:

$B = (X^{T} X)^{- 1} X^{T} Y = (X^{T} X)^{- 1} X^{T} (Xβ + ε) = (X^{T} X)^{- 1} X^{T} Xβ + (X^{T} X)^{- 1} X^{T} ε = (X^{T} X)^{- 1} (X^{T} X) β + (X^{T} X)^{- 1} X^{T} ε = β + (X^{T} X)^{- 1} X^{T} ε$

This can be used to show the weights matrix is an unbiased estimator of the population weights:

$E (B) = E [β + (X^{T} X)^{- 1} X^{T} ε] = E [β] + E [(X^{T} X)^{- 1} X^{T} ε] = β + (X^{T} X)^{- 1} X^{T} E [ε] The expected value of ε is 0 = β + (X^{T} X)^{- 1} X^{T} 0 = β$

Re-arranging, this gives the difference between the sample and population weights matrix:

$B - E [B] (B - E [B])^{T} = B - β = β + (X^{T} X)^{- 1} X^{T} ε - β = (X^{T} X)^{- 1} X^{T} ε = ((X^{T} X)^{- 1} X^{T} ε)^{T} = (X^{T} ε)^{T} ((X^{T} X)^{- 1})^{T} = εX ((X^{T} X)^{T})^{- 1} = εX (X^{T} X)^{- 1}$ Finally, we can solve the covariance matrix in terms of the population standard deviation: $cov (B) = E [(B - E [B]) (B - E [B])^{T}] = (X^{T} X)^{- 1} X^{T} E [ε ε^{T}] X (X^{T} X)^{- 1} = (X^{T} X)^{- 1} X^{T} σ^{2} I X (X^{T} X)^{- 1} = σ^{2} (X^{T} X)^{- 1} X^{T} X (X^{T} X)^{- 1} = σ^{2} (X^{T} X)^{- 1}$

The confidence interval depends on the variance in model predictions:

$var (\overset{y}{^}_{0}) = E [(X_{0} B - E [X_{0} B]) (X_{0} B - E [X_{0} B])^{T}] = X_{0} E [(B - E [B]) (B - E [B])^{T}] X_{0}^{T} = X_{0} cov (B) X_{0}^{T} = σ^{2} X_{0} (X^{T} X)^{- 1} X_{0}^{T}$

This is the standard error used to solve the confidence interval.

The prediction interval is the standard error in the residuals, which is given by:

$var (y_{0} - \overset{y}{^}_{0}) = var (y_{0}) - 2 cov (y_{0}, \overset{y}{^}_{0}) + var (\overset{y}{^}_{0}) = var (ε_{0}) - 2 cov (ε_{0}, \overset{y}{^}_{0}) + var (\overset{y}{^}_{0}) = var (ε_{0}) - 2 X_{0} cov (ε_{0}, B) + var (\overset{y}{^}_{0}) In Linear regression, the variance of the error is independent of the slope, so covariance is 0 = var (ε_{0}) - 0 + var (\overset{y}{^}_{0}) = σ^{2} + σ^{2} X_{0} (X^{T} X)^{- 1} X_{0}^{T} = σ^{2} (1 + X_{0} (X^{T} X)^{- 1} X_{0}^{T})$

Using this we can solve the confidence interval with:

$\overset{x}{ˉ} \pm z_{α} SE$

Of course, we don't know $σ$ so we use t, it can be shown : ¹⁰

$\frac{β - β ^}{σ ^} = \frac{N ( 0 , C ^{- 1} )}{χ _{n - p}^{2} / ( n - p )} \sim t_{n - p}$

which implies

confidence interval:

$Y \in B \pm t_{α, (n - p)} var (\hat{Y})$
prediction interval:

$Y_{i} \in B \pm t_{α, (n - p)} var (Y - \hat{Y})$

Where $C = (X^{T} X)^{- 1}$ is the control matrix.

Footnotes

⁸

https://online.stat.psu.edu/stat501/lesson/3/3.3

https://en.wikipedia.org/wiki/Prediction_interval#Estimation_of_parameters

⁵

Dunnett, C.W. (1955). A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association 50, 1096-1121. cited in ¹¹

predIntNormK in https://search.r-project.org/CRAN/refmans/EnvStats/html/predIntNormK.html cited predIntNorm help page, see generally the manual Millard, Steven P. EnvStats: An R Package for Environmental Statistics. New York: Springer, 2013.

⁴

Rob J Hyndman's blog (Author of ⁷) https://robjhyndman.com/hyndsight/narrow-pi/ cited in https://joaquinamatrodrigo.github.io/skforecast/0.4.3/notebooks/prediction-intervals.html

⁷

Hyndman, Rob J., and George Athanasopoulos. Forecasting: Principles and Practice. Third print edition. Melbourne, Australia: Otexts, Online Open-Access Textbooks, 2021. https://otexts.com/fpp3/#.

⁶

Ch. 6, eq 6.22 Millard, Steven P., and Nagaraj K. Neerchal. Environmental Statistics with S-Plus. Applied Environmental Statistics. Boca Raton: CRC Press, 2001.

TODO, see generally https://en.wikipedia.org/wiki/Prediction_interval#Estimation_of_parameters

¹⁰

p. 89 § 3.7 of Bingham, N. H., and John M. Fry. Regression: Linear Models in Statistics. Springer Undergraduate Mathematics Series. London ; New York: Springer, 2010.

⁹

https://real-statistics.com/multiple-regression/confidence-and-prediction-intervals/confidence-and-prediction-intervals-proofs/

Environmental Informatics (MATH3005)