Confidence Intervals

Summary

A confidence interval is a range of values, derived from a sample, within which the true population parameter is expected to lie. More precisely, the probability that a statistic taken from a random sample does not fall within a specified confidence interval is denoted by $\alpha$. In simpler terms, a confidence interval drawn from a sample will capture the population statistic with probability $1 - \alpha$. So if the confidence intervals were at the 95% level, then out of 20 samples we would expect 19 of the confidence intervals to contain the population mean $\mu$.

Take, for instance, a large yet finite population of fish. The average length of the fish in this population, while definite, cannot be determined without measuring every single fish. If we were to draw 1000 samples and construct a 90% confidence interval from each, we'd expect roughly 900 of those intervals to capture the population mean.

It is important to be careful with how we frame these probabilities. For example, if the data's 90% confidence interval ranged from 1 to 10, it would be incorrect to say that there's a 90% probability of the population mean falling within that range. This is because the population mean is a distinct, knowable value; it is not probabilistic. The probability lies in the process of sampling. So a more accurate statement would be: "if the population mean doesn't lie between 1 and 10, the probability of drawing such a sample was 10%".

The probability of a value falling within an interval is given by the area under the curve of the corresponding distribution, or PDF [1]. Confidence intervals are no different; they simply correspond to a different distribution, the sampling distribution. Revisiting the fish population experiment: if we extract a sample of these fish and compute the sample mean, the obtained value provides a good estimate of the population mean. If multiple samples are taken and their respective means calculated, a distribution of these sample means can be developed. The area under the curve of this distribution corresponds to the probability of observing such a sample mean; this interval is a confidence interval.
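
As a minimal sketch of this idea (the fish lengths below are simulated, so the population parameters are made up):

## Simulate a hypothetical fish population.
set.seed(1)
fish_lengths <- rnorm(10^4, mean = 50, sd = 8)

## Each entry is the mean of one sample of 30 fish.
sample_means <- replicate(10^3, mean(sample(fish_lengths, 30)))

## The area under this histogram between two points approximates the
## probability of observing a sample mean in that range.
hist(sample_means)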

Probability Distribution

Derivation

The Central Limit Theorem states that if the sample size ($n$) is large enough, typically $n > 30$ [2], the distribution of sample means will approach a normal distribution. Since the sample mean is known to follow this distribution, the confidence interval represents a specific range within this distribution, capturing a predefined proportion of random events.
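
A quick visual check of this, using a deliberately skewed, simulated population:

## Means of samples of size 30 drawn from a skewed population still
## look approximately normal, as the Central Limit Theorem predicts.
set.seed(1)
skewed_population <- rexp(10^4)
sample_means <- replicate(10^3, mean(sample(skewed_population, 30)))

qqnorm(sample_means)
qqline(sample_means)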

Often $\sigma$ is an unknown value, much like $\mu$. We've seen that replacing $\sigma$ with the sample standard deviation $s$ results in a $t$-distribution (see t-distribution), so confidence intervals will almost always be in terms of $t$ (unless the population variance is known, which is not common).

To derive the interval, re-arrange the $t$-score (or $z$-score, when $\sigma$ is known) so that it bounds the population mean:

$$-t_{\alpha/2} \le \frac{\bar{x} - \mu}{s / \sqrt{n}} \le t_{\alpha/2}$$

$$\bar{x} - t_{\alpha/2} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2} \frac{s}{\sqrt{n}}$$

Examples

Theoretical: Population Mean From Sample Mean

Consider the following population; we can calculate a confidence interval for its mean using $t$-scores:

## Construct a population
population <- c(
                rep(0, 1000),
                rep(2, 1000),
                rep(30, 1000))

## Take a sample
n    = 30
samp = sample(population, n)
xbar = mean(samp)
s = sd(samp)

alpha = 0.05  
t = abs(qt(alpha / 2, df=n-1))

## Compute the confidence limits
lower = xbar - t * (s / sqrt(n))
upper = xbar + t * (s / sqrt(n))
lower
upper


[1] 5.380967
[1] 15.81903

Note, the t.test function automatically computes a confidence interval:

## Conduct t-test and get confidence interval.
t_result = t.test(samp, conf.level = 0.95)
t_result$conf.int
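
This should agree, up to printing precision, with the limits computed by hand above:

c(lower, upper)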

If we repeat this many times, we can show that the proportion of intervals that capture the population mean equals the confidence level:

## Construct a population
population <- c(
                rep(0, 1000),
                rep(2, 1000),
                rep(30, 1000))

captured <- replicate(10^3, {
    ## Take a sample
    n    = 30
    samp = sample(population, n)
    xbar = mean(samp)
    s = sd(samp)

    alpha = 0.05
    t = abs(qt(alpha / 2, df=n-1))

    ## Compute the confidence limits
    lower = xbar - t * (s / sqrt(n))
    upper = xbar + t * (s / sqrt(n))

    ## Did the interval capture the population mean?
    lower < mean(population) && mean(population) < upper
})

## Proportion of intervals that captured the population mean
mean(captured)
[1] 0.943

Bootstrapping: Confidence Intervals for Population Correlation

When a theoretical distribution for the sampling distribution is not known, it is possible to bootstrap a confidence interval. For example, consider the cars data:

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
...

Let's assume that this R dataset is a population in and of itself. We'll take a sample and use it to construct a confidence interval for the population correlation. Unlike before, the formula for the confidence interval of a correlation is not something most people have memorized; in this case we can bootstrap one.

The idea here is to assume that the population is an infinite repetition of the sample (achieved by sampling with replacement). We take samples from this bootstrapped population and use the quantiles of the resulting pseudo-sampling-distribution as the confidence interval.

## Construct a population
population <- cars
head(cars)
n        = 30
rho <- cor(population$speed, population$dist)

take_sample <- function(p) p[sample(seq_len(nrow(p)), n), ]

## Take a sample
samp     = take_sample(population)
rho_hat  = cor(samp$speed, samp$dist)

## Write a function to produce a sampling distribution
make_sampling_dist <- function(samp) {
    ## Bootstrap a sampling distribution from the sample
    replicate(10^3, {
        ## Take a sample with replacement (IMPORTANT)
        boot_samp <- samp[sample(n, replace = TRUE), ]

        ## Calculate the sample statistic
        cor(boot_samp$speed, boot_samp$dist)
    })
}

get_conf_interval <- function(samp, conf_level) {
    lower = (1 - conf_level) / 2
    upper = 1 - lower

    quantile(make_sampling_dist(samp), c(lower, upper))
}

## Calculate a 90% confidence interval
get_conf_interval(samp, 0.9)
       5%       95% 
0.6028544 0.8856464 

Just like before, we can use this confidence-interval function to show that the proportion of intervals capturing the population correlation matches the confidence level.

Reusing population, take_sample and get_conf_interval from above, we redefine make_sampling_dist with fewer bootstrap replicates to keep the simulation fast:

## Produce a sampling distribution (500 bootstrap replicates for speed)
make_sampling_dist <- function(samp) {
    replicate(500, {
        ## Take a sample with replacement (IMPORTANT)
        boot_samp <- samp[sample(n, replace = TRUE), ]

        ## Calculate the sample statistic
        cor(boot_samp$speed, boot_samp$dist)
    })
}


captured <- replicate(100, {
    ## Get a confidence interval for a new sample
    int <- get_conf_interval(take_sample(population), 0.8)

    ## Did that confidence interval capture the population statistic?
    int[1] < rho && rho < int[2]
})

## Proportion of intervals that captured the population correlation
mean(captured)
[1] 0.88

Unlike before, this value does not align with the confidence level. That's because bootstrap methods are more likely to produce biased estimates when the sampling distribution is not normally distributed. This sampling distribution is right-skewed, which can be inspected with:

hist(make_sampling_dist(samp))

There exist methods to overcome this limitation [3].
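
As a minimal sketch of one such adjustment, here is the classical Fisher interval (a direct formula rather than a bootstrap), using rho_hat and n from above:

## Fisher's z-transform maps the correlation onto an approximately
## normal scale, where a standard interval can be built, then the
## limits are mapped back with tanh.
z_hat  <- atanh(rho_hat)     # correlation on the z scale
se     <- 1 / sqrt(n - 3)    # approximate standard error on that scale
z_crit <- qnorm(0.95)        # critical value for a 90% interval

tanh(c(z_hat - z_crit * se, z_hat + z_crit * se))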

Linear Regression

Confidence Intervals of Linear Models

A linear model is premised on the assumption that the data follows a Gaussian distribution, characterized by a constant variance, while the mean is a linear function of the predictors:

$$y \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,\ \sigma^2)$$

The difference from the confidence intervals we typically use is that here the mean is a function rather than a single value. This interpretation also helps to better understand the underlying assumptions of linear regression: the constant variance and Gaussian residuals aren't so much assumptions of the model but rather integral characteristics of the model itself.
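
A minimal generative sketch of this view (the coefficients and noise level below are made up):

## Data generated exactly as the model describes: Gaussian noise with
## constant variance around a linear mean.
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 1.5)

plot(x, y)
abline(2, 3)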

Note

The confidence interval of a linear model gives an interval through which we would expect different models to pass. These models would correspond to different samples that we could have taken from the population, as sketched below.
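
A quick sketch of this idea, using a single-predictor model (mpg ~ wt from mtcars) so the band can be drawn:

## Fit the model and compute its confidence band over a grid.
mod  <- lm(mpg ~ wt, data = mtcars)
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
band <- predict(mod, newdata = grid, interval = "confidence")

plot(mpg ~ wt, data = mtcars)
lines(grid$wt, band[, "lwr"], lty = 2)
lines(grid$wt, band[, "upr"], lty = 2)

## Lines fit to resampled data mostly fall within the band.
for (i in 1:30) {
    resampled <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
    abline(lm(mpg ~ wt, data = resampled), col = "grey")
}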

Examples

Theoretical: Population Line From Sample Line

Consider the mtcars dataset:

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

A multiple linear regression can be fit and a confidence interval drawn for the parameters like so:

# Fit a multiple linear regression
f <- mpg ~ cyl + hp + wt
mod <- lm(f, data=mtcars)

# Get the confidence intervals for the linear model
confint(mod)
                 2.5 %       97.5 %
(Intercept) 35.0915623 42.412012412
cyl         -2.0701179  0.186884238
hp          -0.0423655  0.006289293
wt          -4.6839740 -1.649972191

To compute a confidence interval for the predicted values:

# Predict values and add confidence intervals
predict(mod, interval='confidence') |>
  head()
                         fit      lwr      upr
Mazda RX4         22.82043 21.65069 23.99016
Mazda RX4 Wag     22.01285 20.98250 23.04319
Datsun 710        25.96040 24.71014 27.21066
Hornet 4 Drive    20.93608 19.94927 21.92289
Hornet Sportabout 17.16780 15.75195 18.58365
Valiant           20.25036 19.12109 21.37962

Bootstrap

Values

# Bootstrap the linear model and compute predictions at a fixed point
bootstrap_estimates <- replicate(10^3, {
    ## Sample the data with replacement
    boot_data <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
    ## Fit the linear model
    mod <- lm(mpg ~ cyl + hp + wt, data = boot_data)
    ## Return the prediction at a fixed point
    predict(mod, newdata = data.frame(cyl = 4, hp = 100, wt = 2.6))
})

conf_level = 0.9
lower = (1 - conf_level) / 2
upper = 1 - lower
quantile(bootstrap_estimates, c(lower, upper))
      5%      95% 
23.43200 26.48408 
Parameters

This has been omitted; the focus here is contrasting with prediction intervals. To see an example of bootstrapping the parameters, refer to the index.

Derivation

Deriving confidence intervals for linear regression follows the same concept as before, but I won't show it here. See generally the first 4 chapters of [4].

Footnotes

[1] PDF: Probability Density Function

[2] It's worth noting that extremely non-symmetric distributions will require larger sample sizes; 30 is sufficient for symmetric data, though.

[3] If we wanted, we could adjust it with Fisher's z-transformation, which would help reduce the bias. This is shown in the Appendix: Fisher's z-transform. More generally, there are also bias-corrected and accelerated (BCa) bootstraps, a method for dealing with bootstraps of any skewed sampling distribution.

[4] Bingham, N. H., and John M. Fry. Regression: Linear Models in Statistics. Springer Undergraduate Mathematics Series. London; New York: Springer, 2010.