Confidence intervals



STA 032: Gateway to data science Lecture 20

Jingwei Xiong

May 17, 2023

1 / 33
2 / 33

Recap

  • Confidence intervals:

    • Introduction and interpretation

    • Construction using the Central Limit Theorem: $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

3 / 33

Today

  • More on confidence intervals:

    • Changing n and alpha

    • Simulation example

    • $\sigma^2$ unknown

    • Confidence interval for population proportion

4 / 33

Confidence Interval for Population Mean

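  • Set up: $X_i$ are independent and identically distributed with population mean $\mu$ and variance $\sigma^2$. We are interested in a $100(1-\alpha)\%$ confidence interval for the unknown population parameter $\mu$. We use the sample mean, $\bar{X}$, constructed from a sample of size $n$, as an estimator for the population mean. Assume $n$ is large.

We showed that $P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1-\alpha$

And hence, after collecting data and computing the sample mean, $\bar{x}$, a $100(1-\alpha)\%$ confidence interval for $\mu$ is

$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

This relied on the Central Limit Theorem: when $n$ is large, $Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$ approximately; in other words, $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ approximately.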
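
As a reminder of where the probability statement comes from, here is a sketch of the rearrangement, using only the CLT approximation for $Z$ and the definition of $z_{\alpha/2}$ as the value with upper-tail probability $\alpha/2$ under $N(0,1)$:

```latex
\begin{align*}
1-\alpha
  &\approx P\left(-z_{\alpha/2} \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) \\
  &= P\left(-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{X}-\mu \le z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \\
  &= P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)
\end{align*}
```

The last step subtracts $\bar{X}$ from each part and multiplies by $-1$, which reverses the inequalities.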

5 / 33

Example

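Let $X_1, X_2, \ldots, X_{200}$ be independent $N(\mu, 4)$ random variables. We collect the sample of size 200, and the resulting sample mean is $\bar{x}=24$. What is a 95% confidence interval for $\mu$?

Since $n=200$ is large, by CLT, a 95% confidence interval for $\mu$ is given by $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

Substitute $\bar{x}=24$, $z_{\alpha/2}=1.96$, $\sigma=2$, $n=200$, giving $\left(24 - 1.96\cdot\frac{2}{\sqrt{200}},\ 24 + 1.96\cdot\frac{2}{\sqrt{200}}\right)$, or (23.72, 24.28)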

We are 95% confident that μ falls within the interval (23.72, 24.28).
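
The substitution above can be checked in R. This is a minimal sketch; `zInterval` is a hypothetical helper name introduced here for illustration, not something defined in the lecture.

```r
# Large-sample z interval for a mean when sigma is known.
# zInterval is a hypothetical helper, not part of the lecture code.
zInterval <- function(xBar, sigma, n, level = 0.95) {
  alpha <- 1 - level
  halfWidth <- qnorm(1 - alpha / 2) * sigma / sqrt(n)
  c(lower = xBar - halfWidth, upper = xBar + halfWidth)
}

# Values from the example: xBar = 24, sigma = 2 (variance 4), n = 200
ci <- zInterval(xBar = 24, sigma = 2, n = 200)
round(ci, 2)
```

Rounded to two decimals, this should reproduce the interval (23.72, 24.28).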

6 / 33

Confidence Interval Width: changing n

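$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

For a given $z_{\alpha/2}$, confidence intervals that are narrower indicate greater certainty in estimated values. We can get narrower intervals by increasing $n$, the sample size.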

7 / 33

Earlier example

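We had $n=200$, $\bar{x}=24$, $\sigma=2$, and our 95% confidence interval for $\mu$ was (23.72, 24.28).

What about when $n=2000$?

$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

Substitute $\bar{x}=24$, $z_{\alpha/2}=1.96$, $\sigma=2$, $n=2000$, giving $\left(24 - 1.96\cdot\frac{2}{\sqrt{2000}},\ 24 + 1.96\cdot\frac{2}{\sqrt{2000}}\right)$, or (23.91, 24.09)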
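
To see how the half-width scales with the sample size, the sketch below evaluates $z_{\alpha/2}\,\sigma/\sqrt{n}$ for a few values of $n$. The value $n = 20000$ is an extra illustration, not from the lecture.

```r
sigma <- 2
n <- c(200, 2000, 20000)   # 20000 is added here for illustration only

# Half-width of the 95% interval for each sample size
halfWidth <- qnorm(0.975) * sigma / sqrt(n)
data.frame(n = n, halfWidth = round(halfWidth, 3))

# Multiplying n by 10 shrinks the half-width by a factor of sqrt(10)
halfWidth[1] / halfWidth[2]
```

Because the width depends on $1/\sqrt{n}$, a tenfold increase in $n$ narrows the interval by a factor of only about 3.16.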

8 / 33

Confidence Interval Width: changing alpha

While 95% confidence intervals are the most common, it is simple to generate other intervals, for example 99% intervals, by replacing the critical value. E.g., for a 99% interval, we need the z-score that cuts off the upper 0.005 of the distribution, which is 2.58.

qnorm(.995)
[1] 2.575829

In our earlier example, we had $n=200$, $\bar{x}=24$, $\sigma=2$, and our 95% confidence interval for $\mu$ was (23.72, 24.28). When $\alpha=.01$, a 99% confidence interval is $\left(24 - 2.58\cdot\frac{2}{\sqrt{200}},\ 24 + 2.58\cdot\frac{2}{\sqrt{200}}\right)$, or (23.64, 24.36).

9 / 33

Confidence Interval Width: changing alpha

In a recent study, 50 randomly selected statistics students were asked how many hours per week they spend studying for their statistics classes. The results were used to estimate the mean time for all statistics students with 90%, 95% and 99% confidence intervals. These were (not necessarily in the same order): (7.5,8.5)   (7.6,8.4)   (7.7,8.3).

Which interval is which?
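
One way to reason about this: all three intervals come from the same sample, so they share the same center and the same standard error, and only the critical value differs. A small sketch, using normal critical values as an assumption (with $n = 50$ a t critical value could also be used, and the ordering would be the same):

```r
level <- c(0.90, 0.95, 0.99)

# Critical value z_{alpha/2} for each confidence level
zCrit <- qnorm(1 - (1 - level) / 2)
data.frame(level = level, zCrit = round(zCrit, 3))
```

The critical value grows with the confidence level, so the narrowest interval corresponds to 90% and the widest to 99%.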

10 / 33

Simulation example

Let $X$ be normally distributed with mean $\mu=3$ and variance $\sigma^2=25$, i.e., $X \sim N(3, 5^2)$

myDraws <- rnorm(1000, mean = 3, sd = 5)
xBar <- mean(myDraws)
n <- length(myDraws)
halfWidth <- qnorm(.975)*5/sqrt(n)
(lower <- xBar - halfWidth)
[1] 2.430562
(upper <- xBar + halfWidth)
[1] 3.050357

Hence a 95% confidence interval for μ is (2.4305619, 3.0503569).

11 / 33

Simulation example: 5000 confidence intervals

set.seed(0)
CIsDF <- data.frame(lower = rep(NA, 5000), upper = rep(NA, 5000))
for (i in 1:nrow(CIsDF)) {
myDraws <- rnorm(1000, mean = 3, sd = 5)
xBar <- mean(myDraws)
n <- length(myDraws)
halfWidth <- qnorm(.975)*5/sqrt(n)
CIsDF[i, "lower"] <- xBar - halfWidth
CIsDF[i, "upper"] <- xBar + halfWidth
}
12 / 33

Simulation example: 5000 confidence intervals

How many of the 5000 CIs do we expect to include the true population mean, $\mu=3$?

  • Recall interpretation of confidence intervals: If we were to repeat this procedure a large number of times, sampling and constructing confidence intervals in the same way, 95% of constructed intervals would contain the true population parameter.

  • If we repeat the experiment 5,000 times, i.e., draw samples and construct 5,000 confidence intervals, we would expect 4,750 of these to contain the true population parameter

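library(dplyr)    # provides %>%, mutate(), slice()
library(ggplot2)  # provides ggplot(), geom_pointrange(), geom_vline(), labs()
# Plot the first 20 intervals; the dashed line marks the true mean, 3
CIsDF %>%
  mutate(index = 1:nrow(CIsDF),
         estimate = (lower + upper)/2) %>%
  slice(1:20) %>%
  ggplot(aes(estimate, index)) +
  geom_pointrange(aes(xmin = lower, xmax = upper)) +
  geom_vline(xintercept = 3, colour = "grey60", linetype = 2) +
  labs(title = "First 20 CIs for population mean",
       x = "Estimate",
       y = "Index")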

13 / 33

Simulation example: 5000 confidence intervals

sum(CIsDF$lower <= 3 & CIsDF$upper >= 3)
[1] 4769
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3) / nrow(CIsDF)
[1] 0.9538
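
The observed count of 4769 is not exactly 4750, which is expected: each interval independently covers $\mu$ with probability 0.95, so the number of covering intervals behaves like a Binomial(5000, 0.95) random variable. A short sketch of how much run-to-run variation that implies:

```r
nIntervals <- 5000
coverage <- 0.95

# Mean and standard deviation of the number of intervals that cover mu
expectedCount <- nIntervals * coverage
sdCount <- sqrt(nIntervals * coverage * (1 - coverage))
c(expected = expectedCount, sd = round(sdCount, 1))

# Central 95% range of the count under the binomial model
qbinom(c(0.025, 0.975), size = nIntervals, prob = coverage)
```

The standard deviation is about 15, so a count of 4769 is a little more than one standard deviation above the expected 4750.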
14 / 33

Simulation example: 5000 confidence intervals, 90% level of confidence

set.seed(0)
CIsDF <- data.frame(lower = rep(NA, 5000), upper = rep(NA, 5000))
for (i in 1:nrow(CIsDF)) {
myDraws <- rnorm(1000, mean = 3, sd = 5)
xBar <- mean(myDraws)
n <- length(myDraws)
halfWidth <- qnorm(.95)*5/sqrt(n)
CIsDF[i, "lower"] <- xBar - halfWidth
CIsDF[i, "upper"] <- xBar + halfWidth
}
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3)
[1] 4542
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3) / nrow(CIsDF)
[1] 0.9084
15 / 33

What happens if the population variance is unknown?

  • The confidence interval involves $\sigma^2$; most of the time, this is unknown

  • Recall that we can use the sample variance, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, to estimate $\sigma^2$

  • Before collecting the data: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$

  • If we estimate $\sigma^2$ using $S^2$, then we can't use the Central Limit Theorem in the same way

  • Two ways we can make progress:

    1. When n is large
    2. When n is small
16 / 33

First case (n large)

  • If $n$ is large, CLT still holds (for the $\sigma^2$ version), and using another theorem (out of scope for this class), we can prove that $\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim N(0,1)$ approximately

  • Notice that the only difference is that we have replaced $\sigma$ by $S$.

  • A $100(1-\alpha)\%$ confidence interval for $\mu$ is $\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$

  • Just like before, it doesn't matter what the distribution of $X_i$ is

17 / 33

Previous example: σ known

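Let $X_1, X_2, \ldots, X_{200}$ be independent $N(\mu, 4)$ random variables. We collect the sample of size 200, and the resulting sample mean is $\bar{x}=24$. What is a 95% confidence interval for $\mu$?

Since $n=200$ is large, by CLT, a 95% confidence interval for $\mu$ is given by $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

Substitute $\bar{x}=24$, $z_{\alpha/2}=1.96$, $\sigma=2$, $n=200$, giving $\left(24 - 1.96\cdot\frac{2}{\sqrt{200}},\ 24 + 1.96\cdot\frac{2}{\sqrt{200}}\right)$, or (23.72, 24.28)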

We are 95% confident that μ falls within the interval (23.72, 24.28).

18 / 33

Previous example: σ unknown

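Let $X_1, X_2, \ldots, X_{200}$ be independent $N(\mu, \sigma^2)$ random variables. We collect the sample of size 200, and the resulting sample mean is $\bar{x}=24$. $\sigma^2$ is unknown, but using our sample, we calculate the sample variance, $s^2$, to be 4.1. What is a 95% confidence interval for $\mu$?

Since $n=200$ is large, a 95% confidence interval for $\mu$ is given by $\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$

Substitute $\bar{x}=24$, $z_{\alpha/2}=1.96$, $s=\sqrt{4.1}\approx 2.02$, $n=200$, giving $\left(24 - 1.96\cdot\frac{\sqrt{4.1}}{\sqrt{200}},\ 24 + 1.96\cdot\frac{\sqrt{4.1}}{\sqrt{200}}\right)$, or (23.72, 24.28)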

We are 95% confident that μ falls within the interval (23.72, 24.28).
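
A quick check of this calculation in R, as a sketch. Note that the standard error uses the sample standard deviation $s = \sqrt{s^2}$, so the sample variance 4.1 goes under the square root.

```r
xBar <- 24
s2 <- 4.1    # sample variance from the example
n <- 200

# Half-width uses s / sqrt(n) = sqrt(s2 / n)
halfWidth <- qnorm(0.975) * sqrt(s2 / n)
ci <- c(lower = xBar - halfWidth, upper = xBar + halfWidth)
round(ci, 2)
```

Because $s^2 = 4.1$ is close to the earlier $\sigma^2 = 4$, the interval agrees with the known-variance interval to two decimals.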

19 / 33

Second case (n small)

  • If $n$ is small, we cannot use CLT, so
    $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ is not approximately $N(0,1)$

  • Instead, $\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}$ when $X_i$ are independent and $N(\mu, \sigma^2)$

  • In other words, the underlying distribution of $X_i$ is now restricted to normal

  • A $100(1-\alpha)\%$ confidence interval for $\mu$ is $\left(\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right)$

  • The notation $t_{n-1,\alpha/2}$ means the quantile corresponding to a probability of $\alpha/2$ in the right tail, for a $t_{n-1}$ distribution. We can use qt(alpha/2, df = n-1, lower.tail = FALSE), e.g.,

qt(.025, df = 19, lower.tail = FALSE)
[1] 2.093024
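
Putting the pieces together, here is a sketch of the small-sample interval computed by hand. The data are simulated purely for illustration (a made-up normal sample of size 20, so df = 19 as in the `qt` call above); they are not from the lecture.

```r
# Hypothetical small sample: 20 draws from a normal distribution
set.seed(1)
x <- rnorm(20, mean = 10, sd = 2)

n <- length(x)
xBar <- mean(x)
s <- sd(x)                                          # sample standard deviation
tCrit <- qt(0.025, df = n - 1, lower.tail = FALSE)  # t_{n-1, alpha/2}
halfWidth <- tCrit * s / sqrt(n)
c(lower = xBar - halfWidth, upper = xBar + halfWidth)

# The same interval, as computed by R's built-in t.test()
t.test(x)$conf.int
```

The two results should agree, since `t.test()` reports this t-based interval by default at the 95% level.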
20 / 33

Student's t distribution

  • While working for Guinness Brewery in Dublin, William Sealy Gosset published a paper on the t distribution, which became known as Student's t distribution

  • He used the new distribution to determine how large a sample of persons to use in taste-testing beer

  • Guinness worried competitors would steal their secret if he published under his own name, so Gosset published under the pseudonym "Student"

21 / 33

Student's t distribution

  • The t distribution looks like the normal distribution, except that it has fatter tails

  • The t distribution has a parameter called degrees of freedom, abbreviated df. For each possible df, there is a different t distribution, with the t distribution looking more like the normal as df gets large.

  • As the sample size gets bigger, the t distribution approximates the normal distribution
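
To illustrate the convergence to the normal, the sketch below compares the upper 2.5% critical value of the t distribution with the normal one for several degrees of freedom. The particular df values are arbitrary choices for illustration.

```r
df <- c(2, 5, 19, 100, 1000)

# Upper 2.5% critical values: t for each df, and the normal value for comparison
tCrit <- qt(0.975, df = df)
zCrit <- qnorm(0.975)
data.frame(df = df, tCrit = round(tCrit, 3), zCrit = round(zCrit, 3))
```

The t critical values are always larger than the normal one (a consequence of the fatter tails) and decrease toward it as df grows.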

22 / 33

Student's t distribution

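  • The random variable $T = \frac{\bar{X}-\mu}{S/\sqrt{n}}$ has a Student's t distribution with $n-1$ degrees of freedom, represented using the notation $t_{n-1}$.

    • Recall that here $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$, where $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ is the sample mean
  • In other words, when we replace $\sigma$ by $S$, the distribution is now t and not standard normal

  • A $100(1-\alpha)\%$ confidence interval for $\mu$ is $\left(\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right)$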

23 / 33

Some intuition

  • A $100(1-\alpha)\%$ confidence interval for $\mu$ is $\left(\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right)$
qnorm(.975)
[1] 1.959964
qt(.975, df = 19)
[1] 2.093024
  • The t distribution is used to construct confidence intervals for the mean when we need to account for the additional variability due to estimating σ in addition to μ

  • The fatter tails lead to wider confidence intervals; this makes sense, since there is extra uncertainty due to the estimation of $\sigma^2$

  • Very roughly speaking, the degrees of freedom measure the amount of information available in the data to estimate $\sigma^2$, and thus give us information about how reliable our estimate $s^2$ is.

24 / 33

Confidence Intervals with σ Unknown

Case 1: When n is large

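  • It doesn't matter what the distribution of $X_i$ is, as long as it has mean $\mu$ and variance $\sigma^2$
  • $\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim N(0,1)$ approximately
  • We can use the interval $\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$

Case 2: When n is small

  • When $X_i$ is normally distributed with mean $\mu$ and variance $\sigma^2$, $T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}$
  • We can use the interval $\left(\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right)$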
25 / 33

Example: Lead in Flint, MI

From April 25, 2014 to October 15, 2015, the water supply source for Flint, MI was switched to the Flint River from the Detroit water system. Without corrosion inhibitors, the Flint River water, which is high in chloride, caused lead from aging pipes to leach into the water supply. We have data from Flint collected as part of a citizen-science project involving Virginia Tech researchers.

26 / 33

Example: Lead in Flint, MI

flint <- readxl::read_excel("./data/Flint-Samples.xlsx", sheet = 1) %>%
rename("Pb_initial"="Pb Bottle 1 (ppb) - First Draw")
str(flint)
tibble [271 x 7] (S3: tbl_df/tbl/data.frame)
$ SampleID : num [1:271] 1 2 4 5 6 7 8 9 12 13 ...
$ Zip Code : num [1:271] 48504 48507 48504 48507 48505 ...
$ Ward : num [1:271] 6 9 1 8 3 9 9 5 9 3 ...
$ Pb_initial : num [1:271] 0.344 8.133 1.111 8.007 1.951 ...
$ Pb Bottle 2 (ppb) - 45 secs flushing: num [1:271] 0.226 10.77 0.11 7.446 0.048 ...
$ Pb Bottle 3 (ppb) - 2 mins flushing : num [1:271] 0.145 2.761 0.123 3.384 0.035 ...
$ Notes : chr [1:271] NA NA NA NA ...
  • Each row is a sample and the lead level in the sample is in the Pb_initial column. The units are parts per billion (ppb)

  • We want to construct a confidence interval for the unknown population mean lead level in Flint water

  • What is the sample size n? What confidence interval should we use?

27 / 33

Calculating a Confidence Interval for Flint Lead

  • Here there are 271 samples, so n qualifies as large

  • For a 95% confidence interval: $\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$

    xBar <- mean(flint$Pb_initial)
    s2 <- var(flint$Pb_initial)   # sample variance, used in place of the unknown sigma^2
    n <- length(flint$Pb_initial)
    halfWidth <- qnorm(.975)*sqrt(s2/n)
    (lower <- xBar - halfWidth)
    [1] 8.078981
    (upper <- xBar + halfWidth)
    [1] 13.213
    Hence a 95% confidence interval for the mean lead level in water in Flint is (8.08, 13.21) ppb.
28 / 33

Using the t-distribution

t.test(flint$Pb_initial)
One Sample t-test
data: flint$Pb_initial
t = 8.1284, df = 270, p-value = 1.58e-14
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
8.067422 13.224563
sample estimates:
mean of x
10.64599
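
The t-based interval is only slightly wider than the z-based one here because df = 270 is large. As a rough check, the sketch below rebuilds both intervals from the numbers printed in the `t.test` output; the standard error is recovered from the reported t statistic, so the results are accurate only up to the rounding in that output.

```r
xBar <- 10.64599     # sample mean from the t.test output
tStat <- 8.1284      # t statistic for H0: mu = 0
df <- 270

# Since t = (xBar - 0) / se, the standard error is xBar / t
se <- xBar / tStat

tHalf <- qt(0.975, df) * se    # half-width using the t critical value
zHalf <- qnorm(0.975) * se     # half-width using the normal critical value
rbind(t = c(lower = xBar - tHalf, upper = xBar + tHalf),
      z = c(lower = xBar - zHalf, upper = xBar + zHalf))
```

Up to rounding, the first row should match the `t.test` interval and the second row the interval computed earlier with `qnorm`.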
29 / 33

Confidence interval for population proportion

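  • Recall that when $X$ is a Bernoulli random variable, the sample mean $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ is the same as the sample proportion $\hat{P}$.

  • We have $\bar{X} = \hat{P} \sim N\left(p, \frac{p(1-p)}{n}\right)$ approximately, by the Central Limit Theorem, when $n$ is large

  • $p$ is unknown, but we can replace it by $\hat{p}$, calculated from the sample, in the same way we replaced $\sigma$ by $s$ for the population mean (case 1, when $n$ is large), so a $100(1-\alpha)\%$ confidence interval for $p$ is

$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$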
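
A minimal sketch of this interval as an R function. The name `propInterval` and the counts used in the call (60 successes out of 150) are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Large-sample interval for a proportion, plugging pHat in for p.
# propInterval is a hypothetical helper, not part of the lecture code.
propInterval <- function(successes, n, level = 0.95) {
  pHat <- successes / n
  alpha <- 1 - level
  halfWidth <- qnorm(1 - alpha / 2) * sqrt(pHat * (1 - pHat) / n)
  c(estimate = pHat, lower = pHat - halfWidth, upper = pHat + halfWidth)
}

# Made-up data: 60 successes in 150 trials, so pHat = 0.4
ci <- propInterval(successes = 60, n = 150)
round(ci, 3)
```

Here the estimated standard error is $\sqrt{0.4 \cdot 0.6 / 150} = 0.04$, so the half-width is about $1.96 \times 0.04 \approx 0.078$.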

30 / 33

Example: Approval ratings

  • Margin of error = $z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$; the calculation below uses $\hat{p} = 0.5$ and $n = 1500$
qnorm(.975)*sqrt(.5^2/1500)
[1] 0.02530303
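
The value $\hat{p} = 0.5$ gives the largest possible margin of error for a given $n$, because $\hat{p}(1-\hat{p})$ is maximized at 0.5. A short sketch, keeping $n = 1500$ and using an arbitrary grid of $\hat{p}$ values for illustration:

```r
n <- 1500
pHat <- seq(0.1, 0.9, by = 0.1)   # illustrative grid of sample proportions

# 95% margin of error for each value of pHat
moe <- qnorm(0.975) * sqrt(pHat * (1 - pHat) / n)
data.frame(pHat = pHat, marginOfError = round(moe, 4))
```

The margin of error peaks at $\hat{p} = 0.5$, at roughly 2.5 percentage points, and is smaller for proportions closer to 0 or 1.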
31 / 33

Example: Flint

The EPA action level for lead in public water supplies is 15 ppb. Let's calculate the estimated proportion of samples (homes) in Flint with lead levels over 15 ppb along with a 95% confidence interval for the proportion.

flint <- flint %>%
  mutate(Pbover15 = Pb_initial > 15)
pHat <- mean(flint$Pbover15)
# named varPHat rather than var, to avoid masking base R's var()
varPHat <- pHat*(1 - pHat)/length(flint$Pbover15)
halfWidth <- qnorm(.975)*sqrt(varPHat)
(lower <- pHat - halfWidth)
[1] 0.1217465
(upper <- pHat + halfWidth)
[1] 0.2103569

Hence a 95% confidence interval for the proportion of homes in Flint with lead level above 15 ppb is (0.1217465, 0.2103569).

32 / 33

Summary

  • More on confidence intervals:

    • Changing n and alpha

    • Simulation example

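    • $\sigma^2$ unknown:

      • n large: $\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$
      • n small and $X_i$ normal: $\left(\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right)$
    • Confidence interval for population proportion: $\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$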

33 / 33