Confidence intervals:
Introduction and interpretation
Construction using Central Limit Theorem: $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
More on confidence intervals:
Changing n and alpha
Simulation example
$\sigma^2$ unknown
Confidence interval for population proportion
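We showed that $P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$
Hence, after collecting data and computing the sample mean $\bar{x}$, a $100(1-\alpha)\%$ confidence interval for $\mu$ is
$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
This relied on the Central Limit Theorem: when $n$ is large, $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1)$; in other words, $\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right)$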
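Let $X_1, X_2, \ldots, X_{200}$ be independent $N(\mu, 4)$ random variables. We collect a sample of size 200, and the resulting sample mean is $\bar{x} = 24$. What is a 95% confidence interval for $\mu$?
Since $n = 200$ is large, by the CLT, a 95% confidence interval for $\mu$ is given by $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
Substitute $\bar{x} = 24$, $z_{\alpha/2} = 1.96$, $\sigma = 2$, $n = 200$, giving $\left(24 - 1.96 \cdot 2/\sqrt{200},\ 24 + 1.96 \cdot 2/\sqrt{200}\right)$, or (23.72, 24.28)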
We are 95% confident that μ falls within the interval (23.72, 24.28).
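The substitution above can be checked numerically. The following is a minimal sketch in R that uses only the summary values from the example; the helper name zInterval is hypothetical and not part of the slides.

```r
# Hypothetical helper: z-based confidence interval for a mean when sigma is known.
zInterval <- function(xBar, sigma, n, confLevel = 0.95) {
  alpha <- 1 - confLevel
  zCrit <- qnorm(1 - alpha / 2)          # critical value z_{alpha/2}
  halfWidth <- zCrit * sigma / sqrt(n)   # half-width of the interval
  c(lower = xBar - halfWidth, upper = xBar + halfWidth)
}

# Values from the example: xBar = 24, sigma = 2, n = 200
ci <- zInterval(xBar = 24, sigma = 2, n = 200)
round(ci, 2)
```

Rounded to two decimals, this reproduces the interval (23.72, 24.28); the only difference from the hand calculation is that qnorm(.975) is used instead of the rounded value 1.96.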
$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
For a given $z_{\alpha/2}$, narrower confidence intervals indicate greater certainty in the estimated value. We can get narrower intervals by increasing $n$, the sample size.
We had $n = 200$, $\bar{x} = 24$, $\sigma = 2$, and our 95% confidence interval for $\mu$ was (23.72, 24.28).
What about when n=2000?
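$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
Substitute $\bar{x} = 24$, $z_{\alpha/2} = 1.96$, $\sigma = 2$, $n = 2000$, giving $\left(24 - 1.96 \cdot 2/\sqrt{2000},\ 24 + 1.96 \cdot 2/\sqrt{2000}\right)$, or (23.91, 24.09)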
While 95% confidence intervals are the most common, it is simple to generate other intervals, for example 99% intervals, by replacing the critical value. E.g., for a 99% interval, we need the z-score that cuts off the upper 0.005 of the distribution, which is 2.58.
qnorm(.995)
[1] 2.575829
In our earlier example, we had $n = 200$, $\bar{x} = 24$, $\sigma = 2$, and our 95% confidence interval for $\mu$ was (23.72, 24.28). When $\alpha = .01$, a 99% confidence interval is $\left(24 - 2.58 \cdot 2/\sqrt{200},\ 24 + 2.58 \cdot 2/\sqrt{200}\right)$, or (23.64, 24.36).
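Both effects, changing n and changing the confidence level, act only through the half-width $z_{\alpha/2}\,\sigma/\sqrt{n}$. The sketch below tabulates that half-width for the running example with sigma = 2; the helper name halfWidthZ is hypothetical, and the 90% level is added here only for comparison.

```r
# Hypothetical helper: half-width of a z-based interval with known sigma.
halfWidthZ <- function(sigma, n, confLevel) {
  qnorm(1 - (1 - confLevel) / 2) * sigma / sqrt(n)
}

# Vary the sample size at 95% confidence
byN <- sapply(c(200, 2000), function(n) halfWidthZ(2, n, 0.95))

# Vary the confidence level at n = 200
byLevel <- sapply(c(0.90, 0.95, 0.99), function(cl) halfWidthZ(2, 200, cl))

round(byN, 2)      # half-widths for n = 200 and n = 2000
round(byLevel, 2)  # half-widths for 90%, 95% and 99% confidence
```

A larger n shrinks the half-width, while a higher confidence level widens it, because the critical value grows.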
In a recent study, 50 randomly selected statistics students were asked how many hours per week they spend studying for their statistics classes. The results were used to estimate the mean time for all statistics students with 90%, 95% and 99% confidence intervals. These were (not necessarily in the same order): (7.5, 8.5), (7.6, 8.4), (7.7, 8.3).
Which interval is which?
Let $X$ be normally distributed with mean $\mu = 3$ and variance $\sigma^2 = 25$, i.e., $X \sim N(3, 5^2)$
myDraws <- rnorm(1000, mean = 3, sd = 5)
xBar <- mean(myDraws)
n <- length(myDraws)
halfWidth <- qnorm(.975)*5/sqrt(n)
(lower <- xBar - halfWidth)
[1] 2.430562
(upper <- xBar + halfWidth)
[1] 3.050357
Hence a 95% confidence interval for μ is (2.4305619, 3.0503569).
set.seed(0)
# One row per simulated data set: endpoints of its 95% confidence interval
CIsDF <- data.frame(lower = rep(NA, 5000), upper = rep(NA, 5000))
for (i in 1:nrow(CIsDF)) {
  myDraws <- rnorm(1000, mean = 3, sd = 5)
  xBar <- mean(myDraws)
  n <- length(myDraws)
  halfWidth <- qnorm(.975)*5/sqrt(n)
  CIsDF[i, "lower"] <- xBar - halfWidth
  CIsDF[i, "upper"] <- xBar + halfWidth
}
How many of the 5000 CIs do we expect include the true population mean, μ=3?
Recall interpretation of confidence intervals: If we were to repeat this procedure a large number of times, sampling and constructing confidence intervals in the same way, 95% of constructed intervals would contain the true population parameter.
If we repeat the experiment 5,000 times, i.e., draw samples and construct 5,000 confidence intervals, we would expect 4,750 of these to contain the true population parameter
library(dplyr)    # for %>%, mutate() and slice()
library(ggplot2)  # for the plot
CIsDF %>%
  mutate(index = 1:nrow(CIsDF), estimate = (lower + upper)/2) %>%
  slice(1:20) %>%
  ggplot(aes(estimate, index)) +
  geom_pointrange(aes(xmin = lower, xmax = upper)) +
  geom_vline(xintercept = 3, colour = "grey60", linetype = 2) +
  labs(title = "First 20 CIs for population mean", x = "Estimate", y = "Index")
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3)
[1] 4769
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3) / nrow(CIsDF)
[1] 0.9538
set.seed(0)
CIsDF <- data.frame(lower = rep(NA, 5000), upper = rep(NA, 5000))
for (i in 1:nrow(CIsDF)) {
  myDraws <- rnorm(1000, mean = 3, sd = 5)
  xBar <- mean(myDraws)
  n <- length(myDraws)
  # qnorm(.95) leaves 0.05 in each tail, so these are 90% intervals
  halfWidth <- qnorm(.95)*5/sqrt(n)
  CIsDF[i, "lower"] <- xBar - halfWidth
  CIsDF[i, "upper"] <- xBar + halfWidth
}
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3)
[1] 4542
sum(CIsDF$lower <= 3 & CIsDF$upper >= 3) / nrow(CIsDF)
[1] 0.9084
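The two simulations above differ only in the critical value, so they can be wrapped in one function of the confidence level. This is a sketch under the same setup (n = 1000 draws from N(3, 5^2), known sigma); the helper name estimateCoverage, the seed and the smaller number of repetitions (2000 rather than 5000) are choices made here, not taken from the slides.

```r
# Hypothetical helper: estimate by simulation how often the known-sigma
# z interval contains the true mean.
estimateCoverage <- function(confLevel, nSim = 2000, n = 1000, mu = 3, sigma = 5) {
  zCrit <- qnorm(1 - (1 - confLevel) / 2)
  halfWidth <- zCrit * sigma / sqrt(n)
  # One sample mean per simulated data set
  xBars <- replicate(nSim, mean(rnorm(n, mean = mu, sd = sigma)))
  # Proportion of intervals that contain mu
  mean(xBars - halfWidth <= mu & mu <= xBars + halfWidth)
}

set.seed(1)  # arbitrary seed, used only for reproducibility
levels <- c(0.90, 0.95, 0.99)
estCoverage <- sapply(levels, estimateCoverage)
data.frame(nominal = levels, simulated = estCoverage)
```

Each simulated proportion should land close to its nominal level, with some simulation noise that shrinks as nSim grows.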
The confidence interval involves $\sigma^2$; most of the time, this is unknown
Recall that we can use the sample variance, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, to estimate $\sigma^2$
Before collecting the data: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$
If we estimate $\sigma^2$ using $S^2$, then we can't use the Central Limit Theorem in the same way
Two ways we can make progress:
If $n$ is large, the CLT still holds (for the $\sigma^2$ version), and using another theorem (out of scope for this class), we can prove that $\frac{\bar{X} - \mu}{S/\sqrt{n}} \approx N(0, 1)$
Notice that the only difference is that we have replaced $\sigma$ by $S$.
A $100(1-\alpha)\%$ confidence interval for $\mu$ is $\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$
Just like before, it doesn't matter what the distribution of $X_i$ is
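Let $X_1, X_2, \ldots, X_{200}$ be independent $N(\mu, 4)$ random variables. We collect a sample of size 200, and the resulting sample mean is $\bar{x} = 24$. What is a 95% confidence interval for $\mu$?
Since $n = 200$ is large, by the CLT, a 95% confidence interval for $\mu$ is given by $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
Substitute $\bar{x} = 24$, $z_{\alpha/2} = 1.96$, $\sigma = 2$, $n = 200$, giving $\left(24 - 1.96 \cdot 2/\sqrt{200},\ 24 + 1.96 \cdot 2/\sqrt{200}\right)$, or (23.72, 24.28)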
We are 95% confident that μ falls within the interval (23.72, 24.28).
Let $X_1, X_2, \ldots, X_{200}$ be independent $N(\mu, \sigma^2)$ random variables. We collect a sample of size 200, and the resulting sample mean is $\bar{x} = 24$. $\sigma^2$ is unknown, but using our sample, we calculate the sample variance, $s^2$, to be 4.1. What is a 95% confidence interval for $\mu$?
Since $n = 200$ is large, a 95% confidence interval for $\mu$ is given by $\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$
Substitute $\bar{x} = 24$, $z_{\alpha/2} = 1.96$, $s = \sqrt{4.1}$, $n = 200$, giving $\left(24 - 1.96 \cdot \sqrt{4.1}/\sqrt{200},\ 24 + 1.96 \cdot \sqrt{4.1}/\sqrt{200}\right)$, or (23.72, 24.28)
We are 95% confident that μ falls within the interval (23.72, 24.28).
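The known-variance and estimated-variance examples give the same interval to two decimals. The sketch below, using only the summary values from the two examples, shows where they differ; the variable names are chosen here for illustration.

```r
xBar <- 24
n <- 200
zCrit <- qnorm(0.975)

# Estimated variance: s^2 = 4.1
ciUnknown <- xBar + c(lower = -1, upper = 1) * zCrit * sqrt(4.1 / n)

# Known variance: sigma^2 = 4
ciKnown <- xBar + c(lower = -1, upper = 1) * zCrit * sqrt(4 / n)

round(rbind(ciUnknown, ciKnown), 3)
```

With three decimals the interval based on $s^2 = 4.1$ is slightly wider, as expected since $s^2$ is a little larger than 4, but both round to (23.72, 24.28).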
If $n$ is small, we cannot use the CLT, so
$\frac{\bar{X} - \mu}{S/\sqrt{n}}$ is not approximately $N(0, 1)$
Instead, $\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$ when the $X_i$ are independent and $\sim N(\mu, \sigma^2)$
In other words, the underlying distribution of $X_i$ is now restricted to normal
A $100(1-\alpha)\%$ confidence interval for $\mu$ is $\left(\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right)$
The notation $t_{n-1,\alpha/2}$ means the quantile corresponding to a probability of $\alpha/2$ in the right tail, for a $t_{n-1}$ distribution. We can use qt(alpha/2, df = n-1, lower.tail = FALSE), e.g.,
qt(.025, df = 19, lower.tail = FALSE)
[1] 2.093024
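As an illustration of the small-sample formula, the sketch below builds a 95% t interval by hand and compares it with the interval returned by t.test(). The eight data values are invented for this example and are assumed to come from a normal population.

```r
# Hypothetical small sample (n = 8), invented for illustration only
x <- c(23.1, 25.4, 24.2, 22.8, 24.9, 23.7, 25.0, 24.3)

n <- length(x)
xBar <- mean(x)
s <- sd(x)                                           # sample standard deviation
tCrit <- qt(0.025, df = n - 1, lower.tail = FALSE)   # t_{n-1, alpha/2}
halfWidth <- tCrit * s / sqrt(n)

manual <- c(xBar - halfWidth, xBar + halfWidth)
builtIn <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

all.equal(manual, builtIn)
```

The comparison returns TRUE because t.test() uses the same formula for its confidence interval.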
While working for Guinness Brewery in Dublin, William Sealy Gosset published a paper on the t distribution, which became known as Student's t distribution
He used the new distribution to determine how large a sample of persons to use in taste-testing beer
The t distribution has a parameter called degrees of freedom, abbreviated df. For each possible df, there is a different t distribution, with the t distribution looking more like the normal as df gets large.
As the sample size gets bigger, the t distribution approximates the normal distribution
The random variable $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ has a Student's t distribution with $n - 1$ degrees of freedom, represented using the notation $t_{n-1}$.
In other words, when we replace $\sigma$ by $S$, the distribution is now t and not standard normal
A $100(1-\alpha)\%$ confidence interval for $\mu$ is $\left(\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right)$
qnorm(.975)
[1] 1.959964
qt(.975, df = 19)
[1] 2.093024
The t distribution is used to construct confidence intervals for the mean when we need to account for the additional variability due to estimating σ in addition to μ
The fatter tails lead to wider confidence intervals, which makes sense since there is extra uncertainty due to the estimation of $\sigma^2$
Very roughly speaking, the degrees of freedom measure the amount of information available in the data to estimate $\sigma^2$ and thus tell us how reliable our estimate $s^2$ is.
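To see how quickly the t critical value approaches the normal one, the sketch below compares qt(.975, df) with qnorm(.975) over a range of degrees of freedom. The particular df values are arbitrary choices, apart from df = 19, which matches the example above.

```r
dfs <- c(2, 5, 10, 19, 30, 100, 1000)
tCrit <- qt(0.975, df = dfs)   # 95% critical values for each df
zCrit <- qnorm(0.975)          # 95% critical value for the standard normal

critTable <- data.frame(df = dfs,
                        tCrit = round(tCrit, 3),
                        ratioToZ = round(tCrit / zCrit, 3))
critTable
```

The ratio column shows how much wider a t interval is than the corresponding z interval for the same s and n; it decreases towards 1 as df grows.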
Case 1: When n is large
Case 2: When n is small
From April 25, 2014 to October 15, 2015, the water supply source for Flint, MI was switched to the Flint River from the Detroit water system. Without corrosion inhibitors, the Flint River water, which is high in chloride, caused lead from aging pipes to leach into the water supply. We have data from Flint collected as part of a citizen-science project involving Virginia Tech researchers.
flint <- readxl::read_excel("./data/Flint-Samples.xlsx", sheet = 1) %>%
  rename("Pb_initial" = "Pb Bottle 1 (ppb) - First Draw")
str(flint)
tibble [271 x 7] (S3: tbl_df/tbl/data.frame)
 $ SampleID                            : num [1:271] 1 2 4 5 6 7 8 9 12 13 ...
 $ Zip Code                            : num [1:271] 48504 48507 48504 48507 48505 ...
 $ Ward                                : num [1:271] 6 9 1 8 3 9 9 5 9 3 ...
 $ Pb_initial                          : num [1:271] 0.344 8.133 1.111 8.007 1.951 ...
 $ Pb Bottle 2 (ppb) - 45 secs flushing: num [1:271] 0.226 10.77 0.11 7.446 0.048 ...
 $ Pb Bottle 3 (ppb) - 2 mins flushing : num [1:271] 0.145 2.761 0.123 3.384 0.035 ...
 $ Notes                               : chr [1:271] NA NA NA NA ...
Each row is a sample, and the lead level in the sample is in the Pb_initial column. The units are parts per billion (ppb)
We want to construct a confidence interval for the unknown population mean lead level in Flint water
What is the sample size n? What confidence interval should we use?
Here there are 271 samples, so n qualifies as large
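For a 95% confidence interval: $\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$
xBar <- mean(flint$Pb_initial)
sigma2 <- var(flint$Pb_initial)  # sample variance s^2, our estimate of sigma^2
n <- length(flint$Pb_initial)
halfWidth <- qnorm(.975)*sqrt(sigma2/n)
(lower <- xBar - halfWidth)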
[1] 8.078981
(upper <- xBar + halfWidth)
[1] 13.213
t.test(flint$Pb_initial)
	One Sample t-test

data:  flint$Pb_initial
t = 8.1284, df = 270, p-value = 1.58e-14
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  8.067422 13.224563
sample estimates:
mean of x 
 10.64599 
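The interval from t.test() is slightly wider than the z-based interval computed by hand. The sketch below reconstructs both from the summary numbers printed above, without needing the data file; recovering the standard error from the reported t statistic is a shortcut taken here, and the results match the printed values only up to the rounding of those inputs.

```r
# Summary values copied from the output above
xBar <- 10.64599
tStat <- 8.1284
n <- 271

# t = (xBar - 0) / se, so the standard error s / sqrt(n) is xBar / t
se <- xBar / tStat

zCI <- xBar + c(-1, 1) * qnorm(0.975) * se        # large-sample z interval
tCI <- xBar + c(-1, 1) * qt(0.975, df = n - 1) * se  # t interval with df = 270

round(rbind(zCI, tCI), 2)
```

The only difference between the two rows is the critical value: with 270 degrees of freedom, qt(.975, df = 270) is just slightly larger than qnorm(.975), so the intervals nearly coincide.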
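Recall that when $X$ is a Bernoulli random variable, the sample mean $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ is the same as the sample proportion $\hat{P}$.
We have $\bar{X} = \hat{P} \approx N\left(p, \frac{p(1-p)}{n}\right)$ by the Central Limit Theorem, when $n$ is large
$p$ is unknown, but we can replace it by $\hat{p}$, calculated from the sample, in the same way we replaced $\sigma$ by $s$ for the population mean (case 1, when $n$ is large), so a $100(1-\alpha)\%$ confidence interval for $p$ is
$\left(\hat{p} - z_{\alpha/2}\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}},\ \hat{p} + z_{\alpha/2}\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}\right)$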
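The formula above translates directly into a short function. This is a sketch only: the helper name propInterval and the counts used to call it (120 successes in 400 trials) are hypothetical and chosen for illustration.

```r
# Hypothetical helper: large-sample confidence interval for a proportion,
# following the formula above.
propInterval <- function(successes, n, confLevel = 0.95) {
  pHat <- successes / n
  zCrit <- qnorm(1 - (1 - confLevel) / 2)
  halfWidth <- zCrit * sqrt(pHat * (1 - pHat) / n)
  c(estimate = pHat, lower = pHat - halfWidth, upper = pHat + halfWidth)
}

# Hypothetical data: 120 successes out of 400 trials
res <- propInterval(120, 400)
round(res, 4)
```

As with the mean, the interval is centred on the point estimate and its width is governed by the critical value and by $\sqrt{\hat{p}(1-\hat{p})/n}$.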
Source: https://www.rasmussenreports.com/public_content/politics/biden_administration/prez_track_sep23
qnorm(.975)*sqrt(.5^2/1500)
[1] 0.02530303
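The calculation above evaluates $z_{\alpha/2}\sqrt{p(1-p)/n}$ at p = 0.5 and n = 1500, since $0.5^2 = p(1-p)$ when p = 0.5. Because p(1 - p) is largest at p = 0.5, this can be read as a worst-case half-width for a 95% interval; that reading is an interpretation of the code, as no accompanying text survives on the slide. The sketch below shows how this half-width changes with n; the helper name marginOfError and the sample sizes other than 1500 are arbitrary choices.

```r
# Hypothetical helper: half-width of the large-sample interval for a proportion.
marginOfError <- function(n, p = 0.5, confLevel = 0.95) {
  qnorm(1 - (1 - confLevel) / 2) * sqrt(p * (1 - p) / n)
}

sizes <- c(500, 1000, 1500, 3000)
moeTable <- data.frame(n = sizes, margin = round(marginOfError(sizes), 4))
moeTable
```

Because the half-width scales with $1/\sqrt{n}$, halving it requires roughly four times as many observations.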
The EPA action level for lead in public water supplies is 15 ppb. Let's calculate the estimated proportion of samples (homes) in Flint with lead levels over 15 ppb along with a 95% confidence interval for the proportion.
flint <- flint %>% mutate(Pbover15 = Pb_initial > 15)
pHat <- mean(flint$Pbover15)
var <- pHat*(1 - pHat)/length(flint$Pbover15)
halfWidth <- qnorm(.975)*sqrt(var)
(lower <- pHat - halfWidth)
[1] 0.1217465
(upper <- pHat + halfWidth)
[1] 0.2103569
Hence a 95% confidence interval for the proportion of homes in Flint with lead level above 15 ppb is (0.1217465, 0.2103569).
More on confidence intervals:
Changing n and alpha
Simulation example
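$\sigma^2$ unknown:
Confidence interval for population proportion: $\left(\hat{p} - z_{\alpha/2}\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}},\ \hat{p} + z_{\alpha/2}\frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}\right)$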