Upload a PDF file, named with your UC Davis email ID and homework number (e.g., `xjw18_hw7.pdf`), to Gradescope (accessible through Canvas). Give the commands that answer each question in their own code block; the code block will also produce output that is automatically embedded in the output file. When asked, answers must be supported by written statements as well as by any code used.
All code used to produce your results must be shown in your PDF file (e.g., do not use `echo = FALSE` or `include = FALSE` as chunk options anywhere). Rmd files do not need to be submitted, but they may be requested by the TA and must be available when the assignment is submitted.
Students may choose to collaborate with each other on the homework, but must clearly indicate with whom they collaborated. Every student must upload their own submission.
Start working on it as early as possible. Finishing this homework can help you prepare for Midterm 1.
When the result you want to show is a vector that is too long, slice out only the first 10 elements. When the result is a data frame, call `head()` on it. Failure to do so may lead to a point deduction.
Knitting the Rmd file directly will give you an HTML file. Open that file in your browser and then print it to a PDF file.
```r
# Load necessary libraries
library(tidyverse)
library(ggplot2)
library(MASS)
```
There are 14 variables (columns) and 506 rows in the Boston house values data frame in the `MASS` package. You may run `names(Boston)` to look at all of those variables; however, we are only going to look at 2 of them:

- `lstat`: lower status of the population (percent).
- `medv`: median value of owner-occupied homes in $1000s.

We want to build a model that uses the lower status of the population (percent) to predict the median value of owner-occupied homes in $1000s.
- Find the estimated regression line. Use `summary()` to show the result.
- Interpret the slope of the estimated regression line in terms of the problem.
- Interpret the intercept of the estimated regression line in terms of the problem (if appropriate).
- Predict the median value of owner-occupied homes in $1000s when the lower status of the population is 25%.
- Interpret the value of R squared (`Multiple R-squared` in the `summary` output) in terms of the problem.
- What is the \(s_e\) (\(\hat \sigma\)) from the regression result?
- Find the 99% confidence intervals for the slope and intercept using `confint()`.
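A minimal sketch of the commands these questions call for (the written interpretations are up to you; `fit_boston` is a name chosen here for illustration):

```r
library(MASS)   # Boston data

# Estimated regression line and full summary output
fit_boston <- lm(medv ~ lstat, data = Boston)
summary(fit_boston)

# Predicted medv (in $1000s) when lstat = 25%
predict(fit_boston, newdata = data.frame(lstat = 25))

# 99% confidence intervals for the intercept and slope
confint(fit_boston, level = 0.99)
```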
In this problem, we will use simulation to study the distribution of the regression estimates.
Suppose our true model is:
\[Y = \beta_0 +\beta_1 X +\varepsilon\]
Where \(X \sim Uniform(-3,3)\), \(\varepsilon \sim N(0,\sigma^2)\).
When \(\beta_0 = 1\), \(\beta_1 = 2\), \(\sigma = 1\), write R code to simulate 100 pairs of \(X\) and \(Y\), and store them in a data frame. Use a scatter plot to show the data.
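A minimal sketch under those settings (the seed is an assumption; the provided code later uses `set.seed(123)`):

```r
library(ggplot2)

set.seed(123)                      # assumed seed for reproducibility
n <- 100
x <- runif(n, min = -3, max = 3)   # X ~ Uniform(-3, 3)
y <- 1 + 2 * x + rnorm(n, 0, 1)    # Y = beta_0 + beta_1 * X + epsilon
df <- data.frame(X = x, Y = y)

ggplot(df, aes(X, Y)) +
  geom_point()
```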
Wrap the code that generates the data frame into a function named `simulate_1`. The function should take the inputs `beta_0`, `beta_1`, `sigma`, and `n`. Use the values from part 1 as the default argument values (defaults let you call the function with an empty argument list).
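One way to write it, as a sketch:

```r
simulate_1 <- function(beta_0 = 1, beta_1 = 2, sigma = 1, n = 100) {
  x <- runif(n, -3, 3)
  y <- beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = sigma)
  data.frame(X = x, Y = y)
}
```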
Now use `simulate_1()` to simulate 1 dataset (leave the `()` blank to use the default argument values). Then fit the model first with the `lm` function, then with the formula from the lecture slides, and check that the two results are the same.
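A sketch of both fits (the provided chunk below repeats the `lm()` part), assuming the slide formulas are the usual least-squares ones, \(\hat\beta_1 = \sum(x_i-\bar x)(y_i-\bar y) / \sum(x_i-\bar x)^2\) and \(\hat\beta_0 = \bar y - \hat\beta_1 \bar x\):

```r
set.seed(123)
df <- simulate_1()

# Fit with lm()
fit <- lm(Y ~ X, data = df)
coef(fit)

# Fit with the least-squares formulas by hand
b1 <- sum((df$X - mean(df$X)) * (df$Y - mean(df$Y))) /
      sum((df$X - mean(df$X))^2)
b0 <- mean(df$Y) - b1 * mean(df$X)
c(b0, b1)   # should match coef(fit)
```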
Here I provide the code to fit a linear model and obtain its estimates of the slope and intercept, along with the estimate of the sample SD of the residuals (`sigma_hat` in the code):

```r
# suppose the dataset name is df:
set.seed(123)
df = simulate_1()
fit = lm(Y ~ X, df)
slope = fit$coefficients[2]       # estimated beta_1
intercept = fit$coefficients[1]   # estimated beta_0
sigma_hat = sigma(fit)            # residual standard error
```
Now calculate `sigma_hat` by the definition in the lecture notes (the sample SD of the residuals) and check that it is the same as the result of `sigma(fit)`. In your calculation, use the dataset `df` and the fitted values of the slope and intercept provided in the previous code chunk.
Hint: the \(e_i\) are the differences between the actual \(y_i\) and the predicted values \(\hat y_i\). The predicted values can be obtained by plugging \(x_i\) into the fitted regression line.
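A sketch of that calculation, assuming the lecture-note definition \(\hat\sigma = \sqrt{\sum_i e_i^2 / (n-2)}\):

```r
y_hat <- intercept + slope * df$X        # predicted values from the fitted line
e <- df$Y - y_hat                        # residuals
n <- nrow(df)
sigma_hat_manual <- sqrt(sum(e^2) / (n - 2))
c(sigma_hat_manual, sigma(fit))          # the two should agree
```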
Now repeat the simulation and the model fit 500 times, storing the slope, intercept, and `sigma_hat` from each fit in a data frame named `regression_simulation`:

```r
regression_simulation = data.frame(
  index = 1:500, slope = numeric(500),
  intercept = numeric(500), sigma_hat = numeric(500)
)
# you need to fill in the corresponding result for slope,
# intercept, sigma_hat in the for loop.
```
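A sketch of one way to fill it in; the loop body follows the single-fit code above:

```r
for (i in 1:500) {
  df_i <- simulate_1()                   # a fresh dataset each iteration
  fit_i <- lm(Y ~ X, df_i)
  regression_simulation$slope[i]     <- fit_i$coefficients[2]
  regression_simulation$intercept[i] <- fit_i$coefficients[1]
  regression_simulation$sigma_hat[i] <- sigma(fit_i)
}
```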
Now we have 500 samples of the regression line estimates: slope, intercept, and sigma_hat. Summarise their means and standard deviations using the `summarise` function.
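A sketch with `dplyr::summarise`; the `across()` form is one option:

```r
library(dplyr)

regression_simulation %>%
  summarise(across(c(slope, intercept, sigma_hat),
                   list(mean = mean, sd = sd)))
```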
Call `summary(fit)` and compare the means from part 6 with the actual values of \(\beta_0, \beta_1, \sigma\). What do you find? Then compare the sds from part 6 with the `Std. Error` of the intercept and of X; what do you find?
Understand the p value in `summary`. From the lecture, we know that `summary` tests the null hypothesis that the slope coefficient is 0 against the alternative hypothesis that the slope coefficient is not 0. The test statistic is:
\[T = \frac{\hat \beta_1 - \beta_1}{SE(\hat \beta_1)} = \frac{\hat \beta_1 - 0 }{SE(\hat \beta_1)} \sim t_{n-2} \text{ (Under null hypothesis)}\]
Now we will use our simulated distribution of \(\hat\beta_1\) to check the p value of \(H_0: \beta_1 = 1.9\).
Now draw the histogram of `regression_simulation$slope`, and add a vertical line at 1.9.
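A ggplot sketch (the number of bins and the line color are assumptions):

```r
ggplot(regression_simulation, aes(slope)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = 1.9, color = "red")
```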
Also calculate the \(T\) statistic for \(H_0: \beta_1 = 1.9\) by plugging in \(\beta_1 = 1.9\) (use the Problem 7 summary result), and calculate the corresponding p value using `pt(t_statistic, df = n-2, lower.tail = F)`. What should the `df` be based on the summary? Does the p value explain the histogram and the vertical line?
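A sketch, reading \(\hat\beta_1\) and its standard error off the coefficient table of `summary(fit)` (the variable names here are illustrative):

```r
s <- summary(fit)
slope_hat <- s$coefficients["X", "Estimate"]
se_slope  <- s$coefficients["X", "Std. Error"]

t_statistic <- (slope_hat - 1.9) / se_slope
pt(t_statistic, df = nrow(df) - 2, lower.tail = FALSE)   # df = n - 2
```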
Now in this problem, we will look at how a violation of the linearity assumption of regression affects the linear regression model.
Suppose our true model is:
\[Y = \beta_0 +\beta_1 (X-1.5)^2 +\varepsilon\]
Where \(X \sim Uniform(-3,3)\), \(\varepsilon \sim N(0,\sigma^2)\), \(\beta_0 = 1\), \(\beta_1 = 2\), \(\sigma = 4\).
Write R code to simulate 100 pairs of \(X\) and \(Y\), and store them in a data frame. Use a scatter plot to show the data. Add layer 1, `geom_smooth(color = "blue")`, to add the true fitted (smooth) line, and add layer 2, `geom_smooth(method = lm, color = "red")`, to add the linear regression fitted line. Remember to `set.seed(123)`.
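A sketch of the quadratic-truth simulation with the two smoothing layers:

```r
set.seed(123)
n <- 100
x <- runif(n, -3, 3)
y <- 1 + 2 * (x - 1.5)^2 + rnorm(n, 0, 4)   # true model is quadratic in X
df <- data.frame(X = x, Y = y)

ggplot(df, aes(X, Y)) +
  geom_point() +
  geom_smooth(color = "blue") +             # smooth fit tracks the true curve
  geom_smooth(method = lm, color = "red")   # straight-line fit
```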
Is the linear regression still a good model for this dataset? Why?
Now fit the linear model on the data you generated. Save the fitted model as `fit`. Then use `plot(fit, 1)` to generate the residuals vs. fitted plot. Interpret the residuals vs. fitted plot.
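The corresponding commands, as a sketch:

```r
fit <- lm(Y ~ X, df)
plot(fit, 1)   # a curved pattern in the residuals indicates the missed nonlinearity
```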
Now fit the polynomial model on the data you generated using `fit2 = lm(Y ~ X + I(X^2), df)`. Then use `plot(fit2, 1)` to generate the residuals vs. fitted plot. Check `summary(fit2)` in your console (don't print it in the Rmd) and then write down the fitted model in the mathematical form \(\hat Y = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 X^2\). Also check the residuals vs. fitted plot.
Draw the curve \(\hat Y = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 X^2\) together with the dataset using ggplot.
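A sketch for the quadratic fit; overlaying the fitted curve via `geom_smooth` with a quadratic formula is one option:

```r
fit2 <- lm(Y ~ X + I(X^2), df)
plot(fit2, 1)   # residuals should now scatter evenly around 0

ggplot(df, aes(X, Y)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x + I(x^2),
              se = FALSE, color = "red")   # fitted quadratic curve
```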
Now in this problem, we will look at how a violation of the equal residual variance assumption of regression affects the linear regression model.
Suppose our true model is:
\[Y = \beta_0 +\beta_1 X +\varepsilon\]
Where \(X \sim Uniform(-3,3)\), \(\varepsilon \sim N(0,(\sigma \cdot |X|)^2)\), \(\beta_0 = 1\), \(\beta_1 = 2\), \(\sigma = 4\).
Write R code to simulate 100 pairs of \(X\) and \(Y\), and store them in a data frame. Use a scatter plot to show the data, and add `geom_smooth(method = lm, color = "red")` to add the fitted line. Remember to `set.seed(123)`.
Hint: `rnorm(n, mean_vector, sd_vector)` will generate n normal RVs; the i-th mean and i-th sd are the corresponding entries of `mean_vector` and `sd_vector`.
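A sketch of the heteroscedastic simulation, using the vectorized sd argument from the hint:

```r
set.seed(123)
n <- 100
x <- runif(n, -3, 3)
eps <- rnorm(n, mean = 0, sd = 4 * abs(x))   # sd grows with |X|
y <- 1 + 2 * x + eps
df <- data.frame(X = x, Y = y)

ggplot(df, aes(X, Y)) +
  geom_point() +
  geom_smooth(method = lm, color = "red")
```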
Is the linear regression still a good model for this dataset? Why?
Now fit the linear model on the data you generated. Save the fitted model as `fit`. Then use `plot(fit, 1)` to generate the residuals vs. fitted plot. Interpret the residuals vs. fitted plot.
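As before, a sketch:

```r
fit <- lm(Y ~ X, df)
plot(fit, 1)   # the residual spread varies with the fitted values (unequal variance)
```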
Names:
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MASS_7.3-54 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
[5] purrr_0.3.4 readr_2.0.2 tidyr_1.1.4 tibble_3.1.5
[9] ggplot2_3.3.5 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] tidyselect_1.1.1 xfun_0.33 bslib_0.3.1 haven_2.4.3
[5] colorspace_2.0-2 vctrs_0.6.0 generics_0.1.2 htmltools_0.5.2
[9] yaml_2.2.1 utf8_1.2.2 rlang_1.1.0 jquerylib_0.1.4
[13] pillar_1.6.3 withr_2.5.0 glue_1.4.2 DBI_1.1.1
[17] dbplyr_2.1.1 modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.3
[21] cellranger_1.1.0 munsell_0.5.0 gtable_0.3.0 rvest_1.0.1
[25] evaluate_0.14 knitr_1.34 tzdb_0.1.2 fastmap_1.1.0
[29] fansi_0.5.0 broom_0.7.9 Rcpp_1.0.7 backports_1.2.1
[33] scales_1.1.1 jsonlite_1.7.2 fs_1.5.0 hms_1.1.1
[37] digest_0.6.28 stringi_1.7.4 grid_4.1.0 cli_3.6.0
[41] tools_4.1.0 magrittr_2.0.1 sass_0.4.0 crayon_1.4.1
[45] pkgconfig_2.0.3 ellipsis_0.3.2 xml2_1.3.2 reprex_2.0.1
[49] lubridate_1.8.0 assertthat_0.2.1 rmarkdown_2.11 httr_1.4.2
[53] rstudioapi_0.13 R6_2.5.1 compiler_4.1.0