Instructions

  • Upload a PDF file, named with your UC Davis email ID and the homework number (e.g., xjw18_hw7.pdf), to Gradescope (accessible through Canvas). Give the commands that answer each question in their own code block; the output they produce will be embedded automatically in the output file. When asked, answers must be supported by written statements as well as by any code used.

  • All code used to produce your results must be shown in your PDF file (e.g., do not use echo = FALSE or include = FALSE as options anywhere). Rmd files do not need to be submitted, but may be requested by the TA and must be available when the assignment is submitted.

  • Students may choose to collaborate with each other on the homework, but must clearly indicate with whom they collaborated. Every student must upload their own submission.

  • Start working on it as early as possible. Finishing this homework can help you prepare for Midterm 1.

  • When you want to show a result that is a long vector, show only its first 10 elements. When you want to show a result that is a data frame, use head() on it. Failure to do so may lead to a point deduction.

  • Knitting the Rmd file directly will give you an HTML file. Open that file in your browser, and from there you can print it to a PDF file.

# Load necessary libraries
library(tidyverse)
library(ggplot2)
library(MASS)

Problem 1: the Boston data set

There are 14 variables (columns) and 506 rows in the Boston house values data frame in the MASS package. You can see all of the variables by running names(Boston); however, we are only going to use two of them:

  • lstat: lower status of the population (percent).

  • medv: median value of owner-occupied homes in $1000s.

We want to build a model that uses the lower status of the population (percent) to predict the median value of owner-occupied homes in $1000s.

  1. Find the estimated regression line. Use summary to show the result.

  2. Interpret the slope of the estimated regression line in terms of the problem.

  3. Interpret the intercept of the estimated regression line in terms of the problem (if appropriate).

  4. Predict the median value of owner-occupied homes in $1000 when lower status of the population is 25%.

  5. Interpret the value of R squared (Multiple R-squared in summary) in terms of the problem.

  6. What is the \(s_e\) (\(\hat \sigma\)) from the regression result?

  7. Find the 99% confidence interval for the slope and intercept using confint.
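For reference, a minimal sketch of the commands this problem calls for (fit_boston is just a suggested name; lm, summary, predict, and confint are the relevant functions):

library(MASS)   # the Boston data frame lives here

# Fit the simple linear regression of medv on lstat
fit_boston = lm(medv ~ lstat, data = Boston)
summary(fit_boston)   # coefficients, R-squared, and s_e ("Residual standard error")

# Part 4: predicted medv (in $1000s) when lstat = 25
predict(fit_boston, newdata = data.frame(lstat = 25))

# Part 7: 99% confidence intervals for the intercept and slope
confint(fit_boston, level = 0.99)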

Problem 2: Sampling distribution of regression

In this problem, we will use simulation to study the distribution of the regression estimates.

Suppose our true model is:

\[Y = \beta_0 +\beta_1 X +\varepsilon\]

where \(X \sim Uniform(-3,3)\) and \(\varepsilon \sim N(0,\sigma^2)\).

  1. When \(\beta_0 = 1\), \(\beta_1 = 2\), and \(\sigma = 1\), write R code to simulate 100 pairs of \(X\) and \(Y\) and store them in a data frame. Use a scatter plot to show the data.

  2. Wrap the code that generates the data frame in a function named simulate_1. The function should take beta_0, beta_1, sigma, and n as inputs, with the values from part 1 as the default argument values. One possible version is sketched below.
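A starting-point sketch for parts 1-2 (the function body is one possible implementation, not the required solution):

# One possible implementation; defaults match part 1
simulate_1 = function(beta_0 = 1, beta_1 = 2, sigma = 1, n = 100) {
  X = runif(n, min = -3, max = 3)                 # X ~ Uniform(-3, 3)
  Y = beta_0 + beta_1 * X + rnorm(n, 0, sigma)    # Y = beta_0 + beta_1 * X + eps
  data.frame(X = X, Y = Y)
}

set.seed(123)
df = simulate_1()

# Scatter plot of the simulated data
ggplot(df, aes(x = X, y = Y)) +
  geom_point()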

  3. Now use simulate_1() to simulate one dataset (leave the parentheses empty to use the default argument values). First use the lm function to fit the model, then fit it with the formula from the slides, and check that the results are the same (see the sketch below).
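A sketch of the comparison for part 3, assuming the slides use the usual least-squares formulas \(\hat \beta_1 = \sum_i (x_i-\bar x)(y_i-\bar y) / \sum_i (x_i-\bar x)^2\) and \(\hat \beta_0 = \bar y - \hat \beta_1 \bar x\) (check against your own lecture notes):

fit = lm(Y ~ X, df)
coef(fit)

# Closed-form least-squares estimates
x_bar = mean(df$X); y_bar = mean(df$Y)
b1 = sum((df$X - x_bar) * (df$Y - y_bar)) / sum((df$X - x_bar)^2)
b0 = y_bar - b1 * x_bar
c(intercept = b0, slope = b1)   # should match coef(fit)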

  4. Here I provide the code to fit a linear model and obtain its estimates of the slope and intercept, plus the sample SD of the residuals (sigma_hat in the code):

#suppose the dataset name is df:
set.seed(123)
df = simulate_1()

fit = lm(Y~X, df)
slope = fit$coefficients[2]
intercept = fit$coefficients[1]
sigma_hat = sigma(fit)

Now calculate sigma_hat from its definition in the lecture notes (the sample SD of the residuals) and check that it matches the result of sigma(fit). In your calculation, use the dataset df and the fitted slope and intercept from the previous code chunk.

Hint: \(e_i\) is the difference between the actual \(y_i\) and the predicted value \(\hat y_i\). The predicted value is obtained by plugging \(x_i\) into the fitted regression line.
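A sketch of this check, using df, slope, intercept, and fit from the chunk above, and assuming the lecture defines \(\hat \sigma = \sqrt{\sum_i e_i^2 / (n-2)}\):

# Residuals e_i: actual Y minus the value predicted by the fitted line
e = df$Y - (intercept + slope * df$X)

# Sample SD of the residuals, with n - 2 degrees of freedom
n = nrow(df)
sigma_hat_manual = sqrt(sum(e^2) / (n - 2))

c(manual = sigma_hat_manual, sigma_fit = sigma(fit))   # should agree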

  5. Each time we simulate 100 pairs of points, we can calculate the estimated slope and intercept, plus the estimated error standard deviation sigma_hat, for those 100 pairs. Now use a for loop to simulate 500 samples of 100 pairs each, and calculate the estimated slope, intercept, and sigma_hat for each sample. Store the results in the data frame regression_simulation (one possible loop body is sketched after this skeleton).
regression_simulation = data.frame(
  index = 1:500, slope = numeric(500),
  intercept = numeric(500), sigma_hat = numeric(500)
)
#you need to fill in the corresponding result for slope,
# intercept, sigma_hat in the for loop.
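One possible way to fill in the loop (a sketch; each iteration draws a fresh dataset and records the three estimates):

set.seed(123)
for (i in 1:500) {
  df_i  = simulate_1()          # a fresh sample of 100 pairs
  fit_i = lm(Y ~ X, df_i)
  regression_simulation$slope[i]     = fit_i$coefficients[2]
  regression_simulation$intercept[i] = fit_i$coefficients[1]
  regression_simulation$sigma_hat[i] = sigma(fit_i)
}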
  6. Now we have 500 samples of the regression line estimates: slope, intercept, and sigma_hat. Summarize their means and standard deviations using the summarise function (a sketch follows).
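A sketch using summarise with across (one convenient way to cover all three columns at once):

regression_simulation %>%
  summarise(across(c(slope, intercept, sigma_hat),
                   list(mean = mean, sd = sd)))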

  7. Call summary(fit) and compare the means from part 6 with the actual values of \(\beta_0\), \(\beta_1\), and \(\sigma\). What do you find? Then compare the SDs from part 6 with the Std. Error of the intercept and of X. What do you find?

  8. Understand the p value in summary. From the lecture, we know that summary tests the null hypothesis that the slope coefficient is 0 against the alternative hypothesis that the slope coefficient is not 0. The test statistic is:

\[T = \frac{\hat \beta_1 - \beta_1}{SE(\hat \beta_1)} = \frac{\hat \beta_1 - 0 }{SE(\hat \beta_1)} \sim t_{n-2} \text{ (Under null hypothesis)}\]

Now we will use our simulated distribution of \(\hat \beta_1\) to check the p value for \(H_0: \beta_1 = 1.9\).

Now draw a histogram of regression_simulation$slope and add a vertical line at 1.9.

Also calculate the \(T\) statistic for \(H_0: \beta_1 = 1.9\) by plugging in \(\beta_1 = 1.9\) (use the summary result from part 7), and calculate the corresponding p value with pt(t_statistic, df = n-2, lower.tail = F). What should df be, based on the summary? Does the p value explain the histogram and the vertical line? A sketch follows.
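A sketch of both steps, reusing fit and df from part 4 and regression_simulation from part 5 (bins = 30 is an arbitrary choice):

# Histogram of the 500 simulated slopes, with a vertical line at 1.9
ggplot(regression_simulation, aes(x = slope)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = 1.9, color = "red")

# T statistic for H0: beta_1 = 1.9, using the estimate and SE from summary(fit)
beta_1_hat  = coef(fit)[2]
se_beta_1   = summary(fit)$coefficients[2, 2]   # the "Std. Error" entry for X
t_statistic = (beta_1_hat - 1.9) / se_beta_1
pt(t_statistic, df = nrow(df) - 2, lower.tail = F)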

Problem 3: violating the regression assumption of linearity

In this problem, we will look at how violating the regression assumption of linearity affects the linear regression model.

Suppose our true model is:

\[Y = \beta_0 +\beta_1 (X-1.5)^2 +\varepsilon\]

where \(X \sim Uniform(-3,3)\), \(\varepsilon \sim N(0,\sigma^2)\), \(\beta_0 = 1\), \(\beta_1 = 2\), and \(\sigma = 4\).

  1. Write R code to simulate 100 pairs of \(X\) and \(Y\) and store them in a data frame. Use a scatter plot to show the data. Add layer 1, geom_smooth(color = "blue"), to show the true (smoothed) trend, and layer 2, geom_smooth(method = lm, color = "red"), to add the linear regression fitted line. Remember to set.seed(123). A sketch follows this item.
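A sketch of this part under the stated model (reusing the name df; geom_smooth's default smoother draws the blue curve):

set.seed(123)
n = 100
X = runif(n, -3, 3)
Y = 1 + 2 * (X - 1.5)^2 + rnorm(n, 0, 4)   # beta_0 = 1, beta_1 = 2, sigma = 4
df = data.frame(X, Y)

ggplot(df, aes(x = X, y = Y)) +
  geom_point() +
  geom_smooth(color = "blue") +              # layer 1: smoothed (true) trend
  geom_smooth(method = lm, color = "red")    # layer 2: linear regression fit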

  2. Is the linear regression still a good model for this dataset? Why?

  3. Now fit the linear model on the data you generated and save the fitted model as fit. Then use plot(fit, 1) to generate the residuals vs. fitted plot, and interpret it.

  4. Now fit the polynomial model on the data you generated using fit2 = lm(Y ~ X + I(X^2), df). Then use plot(fit2, 1) to generate the residuals vs. fitted plot. Check summary(fit2) in your console (don’t print it in the Rmd) and write down the fitted model in the mathematical form \(Y = \beta_0 + \beta_1 X + \beta_2 X^2\). Also check the residuals vs. fitted plot.

  5. Draw the curve \(Y = \beta_0 + \beta_1 X + \beta_2 X^2\) together with the dataset using ggplot (see the sketch below).
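A sketch for parts 3-5, assuming df is the data frame simulated in part 1 (geom_line orders points by X, so the fitted quadratic draws as a smooth curve; stat_function would also work):

fit = lm(Y ~ X, df)             # part 3: linear fit
plot(fit, 1)                    # look for a curved pattern in residuals vs. fitted

fit2 = lm(Y ~ X + I(X^2), df)   # part 4: quadratic fit
plot(fit2, 1)                   # the curvature should largely disappear

# Part 5: overlay the fitted quadratic on the data
ggplot(df, aes(x = X, y = Y)) +
  geom_point() +
  geom_line(aes(y = fitted(fit2)), color = "red")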

Problem 4: violating the regression assumption of equal residual variance

In this problem, we will look at how violating the regression assumption of equal residual variance affects the linear regression model.

Suppose our true model is:

\[Y = \beta_0 +\beta_1 X +\varepsilon\]

where \(X \sim Uniform(-3,3)\), \(\varepsilon \sim N(0,(\sigma \cdot |X|)^2)\), \(\beta_0 = 1\), \(\beta_1 = 2\), and \(\sigma = 4\).

  1. Write R code to simulate 100 pairs of \(X\) and \(Y\) and store them in a data frame. Use a scatter plot to show the data. Add the layer geom_smooth(method = lm, color = "red") to add the fitted regression line. Remember to set.seed(123). A sketch follows the hint below.

Hint: rnorm(n, mean_vector, sd_vector) generates n normal random variables; the ith one uses the ith entry of mean_vector as its mean and the ith entry of sd_vector as its SD.
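Following the hint, a sketch of part 1 (the sd argument of rnorm is vectorized, so the ith error gets standard deviation \(\sigma |x_i|\); df is reused as the data frame name):

set.seed(123)
n = 100
X = runif(n, -3, 3)
eps = rnorm(n, mean = 0, sd = 4 * abs(X))   # ith error has SD sigma * |x_i|, sigma = 4
Y = 1 + 2 * X + eps                         # beta_0 = 1, beta_1 = 2
df = data.frame(X, Y)

ggplot(df, aes(x = X, y = Y)) +
  geom_point() +
  geom_smooth(method = lm, color = "red")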

  2. Is the linear regression still a good model for this dataset? Why?

  3. Now fit the linear model on the data you generated and save the fitted model as fit. Then use plot(fit, 1) to generate the residuals vs. fitted plot, and interpret it.

Include the person you collaborated with:

Names:

Appendix

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] MASS_7.3-54     forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
 [5] purrr_0.3.4     readr_2.0.2     tidyr_1.1.4     tibble_3.1.5   
 [9] ggplot2_3.3.5   tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1 xfun_0.33        bslib_0.3.1      haven_2.4.3     
 [5] colorspace_2.0-2 vctrs_0.6.0      generics_0.1.2   htmltools_0.5.2 
 [9] yaml_2.2.1       utf8_1.2.2       rlang_1.1.0      jquerylib_0.1.4 
[13] pillar_1.6.3     withr_2.5.0      glue_1.4.2       DBI_1.1.1       
[17] dbplyr_2.1.1     modelr_0.1.8     readxl_1.3.1     lifecycle_1.0.3 
[21] cellranger_1.1.0 munsell_0.5.0    gtable_0.3.0     rvest_1.0.1     
[25] evaluate_0.14    knitr_1.34       tzdb_0.1.2       fastmap_1.1.0   
[29] fansi_0.5.0      broom_0.7.9      Rcpp_1.0.7       backports_1.2.1 
[33] scales_1.1.1     jsonlite_1.7.2   fs_1.5.0         hms_1.1.1       
[37] digest_0.6.28    stringi_1.7.4    grid_4.1.0       cli_3.6.0       
[41] tools_4.1.0      magrittr_2.0.1   sass_0.4.0       crayon_1.4.1    
[45] pkgconfig_2.0.3  ellipsis_0.3.2   xml2_1.3.2       reprex_2.0.1    
[49] lubridate_1.8.0  assertthat_0.2.1 rmarkdown_2.11   httr_1.4.2      
[53] rstudioapi_0.13  R6_2.5.1         compiler_4.1.0