Upload a PDF file, named with your UC Davis email ID and homework
number (e.g., xjw18_hw3.pdf
), to Gradescope (accessible
through Canvas). You will give the commands to answer each question in
its own code block, which will also produce output that will be
automatically embedded in the output file. When asked, answer must be
supported by written statements as well as any code used.
All code used to produce your results must be shown in your PDF
file (e.g., do not use echo = FALSE
or
include = FALSE
as options anywhere). Rmd
files do not need to be submitted, but may be requested by the TA and
must be available when the assignment is submitted.
Students may choose to collaborate with each other on the homework, but must clearly indicate with whom they collaborated. Every student must upload their own submission.
Start to work on it as early as possible. Finishing this homework can help prepare midterm 1.
When you want to show your result as a vector that is too long,
slice the first 10 objects. When you want to show your result as a data
frame, use head()
on it. Failure to do so may lead to point
deduction.
Directly knit the Rmd file will give you an html file. Open that file in your browser and then you can print it into a PDF file.
Some Level 2 problems will show in the level 1 parts because they share the same background, but please type your solution in Level 2 parts!!
If you have not installed these packages, please install them by uncommenting the code. You only need to install the package once.
#install.packages("ggplot2")
#install.packages("tidyverse")
#install.packages("ggthemes")
#install.packages("ggrepel")
#install.packages("ggwordcloud")
Loading the required pacakges.
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(ggwordcloud)
State True or False for each of the statements. Then provide your reason.
For a sample dataset, it is not possible for the first quartile to equal the median.
If the sample correlation for variable X and Y is zero, then the variable X and Y does not have any relationship.
The standard deviation must always be larger than the mean.
Outliers have a strong influence on the standard deviation of a dataset.
The salaries of all persons employed by a large university is left skewed.
The amounts of the time spent by students on a difficult examination is left skewed.
We use R to calculate the mean and variance of the student
heights in the dataset heights
. Mean is
mean(heights$height)=68.32
and variance is
var(heights$height)=16.64
. So the population mean is 68.32
and population variance is 16.64.
Here we load the baby names data for you. This dataset comprising counts of the given names of babies born in California each year from 1990 - 2018.
myfile <- "https://raw.githubusercontent.com/xjw1001001/xjw1001001.github.io/main/data/homework%20data/baby_names.csv"
data = read.csv(myfile)
head(data)
State Sex Year Name Count
1 CA F 1990 Jessica 6635
2 CA F 1990 Ashley 4537
3 CA F 1990 Stephanie 4001
4 CA F 1990 Amanda 3856
5 CA F 1990 Jennifer 3611
6 CA F 1990 Elizabeth 3170
For this problem, use the dataframe output or function output is
enough. No need to answer in sentences. If the problem ask you of
several operations, use pipes and only output the last result. If the
resulting dataframe has more than 6 rows, use head()
to
show the first 6 rows.
1. What’s the dimension of baby_names
? How many
observations (rows) are Male and how many observations (rows) are
Female?
2. Count the number of occurrences of each distinct year and then sort the result by year.
3. Select Count, Sex, Name, and Year columns, then use filter to retrieve rows of girl names in 2010 that have larger than 3000 counts.
4. Use summarize to compute the maximum count and average count of
names given for each year in 1995 to 2000, You may need to filter first,
and then pipe into knitr::kable
function to generate table
output. Use 3 digits in the table.
5. (Level 2) Display the (first) most common boy and girl names and their counts in each of the years 2005-2010.
Hint: slice_max()
Can help you select the row with
highest count. So you need pipe slice_max(Count)
after
group_by
.
Answer it in Level 2 parts, not here
The dataset diamonds in ggplot2
package contains the
prices and other attributes of almost 54,000 diamonds.
head(diamonds)
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
1. Use histogram to explore the distribution of carat
.
Choose a reasonable large binwidth (around 0.5) and another reasonable
small binwidth (around 0.05). What do you find comparing these two
plots? Do you discover anything unusual or surprising?
Instruction:
ggplot(?, aes(x = ?)) +
geom_?(?) +
labs(x = ?, y = ?, title = ?)
2. What’s the sample correlation between the price
and
carat
? Hint: You can access the vector using
$
.
3. Produce a scatter plot and a hex plot between price and carat. You
can try geom_hex(bins=25)
or another value. Why hex plot is
better here?
Instruction:
ggplot(?, aes(x = ?, y = ?)) +
geom_?(?) +
labs(x = ?, y = ?, title = ?)
4. Create a horizontal boxplot to visualize the distribution of price
by cut in the diamonds dataset. Choose appropriate x, y labels and
title. Use fill = "lightblue", color = "blue"
in
geom_boxplot()
. What do you find in this plot?
Instruction:
ggplot(?, aes(x = ?, y = ?)) +
geom_boxplot(fill = "lightblue", color = "blue) +
labs(x = ?, y = ?, title = ?)
5. Create a density plot of carat facet by color in the diamond dataset. Comment any unusual information you find in the plot.
Instruction:
ggplot(?, aes(x = ?, fill = color)) +
geom_density(? = ?) +
facet_?(?, ? = ?) +
labs(x = ?, y = ?, title = ?)
data
in Problem 2.Display the (first) most common boy and girl names and their counts in each of the years 2005-2010.
Hint: slice_max()
Can help you select the row with
highest count. So you need pipe slice_max(Count)
after
group_by
.
1. Bar plot: We can create a bar plot showing the total count of male vs. female names in the dataset by year from 2008 to 2015.
data %>%
. We need to obtain the
dataset of total count of male and female names for year from 2008 to
2015.group_by(What variables?) %>%
summarise(total=which_function(what_variable) %>%
filter(some_condition) %>%
ggplot(aes(x = ?, y = ?, fill = ?)) +
%>%
or |>
, concat the ggplot layers we
use +
geom_bar(stat = ?, position = ?) +
labs
to change the labels, title2. Line chart: We can create a line plot showing the trend in the number of names: Jessica Emily Isabella Amanda Jennifer for all these years.
names = c("Jessica", "Emily", "Isabella", "Amanda", "Jennifer")
data %>%
We need to obtain the
dataset of count of the selected names for each year.filter(What_condition?) %>%
filter(Sex == "F") %>%
Why do we need to add this
line? What happened if we don’t add this line, can you explain why?ggplot(aes(? = ?, ? = ?, ? = ?)) +
geom_?() +
labs
to change the labels, titlescale_color_viridis_d(option = "H")
3. Word cloud: We can create a word cloud showing the most common names overall. This can give us a visual representation of the most popular names in the dataset.
data %>%
. We need to obtain the total
count of the name, sex pair.group_by(What_variables?) %>%
summarise(Count = sum(?)) %>%
arrange(desc(Count)) %>% head(50) %>%
We need
this because we want only show the result of the 50 most popular
names.ggplot(aes(label = ?, size = ?, color = ?)) +
We want
to show the names, size represent the frequency, color represent the
sex.geom_text_wordcloud() +
theme_minimal()
Reference: ggwordcloud
Names:
sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggwordcloud_0.5.0 ggthemes_4.2.4 lubridate_1.9.2 forcats_1.0.0
[5] stringr_1.5.0 dplyr_1.1.0 purrr_1.0.1 readr_2.1.4
[9] tidyr_1.3.0 tibble_3.2.0 ggplot2_3.4.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 bslib_0.4.2 compiler_4.2.3 pillar_1.8.1
[5] jquerylib_0.1.4 tools_4.2.3 digest_0.6.31 timechange_0.2.0
[9] jsonlite_1.8.4 evaluate_0.20 lifecycle_1.0.3 gtable_0.3.2
[13] png_0.1-8 pkgconfig_2.0.3 rlang_1.1.0 cli_3.6.0
[17] rstudioapi_0.14 yaml_2.3.7 xfun_0.37 fastmap_1.1.1
[21] withr_2.5.0 knitr_1.42 hms_1.1.3 generics_0.1.3
[25] sass_0.4.5 vctrs_0.6.0 grid_4.2.3 tidyselect_1.2.0
[29] glue_1.6.2 R6_2.5.1 fansi_1.0.4 rmarkdown_2.20
[33] tzdb_0.3.0 magrittr_2.0.3 scales_1.2.1 htmltools_0.5.4
[37] colorspace_2.1-0 utf8_1.2.3 stringi_1.7.12 munsell_0.5.0
[41] cachem_1.0.7