class: center, middle, inverse, title-slide .title[ # Data Manipulation 2 ] .subtitle[ ##
STA 032: Gateway to data science Lecture 6 ] .author[ ### Jingwei Xiong ] .date[ ### April 14, 2023 ] --- <style type="text/css"> .tiny .remark-code { font-size: 60%; } .small .remark-code { font-size: 80%; } </style> ## Reminders - HW 1 is due April 17th at 12pm - HW 2 will be posted on the course website, due April 26 at 12pm. - Please start the homework as soon as possible. - Discussion will cover homework problems. --- ## Recap -- - Data manipulation tools in `tidyverse` - `select()` - `arrange()` - `slice()` - `filter()` > Remember, before using any tidyverse functions, you need to run `library(tidyverse)` first! --- ## Today - Data manipulation tools, continued - `distinct()`: filter for unique rows - `mutate()`: add new variables - `count()`: create frequency tables - `summarise()`: compute column summaries - `group_by()`: for grouped operations - `pull()`: access column data as a vector or a number - `rename()`: rename an existing column - `inner_join()`, `left_join()`: join together a pair of data frames based on a variable present in both data frames that uniquely identifies the observations - Data visualization introduction --- ## Data: Hotel bookings - Data from two hotels: one resort and one city hotel - Observations: Each row represents a hotel booking - Goal for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled ([Antonio et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5)) ```r hotels <- readr::read_csv("https://raw.githubusercontent.com/xjw1001001/xjw1001001.github.io/main/lecture/Lecture%205/data/hotels.csv") ``` .footnote[ Source: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md) ] --- ## First question: What is in the data set?
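Before printing every column, a quick first check is the size of the data and the column names (a minimal sketch, assuming `hotels` was read in with the `read_csv()` call above):

```r
dim(hotels)    # number of rows and columns: 119,390 x 32
names(hotels)  # the 32 column names
```

`dplyr::glimpse()`, shown below, gives both of these plus a preview of each column.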
.tiny[ ```r dplyr::glimpse(hotels) ``` ``` Rows: 119,390 Columns: 32 $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort~ $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~ $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~ $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201~ $ arrival_date_month <chr> "July", "July", "July", "July", "July",~ $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~ $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~ $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~ $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB~ $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR~ $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat~ $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat~ $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",~ $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",~ $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit~ $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "~ $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",~ $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ customer_type <chr> "Transient", "Transient", "Transient", ~ $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~ $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~ $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", ~ $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20~ ``` ] --- ## `distinct()` to filter for unique rows .small[ .pull-left[ ```r hotels %>% distinct(market_segment) ``` ``` # A tibble: 8 x 1 market_segment <chr> 1 Direct 2 Corporate 3 Online TA 4 Offline TA/TO 5 Complementary 6 Groups 7 Undefined 8 Aviation ``` ] .pull-right[ Recall: `arrange()` to order alphabetically ```r hotels %>% distinct(market_segment) %>% arrange(market_segment) ``` ``` # A tibble: 8 x 1 market_segment <chr> 1 Aviation 2 Complementary 3 Corporate 4 Direct 5 Groups 6 Offline TA/TO 7 Online TA 8 Undefined ``` ] ] --- #### `distinct()` using more than one variable ```r hotels %>% * distinct(hotel, market_segment) %>% arrange(hotel, market_segment) ``` ``` # A tibble: 14 x 2 hotel market_segment <chr> <chr> 1 City Hotel Aviation 2 City Hotel Complementary 3 City Hotel Corporate 4 City Hotel Direct 5 City Hotel Groups 6 City Hotel Offline TA/TO 7 City Hotel Online TA 8 City Hotel Undefined 9 Resort Hotel Complementary 10 Resort Hotel Corporate 11 Resort Hotel Direct 12 Resort Hotel Groups 13 Resort Hotel Offline TA/TO 14 Resort Hotel Online TA ``` > `distinct()` is useful when you want to extract only the unique combinations of one or more columns in a data frame, and remove duplicate rows.
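For comparison, here is a base R sketch of the same idea (assuming `hotels` is loaded): `unique()` keeps the distinct rows of the selected columns, and `duplicated()` flags repeated rows.

```r
# Base R counterparts of distinct() (a sketch)
unique(hotels["market_segment"])                  # like distinct(market_segment)
unique(hotels[, c("hotel", "market_segment")])    # like distinct(hotel, market_segment)

# Same idea with duplicated(): keep the first occurrence of each combination
combos <- hotels[, c("hotel", "market_segment")]
combos[!duplicated(combos), ]
```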
--- ## `mutate()` to add a new variable ```r hotels %>% mutate(little_ones = children + babies) %>% select(children, babies, little_ones) %>% arrange(desc(little_ones)) ``` ``` # A tibble: 119,390 x 3 children babies little_ones <dbl> <dbl> <dbl> 1 10 0 10 2 0 10 10 3 0 9 9 4 2 1 3 5 2 1 3 6 2 1 3 7 3 0 3 8 2 1 3 9 2 1 3 10 3 0 3 # i 119,380 more rows ``` <small>What are these functions doing? How would you do the same in base R?</small> > Remember vector arithmetic? We can do similar things in homework 1 using `mutate()` --- > Remember vector arithmetic? We can do similar things in homework 1 using `mutate()` .panelset[ .panel[.panel-name[HW1, Problem 4.1] ```r temp <- c(35, 88, 42, 84, 81, 30) city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto") city_temps <- data.frame(name = city, temperature = temp) city_temps %>% mutate(Celsius_temp = 5/9 * (temperature - 32)) ``` ``` name temperature Celsius_temp 1 Beijing 35 1.666667 2 Lagos 88 31.111111 3 Paris 42 5.555556 4 Rio de Janeiro 84 28.888889 5 San Juan 81 27.222222 6 Toronto 30 -1.111111 ``` ] .panel[.panel-name[HW1, Problem 5.1] ```r library(dslabs) data(murders) murders %>% mutate(rate = total/population * 100000) %>% head() ``` ``` state abb region population total rate 1 Alabama AL South 4779736 135 2.824424 2 Alaska AK West 710231 19 2.675186 3 Arizona AZ West 6392017 232 3.629527 4 Arkansas AR South 2915918 93 3.189390 5 California CA West 37253956 1257 3.374138 6 Colorado CO West 5029196 65 1.292453 ``` ] .panel[.panel-name[HW1, Problem 7.5] ```r murders %>% mutate(rank = rank(population)) %>% arrange(desc(rank)) %>% head() ``` ``` state abb region population total rank 1 California CA West 37253956 1257 51 2 Texas TX South 25145561 805 50 3 Florida FL South 19687653 669 49 4 New York NY Northeast 19378102 517 48 5 Illinois IL North Central 12830632 364 47 6 Pennsylvania PA Northeast 12702379 457 46 ``` ]] --- ## `count()` to create frequency tables .pull-left[ ```r # alphabetical order by default hotels %>% * count(market_segment) ``` ``` # A tibble: 8 x 2 market_segment n <chr> <int> 1 Aviation 237 2 Complementary 743 3 Corporate 5295 4 Direct 12606 5 Groups 19811 6 Offline TA/TO 24219 7 Online TA 56477 8 Undefined 2 ``` ] .pull-right[ ```r # descending frequency order hotels %>% count(market_segment, * sort = TRUE) ``` ``` # A tibble: 8 x 2 market_segment n <chr> <int> 1 Online TA 56477 2 Offline TA/TO 24219 3 Groups 19811 4 Direct 12606 5 Corporate 5295 6 Complementary 743 7 Aviation 237 8 Undefined 2 ``` ] - Base R version: `table()` --- ## `count()` and `arrange()` .pull-left[ ```r # ascending frequency order hotels %>% count(market_segment) %>% * arrange(n) ``` ``` # A tibble: 8 x 2 market_segment n <chr> <int> 1 Undefined 2 2 Aviation 237 3 Complementary 743 4 Corporate 5295 5 Direct 12606 6 Groups 19811 7 Offline TA/TO 24219 8 Online TA 56477 ``` ] .pull-right[ ```r # descending frequency order # just like adding sort = TRUE hotels %>% count(market_segment) %>% * arrange(desc(n)) ``` ``` # A tibble: 8 x 2 market_segment n <chr> <int> 1 Online TA 56477 2 Offline TA/TO 24219 3 Groups 19811 4 Direct 12606 5 Corporate 5295 6 Complementary 743 7 Aviation 237 8 Undefined 2 ``` ] --- ## `count()` for multiple variables ```r hotels %>% count(hotel, market_segment) ``` ``` # A tibble: 14 x 3 hotel market_segment n <chr> <chr> <int> 1 City Hotel Aviation 237 2 City Hotel Complementary 542 3 City Hotel Corporate 2986 4 City Hotel Direct 6093 5 City Hotel Groups 13975 6 City Hotel Offline TA/TO 16747 7 City Hotel
Online TA 38748 8 City Hotel Undefined 2 9 Resort Hotel Complementary 201 10 Resort Hotel Corporate 2309 11 Resort Hotel Direct 6513 12 Resort Hotel Groups 5836 13 Resort Hotel Offline TA/TO 7472 14 Resort Hotel Online TA 17729 ``` --- ## Order affects output when you `count()` .small[ .pull-left[ ```r # hotel type first hotels %>% * count(hotel, market_segment) ``` ``` # A tibble: 14 x 3 hotel market_segment n <chr> <chr> <int> 1 City Hotel Aviation 237 2 City Hotel Complementary 542 3 City Hotel Corporate 2986 4 City Hotel Direct 6093 5 City Hotel Groups 13975 6 City Hotel Offline TA/TO 16747 7 City Hotel Online TA 38748 8 City Hotel Undefined 2 9 Resort Hotel Complementary 201 10 Resort Hotel Corporate 2309 11 Resort Hotel Direct 6513 12 Resort Hotel Groups 5836 13 Resort Hotel Offline TA/TO 7472 14 Resort Hotel Online TA 17729 ``` ] .pull-right[ ```r # market segment first hotels %>% * count(market_segment, hotel) ``` ``` # A tibble: 14 x 3 market_segment hotel n <chr> <chr> <int> 1 Aviation City Hotel 237 2 Complementary City Hotel 542 3 Complementary Resort Hotel 201 4 Corporate City Hotel 2986 5 Corporate Resort Hotel 2309 6 Direct City Hotel 6093 7 Direct Resort Hotel 6513 8 Groups City Hotel 13975 9 Groups Resort Hotel 5836 10 Offline TA/TO City Hotel 16747 11 Offline TA/TO Resort Hotel 7472 12 Online TA City Hotel 38748 13 Online TA Resort Hotel 17729 14 Undefined City Hotel 2 ``` ] ] --- ## `summarize()` for summary stats ```r # mean average daily rate for all bookings hotels %>% summarize(mean_adr = mean(adr)) ``` ``` # A tibble: 1 x 1 mean_adr <dbl> 1 102. ``` - `summarize()` changes the data frame entirely - Rows are collapsed into a single summary statistic - Columns that are irrelevant to the calculation are removed ??? summarize() function is used for calculating summary statistics. We show an example of using summarize() to calculate the mean average daily rate for all bookings in the hotels data frame. One important thing to note about summarize() is that it changes the data frame entirely. Rows are collapsed into a single summary statistic, and columns that are irrelevant to the calculation are removed. This can be useful when you want to quickly calculate a summary statistic, but it's important to keep in mind that the resulting data frame will have a different structure than the original. --- ## `summarize()` is often used with `group_by()` - For grouped operations - There are two types of `hotel`, city and resort hotels - We want the mean daily rate for bookings at city vs. resort hotels ```r hotels %>% group_by(hotel) %>% summarize(mean_adr = mean(adr)) ``` ``` # A tibble: 2 x 2 hotel mean_adr <chr> <dbl> 1 City Hotel 105. 2 Resort Hotel 95.0 ``` - `group_by()` can be used with more than one group ??? Here is the common use case of combining summarize() with group_by() to perform grouped operations. We use the example of a dataset containing two types of hotels - city and resort - and show how we can use group_by() and summarize() to calculate the mean daily rate for bookings at each type of hotel. group_by() is used to group the data by the hotel column, and summarize() is used to calculate the mean average daily rate for each group. This results in a data frame with two rows, one for each type of hotel, and the mean average daily rate for each group. It's important to note that group_by() can be used with more than one group, allowing you to perform more complex grouped operations. 
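As a small extra illustration, grouping by more than one variable works the same way (a sketch, assuming `hotels` is loaded; `.groups = "drop"` simply removes the grouping from the result):

```r
# Mean daily rate by hotel type and market segment
hotels %>%
  group_by(hotel, market_segment) %>%
  summarize(mean_adr = mean(adr), .groups = "drop")
```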
--- ## Multiple summary statistics `summarize()` can be used to compute multiple summary statistics at once. ```r hotels %>% summarize( n = n(), # frequencies min_adr = min(adr), mean_adr = mean(adr), median_adr = median(adr), max_adr = max(adr) ) ``` ``` # A tibble: 1 x 5 n min_adr mean_adr median_adr max_adr <int> <dbl> <dbl> <dbl> <dbl> 1 119390 -6.38 102. 94.6 5400 ``` --- ### pull(): access column data as a vector or a number The result of `summarize()` is a data frame (or tibble), not a vector or a number. ```r # mean average daily rate for all bookings hotels %>% summarize(mean_adr = mean(adr)) ``` ``` # A tibble: 1 x 1 mean_adr <dbl> 1 102. ``` If we want to access the number itself, we can use `pull()`: ```r hotels %>% summarize(mean_adr = mean(adr)) %>% pull(mean_adr) ``` ``` [1] 101.8311 ``` --- Another example: ```r hotels %>% group_by(hotel) %>% summarize(mean_adr = mean(adr)) %>% pull(mean_adr) ``` ``` [1] 105.30447 94.95293 ``` This is useful when you want to assign the result of a tidyverse pipeline to a variable. ```r mean_adr <- hotels %>% summarize(mean_adr = mean(adr)) %>% pull(mean_adr) ``` --- ### rename(): rename an existing column The syntax is `rename(new_name = old_name)`. Here we rename the `hotel` column to `hotel_name`. ```r hotels %>% select(hotel:lead_time) %>% rename(hotel_name = hotel) %>% head() ``` ``` # A tibble: 6 x 3 hotel_name is_canceled lead_time <chr> <dbl> <dbl> 1 Resort Hotel 0 342 2 Resort Hotel 0 737 3 Resort Hotel 0 7 4 Resort Hotel 0 13 5 Resort Hotel 0 14 6 Resort Hotel 0 14 ``` --- ### `join` family dplyr has a powerful family of join operations, which join together a pair of data frames based on a variable or set of variables present in both data frames that uniquely identifies the observations. These variables are called keys. * inner_join: Keeps only the rows whose keys are present in both datasets and joins them together. * left_join: Keeps all the rows from the first dataset, whether or not their keys appear in the second, and attaches the matching rows of the second. * right_join: Keeps all the rows from the second dataset, whether or not their keys appear in the first, and attaches the matching rows of the first. * full_join: Keeps all rows from both datasets. Rows without matching keys will have NA values for the variables from the other dataset. * Syntax: To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x\$a to y\$b. --- ### `join` family To practice with the join functions, we can use a couple of built-in R datasets.
.panelset[ .panel[.panel-name[Dataset] ```r data(band_instruments2) head(band_instruments2) ``` ``` # A tibble: 3 x 2 artist plays <chr> <chr> 1 John guitar 2 Paul bass 3 Keith guitar ``` ```r data(band_members) head(band_members) ``` ``` # A tibble: 3 x 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles 3 Paul Beatles ``` ] .panel[.panel-name[Inner join] ```r # Inner join band_members %>% inner_join(band_instruments2, by = c("name" = "artist")) ``` ``` # A tibble: 2 x 3 name band plays <chr> <chr> <chr> 1 John Beatles guitar 2 Paul Beatles bass ``` ] .panel[.panel-name[Left join] ```r # Left join band_members %>% left_join(band_instruments2, by = c("name" = "artist")) ``` ``` # A tibble: 3 x 3 name band plays <chr> <chr> <chr> 1 Mick Stones <NA> 2 John Beatles guitar 3 Paul Beatles bass ``` ] .panel[.panel-name[Right join] ```r # Right join band_members %>% right_join(band_instruments2, by = c("name" = "artist")) ``` ``` # A tibble: 3 x 3 name band plays <chr> <chr> <chr> 1 John Beatles guitar 2 Paul Beatles bass 3 Keith <NA> guitar ``` ] .panel[.panel-name[Full join] ```r # Full join band_members %>% full_join(band_instruments2, by = c("name" = "artist")) ``` ``` # A tibble: 4 x 3 name band plays <chr> <chr> <chr> 1 Mick Stones <NA> 2 John Beatles guitar 3 Paul Beatles bass 4 Keith <NA> guitar ``` ] ] --- ## Introduction to data visualization * Why do we need data visualization? ```r library(dslabs) data(murders) head(murders) ``` ``` state abb region population total 1 Alabama AL South 4779736 135 2 Alaska AK West 710231 19 3 Arizona AZ West 6392017 232 4 Arkansas AR South 2915918 93 5 California CA West 37253956 1257 6 Colorado CO West 5029196 65 ``` * How is each variable distributed? * How can we identify patterns or relationships between variables? ??? Why do we need data visualization? Looking at the numbers and character strings that define a dataset is rarely useful. To convince yourself, print and stare at the US murders data table: What do you learn from looking at this table? How quickly can you determine which states have the largest populations? Which states have the smallest? Is there a relationship between population size and total murders? How do murder rates vary across regions of the country? For most human brains, it is quite difficult to extract this information just by looking at the numbers. --- In contrast, the answers to all the questions above are readily available from examining this plot: .panelset[ .panel[.panel-name[Picture] <img src="lecture6_files/figure-html/ggplot-example-plot-0-1.png" width="504" /> ] .panel[.panel-name[Code] ```r library(tidyverse) library(ggthemes) library(ggrepel) library(ggplot2) r <- murders |> summarize(pop=sum(population), tot=sum(total)) |> mutate(rate = tot/pop*10^6) |> pull(rate) murders |> ggplot(aes(x = population/10^6, y = total, label = abb)) + geom_abline(intercept = log10(r), lty=2, col="darkgrey") + geom_point(aes(color=region), size = 3) + geom_text_repel() + scale_x_log10() + scale_y_log10() + xlab("Populations in millions (log scale)") + ylab("Total number of murders (log scale)") + ggtitle("US Gun Murders in 2010") + scale_color_discrete(name="Region") + theme_economist() ``` ] .panel[.panel-name[explanation] Each state in the dataset was identified in this plot as a colored point with a label next to it. The total number of murders is shown on the y axis in log scale, and the populations are shown on the x axis in millions.
The state name is indicated by the text label next to the points, and the color designates the state region. The average murder rate for the entire US was added as a dashed gray line. ]] ??? Each state in the dataset was identified in this plot as a colored point with a label next to it. The total number of murders is shown on the y axis in log scale, and the populations are shown on the x axis in millions. The state name is indicated by the text label next to the points, and the color designates the state region. The average murder rate for the entire US was added as a dashed gray line. This picture is much more informative than the dataset itself. --- ### Why do we need data visualization? We are reminded of the saying **"a picture is worth a thousand words"**. Data visualization provides a powerful way to communicate a data-driven finding. Data visualization is the strongest tool of what we call _exploratory data analysis_ (EDA). [John W. Tukey](https://en.wikipedia.org/wiki/John_Tukey), considered the father of EDA, once said, >> "The greatest value of a picture is when it forces us to notice what we never expected to see." Many widely used data analysis tools were initiated by discoveries made via EDA. EDA is perhaps the most important part of data analysis, yet it is one that is often overlooked. The growing availability of informative datasets and software tools has led to increased reliance on **data visualizations** across many industries, academia, and government. --- ### Benefits of data visualization: **Communication**: Data visualization provides a powerful way to communicate complex information to both technical and non-technical audiences. **Exploration**: Data visualization allows us to explore data and identify patterns or trends that may not be apparent from numerical summaries alone. **Identification of errors or outliers**: Data visualization can help us identify potential errors or outliers in our data that may impact our analysis. **Hypothesis generation**: Data visualization can help generate new hypotheses or questions for further investigation.
--- ## Another example .panelset[ .panel[.panel-name[Picture] <img src="lecture6_files/figure-html/wsj-vaccines-example-1.png" width="100%" /> ] .panel[.panel-name[Code] .tiny[ ```r #knitr::include_graphics(file.path(img_path,"wsj-vaccines.png")) data(us_contagious_diseases) the_disease <- "Measles" dat <- us_contagious_diseases |> filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) |> mutate(rate = count / population * 100000 * 52 / weeks_reporting) |> mutate(state = reorder(state, rate)) jet.colors <- colorRampPalette(c("#F0FFFF", "cyan", "#007FFF", "yellow", "#FFBF00", "orange", "red", "#7F0000"), bias = 2.25) the_breaks <- seq(0, 4000, 1000) dat |> ggplot(aes(year, state, fill = rate)) + geom_tile(color = "white", size=0.35) + scale_x_continuous(expand=c(0,0)) + scale_fill_gradientn(colors = jet.colors(16), na.value = 'white', breaks = the_breaks, labels = paste0(round(the_breaks/1000),"k"), limits = range(the_breaks), name = "") + geom_vline(xintercept=1963, col = "black") + theme_minimal() + theme(panel.grid = element_blank()) + coord_cartesian(clip = 'off') + ggtitle(the_disease) + ylab("") + xlab("") + theme(legend.position = "bottom", text = element_text(size = 8)) + annotate(geom = "text", x = 1963, y = 50.5, label = "Vaccine introduced", size = 3, hjust=0) ``` ]] .panel[.panel-name[explanation] A particularly effective example is a [Wall Street Journal article](http://graphics.wsj.com/infectious-diseases-and-vaccines/?mc_cid=711ddeb86e) showing data related to the impact of vaccines on battling infectious diseases. One of the graphs shows measles cases by US state through the years with a vertical line demonstrating when the vaccine was introduced. .tiny[The plot shows the incidence rate of Measles in US states over time (years on the x-axis), represented by colored tiles for each state (on the y-axis). The incidence rate is calculated as the number of cases per 100,000 population per week, averaged over 52 weeks and adjusted for the number of weeks reporting data. States are sorted by their incidence rates, from lowest to highest, and are colored according to a gradient color scale, ranging from blue (low incidence rates) to red (high incidence rates). The plot includes a vertical line indicating the year when the Measles vaccine was introduced (1963). The plot is useful for visualizing how Measles incidence rates varied across US states over time, and how the introduction of the vaccine impacted the incidence rates.] ]] ??? A particularly effective example is a [Wall Street Journal article](http://graphics.wsj.com/infectious-diseases-and-vaccines/?mc_cid=711ddeb86e) showing data related to the impact of vaccines on battling infectious diseases. The plot shows the incidence rate of Measles in US states over time (years on the x-axis), represented by colored tiles for each state (on the y-axis). The incidence rate is calculated as the number of cases per 100,000 population per week, averaged over 52 weeks and adjusted for the number of weeks reporting data. States are sorted by their incidence rates, from lowest to highest, and are colored according to a gradient color scale, ranging from blue (low incidence rates) to red (high incidence rates). The plot includes a vertical line indicating the year when the Measles vaccine was introduced (1963). The plot is useful for visualizing how Measles incidence rates varied across US states over time, and how the introduction of the vaccine impacted the incidence rates. 
--- In the talk [New Insights on Poverty](https://www.ted.com/talks/hans_rosling_reveals_new_insights_on_poverty?language=en), Hans Rosling forces us to notice the unexpected with a series of plots related to world health and economics. <!-- --> ??? The plot shows a scatterplot of life expectancy on the y-axis against fertility rate on the x-axis, with different colors and sizes of points representing different regions of the world (The West, East Asia, Latin America, Sub-Saharan Africa, and Others) and population sizes, respectively. The plot is animated over time (1962-2013), showing how the relationship between life expectancy and fertility rate changes over time for each region. The plot is useful for visualizing how life expectancy and fertility rate have changed over time for different regions of the world, and how different regions compare to one another. --- ## Data visualization using `ggplot2` .panelset[ .panel[.panel-name[Slide] .pull-left[ <img src="img/ggplot2-part-of-tidyverse.png" width="80%" /> ] .pull-right[ - ggplot2 is the tidyverse's data visualization package - it lets you create relatively **complex** and **aesthetically pleasing** plots - its syntax is **intuitive** and comparatively easy to remember. - `gg` in "ggplot2" stands for Grammar of Graphics - Inspired by the book Grammar of Graphics by Leland Wilkinson ]] .panel[.panel-name[Words 1] Throughout the lecture, we will be creating plots using the __ggplot2__^[https://ggplot2.tidyverse.org/] package. Many other approaches are available for creating plots in R. We chose to use __ggplot2__ because it breaks plots into components in a way that permits beginners to create relatively **complex** and **aesthetically pleasing** plots using syntax that is **intuitive** and comparatively easy to remember. One reason __ggplot2__ is generally more intuitive for beginners is that it uses a grammar of graphics^[http://www.springer.com/us/book/9780387245447], the _gg_ in __ggplot2__. This is analogous to the way learning grammar can help a beginner construct hundreds of different sentences by learning just a handful of verbs, nouns and adjectives without having to memorize each specific sentence. Similarly, by learning a handful of __ggplot2__ building blocks and its grammar, you will be able to create hundreds of different plots. ] .panel[.panel-name[Words 2] Another reason __ggplot2__ is easy for beginners is that it is possible to create informative and elegant graphs with relatively simple and readable code. To use __ggplot2__ you will have to learn several functions and arguments. These commands may be hard to memorize, but you can always return to this tutorial and grab the code you want. Or you can simply perform an internet search for [ggplot2 cheat sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf). ]] ??? Throughout the lecture, we will be creating plots using the __ggplot2__^[https://ggplot2.tidyverse.org/] package. Many other approaches are available for creating plots in R. We chose to use __ggplot2__ because it breaks plots into components in a way that permits beginners to create relatively **complex** and **aesthetically pleasing** plots using syntax that is **intuitive** and comparatively easy to remember. One reason __ggplot2__ is generally more intuitive for beginners is that it uses a grammar of graphics^[http://www.springer.com/us/book/9780387245447], the _gg_ in __ggplot2__.
This is analogous to the way learning grammar can help a beginner construct hundreds of different sentences by learning just a handful of verbs, nouns and adjectives without having to memorize each specific sentence. Similarly, by learning a handful of __ggplot2__ building blocks and its grammar, you will be able to create hundreds of different plots. Another reason __ggplot2__ is easy for beginners is that it is possible to create informative and elegant graphs with relatively simple and readable code. To use __ggplot2__ you will have to learn several functions and arguments. These commands may be hard to memorize, but you can always return to this tutorial and grab the code you want. Or you can simply perform an internet search for [ggplot2 cheat sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf). --- ## Next week: Learn how to use `ggplot2` to generate the first example! --- # Readings - [Chapter 4: The tidyverse](http://rafalab.dfci.harvard.edu/dsbook/tidyverse.html) - [Data Wrangling with Tidyverse](https://hbctraining.github.io/Intro-to-R/lessons/tidyverse_data_wrangling.html) - [Chapter 7: Introduction to data visualization](http://rafalab.dfci.harvard.edu/dsbook/introduction-to-data-visualization.html)