class: center, middle, inverse, title-slide .title[ # Data Manipulation 2 ] .subtitle[ ##
STA 032: Gateway to data science Lecture 6 ] .author[ ### Jingwei Xiong ] .date[ ### April 14, 2023 ] --- <style type="text/css"> .tiny .remark-code { font-size: 60%; } .small .remark-code { font-size: 80%; } </style> ## Reminders - HW 1 is due April 17th at 12pm - HW 2 will be posted on the course website, due April 26 at 12pm. - Please start the homework as soon as possible. - Discussion will cover homework problems. --- ## Recap -- - Data manipulation tools in `tidyverse` - `select()` - `arrange()` - `slice()` - `filter()` > Remember, before using any tidyverse functions, you need to run `library(tidyverse)` first! --- ## Today - Data manipulation tools, continued - `distinct()`: filter for unique rows - `mutate()`: add new variables - `count()`: create frequency tables - `summarise()`: compute column summaries - `group_by()`: for grouped operations - `pull()`: access column data as a vector or a number - `rename()`: rename an existing column - `inner_join()`, `left_join()`: join together a pair of data frames based on a variable present in both data frames that uniquely identifies the observations - Data visualization introduction --- ## Data: Hotel bookings - Data from two hotels: one resort and one city hotel - Observations: Each row represents a hotel booking - Goal for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled ([Antonio et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5)) ```r hotels <- readr::read_csv("https://raw.githubusercontent.com/xjw1001001/xjw1001001.github.io/main/lecture/Lecture%205/data/hotels.csv") ``` .footnote[ Source: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md) ] --- ## First question: What is in the data set?
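Before printing every column, a quick first check is the size of the data and the column names (a minimal sketch, assuming `hotels` was read in with the `read_csv()` call above):

```r
dim(hotels)    # number of rows and columns: 119,390 x 32
names(hotels)  # the 32 column names
```

`dplyr::glimpse()`, shown below, gives both of these plus a preview of each column.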
.tiny[ ```r dplyr::glimpse(hotels) ``` ``` Rows: 119,390 Columns: 32 $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort~ $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~ $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~ $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201~ $ arrival_date_month <chr> "July", "July", "July", "July", "July",~ $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~ $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~ $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~ $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB~ $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR~ $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat~ $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat~ $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",~ $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",~ $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit~ $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "~ $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",~ $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ customer_type <chr> "Transient", "Transient", "Transient", ~ $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~ $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~ $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~ $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", ~ $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20~ ``` ] --- ## `distinct()` to filter for unique rows .small[ .pull-left[ ```r hotels %>% distinct(market_segment) ``` ``` # A tibble: 8 x 1 market_segment <chr> 1 Direct 2 Corporate 3 Online TA 4 Offline TA/TO 5 Complementary 6 Groups 7 Undefined 8 Aviation ``` ] .pull-right[ Recall: `arrange()` to order alphabetically ```r hotels %>% distinct(market_segment) %>% arrange(market_segment) ``` ``` # A tibble: 8 x 1 market_segment <chr> 1 Aviation 2 Complementary 3 Corporate 4 Direct 5 Groups 6 Offline TA/TO 7 Online TA 8 Undefined ``` ] ] --- #### `distinct()` using more than one variable ```r hotels %>% * distinct(hotel, market_segment) %>% arrange(hotel, market_segment) ``` ``` # A tibble: 14 x 2 hotel market_segment <chr> <chr> 1 City Hotel Aviation 2 City Hotel Complementary 3 City Hotel Corporate 4 City Hotel Direct 5 City Hotel Groups 6 City Hotel Offline TA/TO 7 City Hotel Online TA 8 City Hotel Undefined 9 Resort Hotel Complementary 10 Resort Hotel Corporate 11 Resort Hotel Direct 12 Resort Hotel Groups 13 Resort Hotel Offline TA/TO 14 Resort Hotel Online TA ``` > `distinct()` is useful when you want to extract only the unique combinations of one or more columns in a data frame, and remove duplicate rows.
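For comparison, here is a base R sketch of the same idea (assuming `hotels` is loaded): `unique()` keeps the distinct rows of the selected columns, and `duplicated()` flags repeated rows.

```r
# Base R counterparts of distinct() (a sketch)
unique(hotels["market_segment"])                  # like distinct(market_segment)
unique(hotels[, c("hotel", "market_segment")])    # like distinct(hotel, market_segment)

# Same idea with duplicated(): keep the first occurrence of each combination
combos <- hotels[, c("hotel", "market_segment")]
combos[!duplicated(combos), ]
```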
--- ## `mutate()` to add a new variable ```r hotels %>% mutate(little_ones = children + babies) %>% select(children, babies, little_ones) %>% arrange(desc(little_ones)) ``` ``` # A tibble: 119,390 x 3 children babies little_ones <dbl> <dbl> <dbl> 1 10 0 10 2 0 10 10 3 0 9 9 4 2 1 3 5 2 1 3 6 2 1 3 7 3 0 3 8 2 1 3 9 2 1 3 10 3 0 3 # i 119,380 more rows ``` <small>What are these functions doing? How would you do the same in base R?</small> > Remember vector arithmetic? We can do similar things in homework 1 using `mutate()` --- > Remember vector arithmetic? We can do similar things in homework 1 using `mutate()` .panelset[ .panel[.panel-name[HW1, Problem 4.1] ```r temp <- c(35, 88, 42, 84, 81, 30) city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto") city_temps <- data.frame(name = city, temperature = temp) city_temps %>% mutate(Celsius_temp = 5/9 * (temperature - 32)) ``` ``` name temperature Celsius_temp 1 Beijing 35 1.666667 2 Lagos 88 31.111111 3 Paris 42 5.555556 4 Rio de Janeiro 84 28.888889 5 San Juan 81 27.222222 6 Toronto 30 -1.111111 ``` ] .panel[.panel-name[HW1, Problem 5.1] ```r library(dslabs) data(murders) murders %>% mutate(rate = total/population * 100000) %>% head() ``` ``` state abb region population total rate 1 Alabama AL South 4779736 135 2.824424 2 Alaska AK West 710231 19 2.675186 3 Arizona AZ West 6392017 232 3.629527 4 Arkansas AR South 2915918 93 3.189390 5 California CA West 37253956 1257 3.374138 6 Colorado CO West 5029196 65 1.292453 ``` ] .panel[.panel-name[HW1, Problem 7.5] ```r murders %>% mutate(rank = rank(population)) %>% arrange(desc(rank)) %>% head() ``` ``` state abb region population total rank 1 California CA West 37253956 1257 51 2 Texas TX South 25145561 805 50 3 Florida FL South 19687653 669 49 4 New York NY Northeast 19378102 517 48 5 Illinois IL North Central 12830632 364 47 6 Pennsylvania PA Northeast 12702379 457 46 ``` ]] --- ## `count()` to create frequency tables .pull-left[ ```r # alphabetical order by default hotels %>% * count(market_segment) ``` ``` # A tibble: 8 x 2 market_segment n <chr> <int> 1 Aviation 237 2 Complementary 743 3 Corporate 5295 4 Direct 12606 5 Groups 19811 6 Offline TA/TO 24219 7 Online TA 56477 8 Undefined 2 ``` ] .pull-right[ ```r # descending frequency order hotels %>% count(market_segment, * sort = TRUE) ``` ``` # A tibble: 8 x 2 market_segment n <chr> <int> 1 Online TA 56477 2 Offline TA/TO 24219 3 Groups 19811 4 Direct 12606 5 Corporate 5295 6 Complementary 743 7 Aviation 237 8 Undefined 2 ``` ] - Base R version: `table()` --- ## `count()` and `arrange()` .pull-left[ ```r # ascending frequency order hotels %>% count(market_segment) %>% * arrange(n) ``` ``` # A tibble: 8 x 2 market_segment n <chr> <int> 1 Undefined 2 2 Aviation 237 3 Complementary 743 4 Corporate 5295 5 Direct 12606 6 Groups 19811 7 Offline TA/TO 24219 8 Online TA 56477 ``` ] .pull-right[ ```r # descending frequency order # just like adding sort = TRUE hotels %>% count(market_segment) %>% * arrange(desc(n)) ``` ``` # A tibble: 8 x 2 market_segment n <chr> <int> 1 Online TA 56477 2 Offline TA/TO 24219 3 Groups 19811 4 Direct 12606 5 Corporate 5295 6 Complementary 743 7 Aviation 237 8 Undefined 2 ``` ] --- ## `count()` for multiple variables ```r hotels %>% count(hotel, market_segment) ``` ``` # A tibble: 14 x 3 hotel market_segment n <chr> <chr> <int> 1 City Hotel Aviation 237 2 City Hotel Complementary 542 3 City Hotel Corporate 2986 4 City Hotel Direct 6093 5 City Hotel Groups 13975 6 City Hotel Offline TA/TO 16747 7 City Hotel
Online TA 38748 8 City Hotel Undefined 2 9 Resort Hotel Complementary 201 10 Resort Hotel Corporate 2309 11 Resort Hotel Direct 6513 12 Resort Hotel Groups 5836 13 Resort Hotel Offline TA/TO 7472 14 Resort Hotel Online TA 17729 ``` --- ## Order affects output when you `count()` .small[ .pull-left[ ```r # hotel type first hotels %>% * count(hotel, market_segment) ``` ``` # A tibble: 14 x 3 hotel market_segment n <chr> <chr> <int> 1 City Hotel Aviation 237 2 City Hotel Complementary 542 3 City Hotel Corporate 2986 4 City Hotel Direct 6093 5 City Hotel Groups 13975 6 City Hotel Offline TA/TO 16747 7 City Hotel Online TA 38748 8 City Hotel Undefined 2 9 Resort Hotel Complementary 201 10 Resort Hotel Corporate 2309 11 Resort Hotel Direct 6513 12 Resort Hotel Groups 5836 13 Resort Hotel Offline TA/TO 7472 14 Resort Hotel Online TA 17729 ``` ] .pull-right[ ```r # market segment first hotels %>% * count(market_segment, hotel) ``` ``` # A tibble: 14 x 3 market_segment hotel n <chr> <chr> <int> 1 Aviation City Hotel 237 2 Complementary City Hotel 542 3 Complementary Resort Hotel 201 4 Corporate City Hotel 2986 5 Corporate Resort Hotel 2309 6 Direct City Hotel 6093 7 Direct Resort Hotel 6513 8 Groups City Hotel 13975 9 Groups Resort Hotel 5836 10 Offline TA/TO City Hotel 16747 11 Offline TA/TO Resort Hotel 7472 12 Online TA City Hotel 38748 13 Online TA Resort Hotel 17729 14 Undefined City Hotel 2 ``` ] ] --- ## `summarize()` for summary stats ```r # mean average daily rate for all bookings hotels %>% summarize(mean_adr = mean(adr)) ``` ``` # A tibble: 1 x 1 mean_adr <dbl> 1 102. ``` - `summarize()` changes the data frame entirely - Rows are collapsed into a single summary statistic - Columns that are irrelevant to the calculation are removed ??? summarize() function is used for calculating summary statistics. We show an example of using summarize() to calculate the mean average daily rate for all bookings in the hotels data frame. One important thing to note about summarize() is that it changes the data frame entirely. Rows are collapsed into a single summary statistic, and columns that are irrelevant to the calculation are removed. This can be useful when you want to quickly calculate a summary statistic, but it's important to keep in mind that the resulting data frame will have a different structure than the original. --- ## `summarize()` is often used with `group_by()` - For grouped operations - There are two types of `hotel`, city and resort hotels - We want the mean daily rate for bookings at city vs. resort hotels ```r hotels %>% group_by(hotel) %>% summarize(mean_adr = mean(adr)) ``` ``` # A tibble: 2 x 2 hotel mean_adr <chr> <dbl> 1 City Hotel 105. 2 Resort Hotel 95.0 ``` - `group_by()` can be used with more than one group ??? Here is the common use case of combining summarize() with group_by() to perform grouped operations. We use the example of a dataset containing two types of hotels - city and resort - and show how we can use group_by() and summarize() to calculate the mean daily rate for bookings at each type of hotel. group_by() is used to group the data by the hotel column, and summarize() is used to calculate the mean average daily rate for each group. This results in a data frame with two rows, one for each type of hotel, and the mean average daily rate for each group. It's important to note that group_by() can be used with more than one group, allowing you to perform more complex grouped operations. 
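As a small extra illustration, grouping by more than one variable works the same way (a sketch, assuming `hotels` is loaded; `.groups = "drop"` simply removes the grouping from the result):

```r
# Mean daily rate by hotel type and market segment
hotels %>%
  group_by(hotel, market_segment) %>%
  summarize(mean_adr = mean(adr), .groups = "drop")
```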
--- ## Multiple summary statistics `summarize()` can be used to compute multiple summary statistics at once. ```r hotels %>% summarize( n = n(), # frequencies min_adr = min(adr), mean_adr = mean(adr), median_adr = median(adr), max_adr = max(adr) ) ``` ``` # A tibble: 1 x 5 n min_adr mean_adr median_adr max_adr <int> <dbl> <dbl> <dbl> <dbl> 1 119390 -6.38 102. 94.6 5400 ``` --- ### pull(): access column data as a vector or a number The result of `summarize()` is a data frame (or tibble), not a vector or a number. ```r # mean average daily rate for all bookings hotels %>% summarize(mean_adr = mean(adr)) ``` ``` # A tibble: 1 x 1 mean_adr <dbl> 1 102. ``` If we want to access the number itself, we can use `pull()`: ```r hotels %>% summarize(mean_adr = mean(adr)) %>% pull(mean_adr) ``` ``` [1] 101.8311 ``` --- Another example: ```r hotels %>% group_by(hotel) %>% summarize(mean_adr = mean(adr)) %>% pull(mean_adr) ``` ``` [1] 105.30447 94.95293 ``` This is useful when you want to assign the result of a tidyverse pipeline to a variable. ```r mean_adr <- hotels %>% summarize(mean_adr = mean(adr)) %>% pull(mean_adr) ``` --- ### rename(): rename an existing column The syntax is `rename(new_name = old_name)`. Here we rename the `hotel` column to `hotel_name`. ```r hotels %>% select(hotel:lead_time) %>% rename(hotel_name = hotel) %>% head() ``` ``` # A tibble: 6 x 3 hotel_name is_canceled lead_time <chr> <dbl> <dbl> 1 Resort Hotel 0 342 2 Resort Hotel 0 737 3 Resort Hotel 0 7 4 Resort Hotel 0 13 5 Resort Hotel 0 14 6 Resort Hotel 0 14 ``` --- ### `join` family dplyr has a powerful family of join operations, which join together a pair of data frames based on a variable or set of variables present in both data frames that uniquely identifies the observations. These variables are called keys. * inner_join: Keeps only the rows whose keys are present in both datasets and joins them together. * left_join: Keeps all the rows from the first dataset, whether or not their keys appear in the second, and attaches the matching rows of the second. * right_join: Keeps all the rows from the second dataset, whether or not their keys appear in the first, and attaches the matching rows of the first. * full_join: Keeps all rows from both datasets. Rows without matching keys will have NA values for the variables from the other dataset. * Syntax: To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x\$a to y\$b. --- ### `join` family To practice with the join functions, we can use a couple of built-in R datasets.
.panelset[ .panel[.panel-name[Dataset] ```r data(band_instruments2) head(band_instruments2) ``` ``` # A tibble: 3 x 2 artist plays <chr> <chr> 1 John guitar 2 Paul bass 3 Keith guitar ``` ```r data(band_members) head(band_members) ``` ``` # A tibble: 3 x 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles 3 Paul Beatles ``` ] .panel[.panel-name[Inner join] ```r # Inner join band_members %>% inner_join(band_instruments2, by = c("name" = "artist")) ``` ``` # A tibble: 2 x 3 name band plays <chr> <chr> <chr> 1 John Beatles guitar 2 Paul Beatles bass ``` ] .panel[.panel-name[Left join] ```r # Left join band_members %>% left_join(band_instruments2, by = c("name" = "artist")) ``` ``` # A tibble: 3 x 3 name band plays <chr> <chr> <chr> 1 Mick Stones <NA> 2 John Beatles guitar 3 Paul Beatles bass ``` ] .panel[.panel-name[Right join] ```r # Right join band_members %>% right_join(band_instruments2, by = c("name" = "artist")) ``` ``` # A tibble: 3 x 3 name band plays <chr> <chr> <chr> 1 John Beatles guitar 2 Paul Beatles bass 3 Keith <NA> guitar ``` ] .panel[.panel-name[Full join] ```r # Full join band_members %>% full_join(band_instruments2, by = c("name" = "artist")) ``` ``` # A tibble: 4 x 3 name band plays <chr> <chr> <chr> 1 Mick Stones <NA> 2 John Beatles guitar 3 Paul Beatles bass 4 Keith <NA> guitar ``` ] ] --- ## Introduction to data visualization * Why do we need data visualization? ```r library(dslabs) data(murders) head(murders) ``` ``` state abb region population total 1 Alabama AL South 4779736 135 2 Alaska AK West 710231 19 3 Arizona AZ West 6392017 232 4 Arkansas AR South 2915918 93 5 California CA West 37253956 1257 6 Colorado CO West 5029196 65 ``` * How is each variable distributed? * How can we identify patterns or relationships between variables? ??? Why do we need data visualization? Looking at the numbers and character strings that define a dataset is rarely useful. To convince yourself, print and stare at the US murders data table: What do you learn from looking at this table? How quickly can you determine which states have the largest populations? Which states have the smallest? Is there a relationship between population size and total murders? How do murder rates vary across regions of the country? For most human brains, it is quite difficult to extract this information just by looking at the numbers. --- In contrast, the answers to all the questions above are readily available from examining this plot: .panelset[ .panel[.panel-name[Picture] <img src="lecture6_files/figure-html/ggplot-example-plot-0-1.png" width="504" /> ] .panel[.panel-name[Code] ```r library(tidyverse) library(ggthemes) library(ggrepel) library(ggplot2) r <- murders |> summarize(pop=sum(population), tot=sum(total)) |> mutate(rate = tot/pop*10^6) |> pull(rate) murders |> ggplot(aes(x = population/10^6, y = total, label = abb)) + geom_abline(intercept = log10(r), lty=2, col="darkgrey") + geom_point(aes(color=region), size = 3) + geom_text_repel() + scale_x_log10() + scale_y_log10() + xlab("Populations in millions (log scale)") + ylab("Total number of murders (log scale)") + ggtitle("US Gun Murders in 2010") + scale_color_discrete(name="Region") + theme_economist() ``` ] .panel[.panel-name[explanation] Each state in the dataset was identified in this plot as a colored point with a label next to it. The total number of murders is shown on the y axis in log scale, and the populations are shown on the x axis in millions.
The state name is indicated by the text label next to the points, and the color designates the state region. The average murder rate for the entire US was added as a dashed gray line. ]] ??? Each state in the dataset was identified in this plot as a colored point with a label next to it. The total number of murders is shown on the y axis in log scale, and the populations are shown on the x axis in millions. The state name is indicated by the text label next to the points, and the color designates the state region. The average murder rate for the entire US was added as a dashed gray line. This picture is much more informative than the dataset itself. --- ### Why do we need data visualization? We are reminded of the saying **"a picture is worth a thousand words"**. Data visualization provides a powerful way to communicate a data-driven finding. Data visualization is the strongest tool of what we call _exploratory data analysis_ (EDA). [John W. Tukey](https://en.wikipedia.org/wiki/John_Tukey), considered the father of EDA, once said, >> "The greatest value of a picture is when it forces us to notice what we never expected to see." Many widely used data analysis tools were initiated by discoveries made via EDA. EDA is perhaps the most important part of data analysis, yet it is one that is often overlooked. The growing availability of informative datasets and software tools has led to increased reliance on **data visualizations** across many industries, academia, and government. --- ### Benefits of data visualization: **Communication**: Data visualization provides a powerful way to communicate complex information to both technical and non-technical audiences. **Exploration**: Data visualization allows us to explore data and identify patterns or trends that may not be apparent from numerical summaries alone. **Identification of errors or outliers**: Data visualization can help us identify potential errors or outliers in our data that may impact our analysis. **Hypothesis generation**: Data visualization can help generate new hypotheses or questions for further investigation.
--- ## Another example .panelset[ .panel[.panel-name[Picture] <img src="lecture6_files/figure-html/wsj-vaccines-example-1.png" width="100%" /> ] .panel[.panel-name[Code] .tiny[ ```r #knitr::include_graphics(file.path(img_path,"wsj-vaccines.png")) data(us_contagious_diseases) the_disease <- "Measles" dat <- us_contagious_diseases |> filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) |> mutate(rate = count / population * 100000 * 52 / weeks_reporting) |> mutate(state = reorder(state, rate)) jet.colors <- colorRampPalette(c("#F0FFFF", "cyan", "#007FFF", "yellow", "#FFBF00", "orange", "red", "#7F0000"), bias = 2.25) the_breaks <- seq(0, 4000, 1000) dat |> ggplot(aes(year, state, fill = rate)) + geom_tile(color = "white", size=0.35) + scale_x_continuous(expand=c(0,0)) + scale_fill_gradientn(colors = jet.colors(16), na.value = 'white', breaks = the_breaks, labels = paste0(round(the_breaks/1000),"k"), limits = range(the_breaks), name = "") + geom_vline(xintercept=1963, col = "black") + theme_minimal() + theme(panel.grid = element_blank()) + coord_cartesian(clip = 'off') + ggtitle(the_disease) + ylab("") + xlab("") + theme(legend.position = "bottom", text = element_text(size = 8)) + annotate(geom = "text", x = 1963, y = 50.5, label = "Vaccine introduced", size = 3, hjust=0) ``` ]] .panel[.panel-name[explanation] A particularly effective example is a [Wall Street Journal article](http://graphics.wsj.com/infectious-diseases-and-vaccines/?mc_cid=711ddeb86e) showing data related to the impact of vaccines on battling infectious diseases. One of the graphs shows measles cases by US state through the years with a vertical line demonstrating when the vaccine was introduced. .tiny[The plot shows the incidence rate of Measles in US states over time (years on the x-axis), represented by colored tiles for each state (on the y-axis). The incidence rate is calculated as the number of cases per 100,000 population per week, averaged over 52 weeks and adjusted for the number of weeks reporting data. States are sorted by their incidence rates, from lowest to highest, and are colored according to a gradient color scale, ranging from blue (low incidence rates) to red (high incidence rates). The plot includes a vertical line indicating the year when the Measles vaccine was introduced (1963). The plot is useful for visualizing how Measles incidence rates varied across US states over time, and how the introduction of the vaccine impacted the incidence rates.] ]] ??? A particularly effective example is a [Wall Street Journal article](http://graphics.wsj.com/infectious-diseases-and-vaccines/?mc_cid=711ddeb86e) showing data related to the impact of vaccines on battling infectious diseases. The plot shows the incidence rate of Measles in US states over time (years on the x-axis), represented by colored tiles for each state (on the y-axis). The incidence rate is calculated as the number of cases per 100,000 population per week, averaged over 52 weeks and adjusted for the number of weeks reporting data. States are sorted by their incidence rates, from lowest to highest, and are colored according to a gradient color scale, ranging from blue (low incidence rates) to red (high incidence rates). The plot includes a vertical line indicating the year when the Measles vaccine was introduced (1963). The plot is useful for visualizing how Measles incidence rates varied across US states over time, and how the introduction of the vaccine impacted the incidence rates. 
--- In the talk [New Insights on Poverty](https://www.ted.com/talks/hans_rosling_reveals_new_insights_on_poverty?language=en), Hans Rosling forces us to notice the unexpected with a series of plots related to world health and economics. <!-- --> ??? The plot shows a scatterplot of life expectancy on the y-axis against fertility rate on the x-axis, with different colors and sizes of points representing different regions of the world (The West, East Asia, Latin America, Sub-Saharan Africa, and Others) and population sizes, respectively. The plot is animated over time (1962-2013), showing how the relationship between life expectancy and fertility rate changes over time for each region. The plot is useful for visualizing how life expectancy and fertility rate have changed over time for different regions of the world, and how different regions compare to one another. --- ## Data visualization using `ggplot2` .panelset[ .panel[.panel-name[Slide] .pull-left[ <img src="img/ggplot2-part-of-tidyverse.png" width="80%" /> ] .pull-right[ - ggplot2 is the tidyverse's data visualization package - it lets you create relatively **complex** and **aesthetically pleasing** plots - its syntax is **intuitive** and comparatively easy to remember. - `gg` in "ggplot2" stands for Grammar of Graphics - Inspired by the book Grammar of Graphics by Leland Wilkinson ]] .panel[.panel-name[Words 1] Throughout the lecture, we will be creating plots using the __ggplot2__^[https://ggplot2.tidyverse.org/] package. Many other approaches are available for creating plots in R. We chose to use __ggplot2__ because it breaks plots into components in a way that permits beginners to create relatively **complex** and **aesthetically pleasing** plots using syntax that is **intuitive** and comparatively easy to remember. One reason __ggplot2__ is generally more intuitive for beginners is that it uses a grammar of graphics^[http://www.springer.com/us/book/9780387245447], the _gg_ in __ggplot2__. This is analogous to the way learning grammar can help a beginner construct hundreds of different sentences by learning just a handful of verbs, nouns and adjectives without having to memorize each specific sentence. Similarly, by learning a handful of __ggplot2__ building blocks and its grammar, you will be able to create hundreds of different plots. ] .panel[.panel-name[Words 2] Another reason __ggplot2__ is easy for beginners is that it is possible to create informative and elegant graphs with relatively simple and readable code. To use __ggplot2__ you will have to learn several functions and arguments. These commands may be hard to memorize, but you can always return to this tutorial and grab the code you want. Or you can simply perform an internet search for [ggplot2 cheat sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf). ]] ??? Throughout the lecture, we will be creating plots using the __ggplot2__^[https://ggplot2.tidyverse.org/] package. Many other approaches are available for creating plots in R. We chose to use __ggplot2__ because it breaks plots into components in a way that permits beginners to create relatively **complex** and **aesthetically pleasing** plots using syntax that is **intuitive** and comparatively easy to remember. One reason __ggplot2__ is generally more intuitive for beginners is that it uses a grammar of graphics^[http://www.springer.com/us/book/9780387245447], the _gg_ in __ggplot2__.
This is analogous to the way learning grammar can help a beginner construct hundreds of different sentences by learning just a handful of verbs, nouns and adjectives without having to memorize each specific sentence. Similarly, by learning a handful of __ggplot2__ building blocks and its grammar, you will be able to create hundreds of different plots. Another reason __ggplot2__ is easy for beginners is that it is possible to create informative and elegant graphs with relatively simple and readable code. To use __ggplot2__ you will have to learn several functions and arguments. These commands may be hard to memorize, but you can always return to this tutorial and grab the code you want. Or you can simply perform an internet search for [ggplot2 cheat sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf). --- ## Next week: Learn how to use `ggplot2` to generate the first example! --- # Readings - [Chapter 4: The tidyverse](http://rafalab.dfci.harvard.edu/dsbook/tidyverse.html) - [Data Wrangling with Tidyverse](https://hbctraining.github.io/Intro-to-R/lessons/tidyverse_data_wrangling.html) - [Chapter 7: Introduction to data visualization](http://rafalab.dfci.harvard.edu/dsbook/introduction-to-data-visualization.html)