class: center, middle, inverse, title-slide .title[ # Describing Numerical and Categorical Data ] .subtitle[ ##
STA 032: Gateway to data science Lecture 11 ] .author[ ### Jingwei Xiong ] .date[ ### April 26, 2023 ] --- <style type="text/css"> .tiny .remark-code { font-size: 60%; } .small .remark-code { font-size: 80%; } </style> ## Reminders - HW 2 due April 26 12pm. - HW 3 due May 3 12pm. - Please start the homework as soon as possible. - **Midterm 1** (Open book, take home, approximate 2 hours, time limit 5 hours) - **Due April 29 midnight, cover lecture 1-12** - Preparing guides: Finish Homework 1-3, be familiar to the lecture slides. - You can copy any your own codes in homework 1-3 to finish the open book exam. - The PDF version of slides can be found on Canvas. - You can use search on it to find function examples. --- ## Recap - Relationships between numerical variables - Correlation coefficient - Line graph - Describing categorical distributions - Bar plot - Relationships between categorical data - Contingency tables --- ## Today - Relationships between categorical data - Stacked bar plot - Color pallettes - side by side bar plot - [Customizing bar plots](#customizing-slide) - Relationships between numerical and categorical data - Fill and facet - Side-by-side boxplots - Other fancy plots - Numerical summaries in R ??? Firstly, we will discuss how to create a stacked bar plot to represent the relationship between two categorical variables. Next, we will move on to explore the relationship between numerical and categorical variables by creating fill and facet plots. We will also learn how to create side-by-side boxplots to visualize the differences between numerical variables by categorical variables. Additionally, we will look at some other fancy plots that can be used for this purpose. Finally, we will discuss the different numerical summaries that can be used in R to summarize and compare numerical data by categorical variables. --- ## Data: Lending Club - Lending Club is a platform that allows individuals to lend to other individuals ```r loans <- loans_full_schema %>% select(loan_amount, interest_rate, term, grade, state, annual_income, homeownership, debt_to_income, issue_month) glimpse(loans) ``` ``` Rows: 10,000 Columns: 9 $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, 24000, 20000, 20… $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, 13.59, 11.99, 1… $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, 60, 60, 36, 60,… $ grade <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B, D, D, D, F, E… $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, IL, FL, SC, CO,… $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34000, 35000, 110000… $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, MORTGAGE, MORTGA… $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19, 3… $ issue_month <fct> Mar-2018, Feb-2018, Feb-2018, Jan-2018, Mar-2018, Jan-2… ``` ??? We will still use the Lending Club data. --- ## Stacked bar plot .panelset[ .panel[.panel-name[Overview] - A stacked bar plot looks at numeric values across two categorical variable - Each bar in a standard bar plot is divided into stacked sub-bars, each one corresponding to a level of the second categorical variable. ] .panel[.panel-name[Plot] ```r ggplot(loans, aes(x = homeownership, * fill = grade)) + geom_bar() ``` <img src="lecture11_files/figure-html/unnamed-chunk-5-1.png" width="504" /> ] .panel[.panel-name[Change color] Change the color with `scale_fill_viridis_d(option = "D")` or `scale_colour_viridis_d(option = "D")`. - option: A character string indicating the color map option to use. Eight options are available: - "magma" (or "A") - "inferno" (or "B") - "plasma" (or "C") - "viridis" (or "D") - "cividis" (or "E") - "rocket" (or "F") - "mako" (or "G") - "turbo" (or "H") ] .panel[.panel-name[Plot 2] ```r ggplot(loans, aes(x = homeownership, fill = grade)) + geom_bar() + * scale_fill_viridis_d(option = "E") ``` <img src="lecture11_files/figure-html/unnamed-chunk-6-1.png" width="504" /> ] ] ??? In a stacked bar plot, numeric values are compared across two categorical variables. Each bar in the plot is divided into stacked sub-bars, where each sub-bar corresponds to a level of the second categorical variable. In the example plot shown on this slide, the x-axis represents the categorical variable homeownership, and the fill color represents the categorical variable grade. The plot shows the distribution of loans across different grades for each level of homeownership. We can Change the color with `scale_fill_brewer(palette = "name")` or `scale_color_brewer(palette = "name")`. It Control the `fill` or `color` respectively. For bar chart we use `fill` in aes function so we need to use the scale_fill_brewer. Here are some possible parameters for palette. The name need to be quoted. --- ## Stacked bar plot: ordinal variable fill .panelset[ .panel[.panel-name[ordinal variable] Turning `grade` into an ordered variable makes `ggplot` use the `viridis` scale by default. The `ggplot` will also try to order the variable in plot. ```r str(loans$grade) ``` ``` Factor w/ 8 levels "","A","B","C",..: 4 4 5 2 4 2 4 3 4 2 ... ``` ```r loans <- loans %>% mutate(grade = factor(grade, ordered = TRUE)) str(loans$grade) ``` ``` Ord.factor w/ 7 levels "A"<"B"<"C"<"D"<..: 3 3 4 1 3 1 3 2 3 1 ... ``` ] .panel[.panel-name[Example] ```r ggplot(loans, aes(x = homeownership, * fill = grade)) + geom_bar() ``` <img src="lecture11_files/figure-html/unnamed-chunk-8-1.png" width="504" /> ]] ??? The ggplot will also try to order the grade variable in the plot. By default, ggplot uses the viridis scale when the filling variable is ordered. In the example, we have converted the grade variable to an ordered factor using factor(grade, ordered = TRUE). Then we are using ggplot to create a stacked bar plot of the homeownership variable and filling it with grade. --- ## Side by side bar plot .panelset[ .panel[.panel-name[Side by side] ```r ggplot(loans, aes(x = homeownership, fill = grade)) + * geom_bar(position = 'dodge') ``` <img src="lecture11_files/figure-html/unnamed-chunk-9-1.png" width="504" /> ] .panel[.panel-name[Swap the categories] ```r ggplot(loans, aes(x = grade, fill = homeownership)) + * geom_bar(position = 'dodge') ``` <img src="lecture11_files/figure-html/unnamed-chunk-10-1.png" width="504" /> ] ] ??? The first one shows the distribution of loans based on the grade and homeownership. The second one swaps the two categorical variables, so now we have the distribution of loans based on homeownership for each grade level. Both plots use the position = 'dodge' argument to show the bars side-by-side. --- ## Stacked bar plot Adding `position = "fill"` argument changes visualization to proportions, and standardizes the height of columns into proportions. ```r ggplot(loans, aes(x = homeownership, fill = grade)) + * geom_bar(position = "fill") ``` <img src="lecture11_files/figure-html/unnamed-chunk-11-1.png" width="504" /> --- ## Stacked bar plot: counts vs. proportions Which bar plot is a more useful representation for visualizing the relationship between homeownership and grade? .pull-left[ <img src="lecture11_files/figure-html/unnamed-chunk-12-1.png" width="100%" /> ] .pull-right[ <img src="lecture11_files/figure-html/unnamed-chunk-13-1.png" width="100%" /> ] If there were no relationship between homeownership and grade, we would expect to see the bars to be similar lengths across the homeownership status (columns). --- ## Stacked bar plot: counts vs. proportions Is there a relationship between homeownership and grade? .pull-left[ <img src="lecture11_files/figure-html/unnamed-chunk-14-1.png" width="100%" /> ] .pull-right[ <img src="lecture11_files/figure-html/unnamed-chunk-15-1.png" width="100%" /> ] --- name: customizing-slide ## Customizing bar plots .panelset[ .panel[.panel-name[Plot] <img src="lecture11_files/figure-html/unnamed-chunk-16-1.png" width="504" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(y = homeownership, fill = grade)) + geom_bar(position = "fill") + * labs( * x = "Proportion", * y = "Homeownership", * fill = "Grade", * title = "Grades of Lending Club loans", * subtitle = "and homeownership of lendee" * ) + * scale_fill_viridis_d(option = "D") ``` ] .panel[.panel-name[Change color] Change the color with `scale_fill_viridis_d(option = "D")` or `scale_colour_viridis_d(option = "D")`. - option: A character string indicating the color map option to use. Eight options are available: - "magma" (or "A") - "inferno" (or "B") - "plasma" (or "C") - "viridis" (or "D") - "cividis" (or "E") - "rocket" (or "F") - "mako" (or "G") - "turbo" (or "H") ] ] ??? We can make additional customizations include setting the x, y axis labels, legend label of fill, and plot title using labs(), and using the scale_fill_brewer() function to set a color palette for the plot. --- ## Relationships between numerical and categorical data - We saw histograms, boxplots, and density plots earlier, for describing a single numerical variable - To look at relationships between these numerical data and a categorical variable, we can: - Fill and facet histograms and density plots - Use side-by-side boxplots - Violin plots - Ridge plots - Numerical summaries - `group_by()` and output table ??? Then we will look at how to visualize the relationship between a numerical variable and a categorical variable. We have seen histograms, boxplots, and density plots earlier for describing a single numerical variable. We can now extend these to look at relationships between numerical and categorical data. With ggplot2 we can easily create powerful and informative plots to illustrate relationships inside dataset. Some ways to do this include fill and facet histograms and density plots, side-by-side boxplots, violin plots, and ridge plots. Additionally, we will also look at numerical summaries of data using group_by() function in R. --- ## Fill a histogram with a categorical variable Is there a relationship between loan amount and home-ownership status? .panelset[ .panel[.panel-name[Plot] <img src="lecture11_files/figure-html/unnamed-chunk-17-1.png" width="504" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount, * fill = homeownership)) + geom_histogram(binwidth = 5000, * alpha = 0.5) + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) + scale_fill_viridis_d() ``` ] ] ??? We ask one question: Is there a relationship between loan amount and home-ownership status? To explore it, we can use a histogram and fill it with the homeownership status. However, the histogram bars are overlapping, which makes it difficult to see the distribution of each category separately. --- ## Fill a histogram with a categorical variable Is there a relationship between loan amount and home-ownership status? - Need `position = "dodge"` argument if we don't want histogram bars to be stacked on top of one another .panelset[ .panel[.panel-name[Plot] <img src="lecture11_files/figure-html/unnamed-chunk-18-1.png" width="504" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount, fill = homeownership)) + * geom_histogram(position = "dodge", binwidth = 5000, alpha = 0.5) + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) + scale_fill_viridis_d() ``` ] ] ??? To address this issue, we can use the position = "dodge" argument to create side-by-side histograms for each category, as shown in the second plot. From this plot we can realize though all 3 ownership happens most on Loan amount around 10000, when Loan amount increase, the OWN type seems to decrease more than the mortage. --- ## Facet a histogram with a categorical variable Is there a relationship between loan amount and home-ownership status? .panelset[ .panel[.panel-name[Plot] <img src="lecture11_files/figure-html/unnamed-chunk-19-1.png" width="504" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount, fill = homeownership)) + geom_histogram(binwidth = 5000) + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) + * facet_wrap(~ homeownership, nrow = 3) + scale_fill_viridis_d() ``` ] ] ??? Another way is to facet by the category variable. Here we demonstrate ow we can facet a histogram with a categorical variable using ggplot2 in R using the facet_wrap function to create a separate histogram for each level of the homeownership variable. --- ## Filling density plots with a categorical variable Is there a relationship between loan amount and home-ownership status? .panelset[ .panel[.panel-name[Plot] <img src="lecture11_files/figure-html/unnamed-chunk-20-1.png" width="504" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount, * fill = homeownership)) + geom_density(adjust = 2, # adjust for smooth parameter * alpha = 0.5) + labs( x = "Loan amount ($)", y = "Density", title = "Amounts of Lending Club loans", * fill = "Homeownership" ) + scale_fill_viridis_d() ``` ] ] ??? In this slide, we are filling density plots with a categorical variable to explore the relationship between loan amount and home-ownership status. The plot shows the density distribution of loan amounts for each category of the homeownership variable. We use the fill argument in the ggplot() function to specify the categorical variable, homeownership. We also set alpha to 0.5 to make the overlapping density plots more visible, and adjust to 2 to control the degree of smoothing. Finally, we use scale_fill_viridis_d() to set the color palette for the fill variable. --- ## Side-by-side boxplots Is there a relationship between loan amount and home-ownership status? .panelset[ .panel[.panel-name[Plot] <img src="lecture11_files/figure-html/unnamed-chunk-21-1.png" width="504" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount, * y = homeownership)) + geom_boxplot() + labs( x = "Loan amount ($)", y = "Home-ownership status", * title = "Amounts of Lending Club loans" ) ``` ] ] ??? We can use side-by-side boxplots to compare the distribution . The x-axis shows the loan amount, and the y-axis shows the categories of the homeownership variable. By making X axis continuous, Y axis category, we make the boxplot horizontal. --- ## Violin plots Is there a relationship between loan amount and home-ownership status? - A violin plot is a hybrid of a boxplot and a density plot ```r ggplot(loans, aes(x = homeownership, y = loan_amount)) + geom_violin() ``` <img src="lecture11_files/figure-html/unnamed-chunk-22-1.png" width="504" /> ??? In addition to boxplots, we can also use violin plots to look at the relationship between loan amount and home-ownership status. A violin plot is a hybrid of a boxplot and a density plot, providing information on both the distribution and the quartile values. Here, we can see that the distribution of loan amount is quite similar for the two groups, with slightly higher values for non-homeowners. --- #### Ridge plots ```r library(ggridges) ggplot(loans, aes(x = loan_amount, y = homeownership, fill = homeownership)) + geom_density_ridges(alpha = 0.5) + scale_fill_viridis_d() ``` <img src="lecture11_files/figure-html/unnamed-chunk-23-1.png" width="504" /> ??? You can also use a ridge plot to display the distribution of loan amounts for each homeownership status. A ridge plot is a type of density plot that allows multiple distributions to be plotted in a single chart by stacking them on top of each other. In this case, each distribution corresponds to a different homeownership status. The ggridges package provides an easy way to create ridge plots in ggplot2. --- #### Ridge plots Ridge plots can also be used to investigate more complicated relationships, such as those between a categorical and numerical variable, conditional on another categorical variable Here, we consider the relationship between loan grade and loan amount, conditional on each level of home ownership ```r ggplot(loans, aes(x = loan_amount, y = homeownership, fill = grade, color = grade)) + geom_density_ridges(alpha = 0.5) ``` <img src="lecture11_files/figure-html/unnamed-chunk-24-1.png" width="504" /> --- #### Ridge plots Here, we consider the relationship between loan grade and loan amount, conditional on each level of home ownership ```r ggplot(loans, aes(x = loan_amount, y = homeownership, fill = grade, color = grade)) + geom_density_ridges(alpha = 0.5) ``` <img src="lecture11_files/figure-html/unnamed-chunk-25-1.png" width="504" /> Interestingly, those who had mortgages tend to have a higher proportion of grade G loans that have higher loan amounts. --- ## Numerical summaries in R, grouping by a categorical variable Question: `homeownership` is a factor variable with three levels, `MORTGAGE`, `OWN` and `RENT`. How do we calculate the mean loan amount for each type of home ownership status? -- ```r loans %>% group_by(homeownership) %>% summarize(meanLoan = mean(loan_amount)) ``` ``` # A tibble: 3 × 2 homeownership meanLoan <fct> <dbl> 1 MORTGAGE 18129. 2 OWN 15684. 3 RENT 14406. ``` -- But how to make the result output as a table, not in R output format? --- ## knitr::kable function for output We can use the `knitr::kable` function. ```r t1 = loans %>% group_by(homeownership) %>% summarize(meanLoan = mean(loan_amount)) knitr::kable(t1, caption = "Summary table 1", digits=1, align = "l") ``` Table: Summary table 1 |homeownership |meanLoan | |:-------------|:--------| |MORTGAGE |18129.0 | |OWN |15683.9 | |RENT |14406.2 | --- ## What descriptive statistics to use? Which plots to produce? - Fancier does not always mean better; a pretty plot can look great but tell us nothing. - Complex does not always mean better; you may be lost in too much information. - Think about what question you are trying to answer and pick the figure that best suits the purpose. - Hadley Wickham on exploratory data analysis: "EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind." (Chapter 7, R for Data Science) - Extra examples: [Chapter 11 Data visualization principles](http://rafalab.dfci.harvard.edu/dsbook/data-visualization-principles.html) --- ## Other fancy plots example: * [Pie chart](https://r-graph-gallery.com/piechart-ggplot2.html) * [The hourly heatmap](https://r-graph-gallery.com/283-the-hourly-heatmap.html) * [Bubble plot](https://r-graph-gallery.com/320-the-basis-of-bubble-plot.html) * [Choropleth Map](https://r-graph-gallery.com/choropleth-map.html) --- ## Summary -- - Relationships between categorical data - Contingency tables - Stacked bar plot - Relationships between numerical and categorical data - Fill and facet - Side-by-side boxplots - Other fancy plots - Numerical summaries in R --- ## Reading - [Chapter 10 Data visualization in practice](http://rafalab.dfci.harvard.edu/dsbook/gapminder.html) - [Chapter 11 Data visualization principles](http://rafalab.dfci.harvard.edu/dsbook/data-visualization-principles.html)