class: center, middle, inverse, title-slide

.title[
# Introduction and R Basics
]
.subtitle[
## STA 032: Gateway to data science Lecture 3
]
.author[
### Jingwei Xiong
]
.date[
### April 7, 2023
]

---

<style type="text/css">
.tiny .remark-code {
  font-size: 60%;
}
.small .remark-code {
  font-size: 80%;
}
</style>

## Reminders

- Homework 1 has been assigned (due April 17 at midnight; covers lectures 1-4)
  - Start as soon as possible
  - PDF files only
  - Submit through Gradescope (accessible through Canvas)
- If you get stuck, you are encouraged to discuss the solution with your classmates.
  - But you should write up your solution on your own.
  - If you collaborate with others, write their names in your submission
- Office hours:
  - TBD

---

## Today

- Vector arithmetic
- Logical subsetting
- Installing packages
- Function basics

---

## Questions from last class: NA vs. NaN

- `NaN` means "not a number": there is a result, but it cannot be represented by the computer

```r
0 / 0 # note that 1 / 0 returns Inf
```

```
[1] NaN
```

- `NA` means missing; when working with data sets this is the more common one you will encounter

- `is.na()` returns `TRUE` for both missing values (`NA`) and `NaN`

```r
is.na(0 / 0)
```

```
[1] TRUE
```

```r
NA + NaN
```

```
[1] NA
```

- For more, see https://jameshoward.us/2016/07/18/nan-versus-na-r/.

---

## Recap: Vector

Covered last lecture:

* How to create a vector: `c()`, `1:5`, `seq()`, `vector(length = 7)`
* How to subset a vector using an index
* How to subset a vector using a name

Today we will continue with more on vectors.

???

In our last lecture, we covered some basic concepts related to vectors in R. We learned about various ways of creating a vector, such as using the c() function, 1:5, seq(), and vector(). We also learned how to subset a vector by index and by name. Today, we will continue our discussion of vectors and cover some more topics. Specifically, we will look at factors, dates, and some common operations that can be performed on vectors, such as arithmetic operations and logical operations.
We will also discuss some important functions for working with vectors in R. So, let's dive in!

---

## Vector arithmetic with a constant

In R, arithmetic operations on vectors occur _element-wise_. For a quick example, suppose we have heights in inches:

```r
inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
```

and want to convert them to centimeters. Notice what happens when we multiply `inches` by 2.54:

```r
inches * 2.54
```

```
 [1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80
```

In the line above, we multiplied each element by 2.54. Similarly, if we want to compute how many inches taller or shorter each entry is than 69 inches, the average height for males, we can subtract 69 from every entry like this:

```r
inches - 69
```

```
 [1]  0 -7 -3  1  1  4 -2  4 -2  1
```

---

## Vector arithmetic: Two vectors

If we have two vectors of the same length and we sum them in R, they will be added entry by entry as follows:

$$
`\begin{pmatrix} a\\ b\\ c\\ d \end{pmatrix}` + `\begin{pmatrix} e\\ f\\ g\\ h \end{pmatrix}` = `\begin{pmatrix} a + e\\ b + f\\ c + g\\ d + h \end{pmatrix}`
$$

The same holds for other mathematical operations, such as `-`, `*` and `/`.

---

## Example: Vector arithmetic: Two vectors

```r
x <- c(7, 8, 10, 45)
y <- c(-7, -8, -10, -45)
x + y
```

```
[1] 0 0 0 0
```

```r
x * y
```

```
[1]   -49   -64  -100 -2025
```

```r
x^c(1, 0, -1, 0.5)
```

```
[1] 7.000000 1.000000 0.100000 6.708204
```

???

In the example code shown on the slide, we create two vectors x and y of the same length and perform some arithmetic operations on them. We can see that the addition and multiplication operations are performed element-wise, which means that each element of x is added to or multiplied by the corresponding element of y. We also perform an exponentiation operation on x using a vector of exponents c(1, 0, -1, 0.5).
Again, the exponentiation operation is performed element-wise, so each element of x is raised to the corresponding exponent in the exponent vector.

---

## Recycling

- R will also implicitly coerce the length of vectors.
- This is called vector **recycling**:
  - When a shorter vector is combined with a longer one, the elements of the shorter vector are repeated, or recycled, to make it the same length as the longer vector.

```r
x <- c(7, 8, 10, 45)
x + c(-7, -8)
```

```
[1]  0  0  3 37
```

Single numbers are vectors of length 1 for purposes of recycling:

```r
2*x
```

```
[1] 14 16 20 90
```

???

In addition to performing element-wise operations on vectors, R also has the ability to implicitly coerce the length of vectors. This is achieved through a process known as vector recycling. When a shorter vector is combined with a longer one, R will repeat or recycle the elements of the shorter vector until it is the same length as the longer vector.

In the example code shown on the slide, we create a vector x of length 4 and add it to a vector of length 2. R automatically recycles the shorter vector to make it the same length as the longer vector, so the addition operation can be performed.

We also show that single numbers are considered vectors of length 1 for the purposes of vector recycling. In the example, we multiply x by the scalar value 2, and R recycles the value of 2 to make it the same length as x.

---

## Vectorized functions: Examples

Most built-in functions are vectorized, meaning that they operate on a whole vector of numbers at once.
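
As a minimal sketch of what "vectorized" means, the code below applies `sqrt()` to a whole vector in one call and compares the result with an explicit loop. The vector `v` and the names `roots` and `roots_loop` are made up for this illustration only.

```r
# Hypothetical vector, used only for this illustration
v <- c(1, 4, 9, 16)

# One call processes every element; no loop is needed
roots <- sqrt(v)
roots

# The loop below computes the same thing one element at a time
roots_loop <- numeric(length(v))
for (i in seq_along(v)) {
  roots_loop[i] <- sqrt(v[i])
}

# Both approaches give the same vector
identical(roots, roots_loop)
```

The vectorized call is shorter and is the form you will normally see in R code.
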
Here are some examples of functions taking a vector as input:

```r
sample(1:10) + 100
```

```
 [1] 107 110 102 109 103 104 101 106 105 108
```

<small>(what does `sample()` do?)</small>

Operators also work as vectorized functions:

```r
x
```

```
[1]  7  8 10 45
```

```r
x > 9 # pairwise comparisons, where the scalar 9 is recycled
```

```
[1] FALSE FALSE  TRUE  TRUE
```

---

## Vectorized functions

Lots of functions take vectors as arguments:

- `mean()`, `median()`, `sd()`, `var()`, `max()`, `min()`, `length()`, `sum()`: return single numbers
- `sort()` returns a new vector
- `hist()` takes a vector of numbers and produces a histogram
- `summary()` gives a five-number summary plus the mean for numerical vectors
- `any()` and `all()` are useful on Boolean vectors

???

Some common functions that take vectors as arguments include mean(), median(), sd(), var(), max(), min(), length(), and sum(). These functions return a single number that summarizes some aspect of the vector. For example, mean() returns the arithmetic mean of the values in the vector, while length() returns the number of elements in the vector.

The sort() function is another function that takes a vector as an argument and returns a new vector with the same elements sorted in ascending or descending order. The hist() function is used to create a histogram of the values in a vector, and the summary() function provides a summary of numerical vectors, including the minimum, maximum, median, mean, and quartiles.

Finally, the any() and all() functions are useful for Boolean vectors. any() returns TRUE if at least one element of a Boolean vector is TRUE, and FALSE otherwise. all() returns TRUE if all elements of a Boolean vector are TRUE, and FALSE otherwise.

---

### Vector subsetting using conditions

Because Boolean operators work element-wise, we can use logical operators and comparison operators to specify a condition, and R will return a Boolean vector that is TRUE for the elements that meet the condition and FALSE for the others.

```r
ages <- c(23, 35, 28, 19, 42, 30, 38, 27)
(ages > 25) & (ages < 40)
```

```
[1] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
```

* Use the Boolean vector as an index for the original vector.

```r
ages[(ages > 25) & (ages < 40)]
```

```
[1] 35 28 30 38 27
```

* To count the elements that satisfy a condition, apply `sum()` to the Boolean vector. This works because R coerces the TRUE and FALSE values to 1 and 0, respectively.

```r
sum((ages > 25) & (ages < 40)) # another example of coercion
```

```
[1] 5
```

???

Boolean operators in R work element-wise, which means that we can use logical and comparison operators to specify a condition, and R will return a Boolean vector marking which elements satisfy the condition.

In the example code shown on the slide, we create a vector of ages called ages and use the comparison operators > and < to create a logical condition that returns TRUE for ages between 25 and 40. We can see that this returns a Boolean vector that is TRUE for the elements that meet the condition and FALSE for the others.

To actually subset the vector and return the elements that meet the condition, we can use the Boolean vector as an index for the original vector. In the example code, we use ages[(ages > 25) & (ages < 40)] to subset the ages vector and return the ages between 25 and 40.

To get the number of elements that satisfy a certain condition, we can use the sum() function and pass in the Boolean vector as an argument. This works because R coerces the TRUE and FALSE values to 1 and 0, respectively. In the example code, we use sum((ages > 25) & (ages < 40)) to return the number of ages between 25 and 40.

---

## The which() Function

This function returns the indices of the elements in a vector that satisfy a certain condition.
For example, let's say we have a vector of test scores:

```r
scores <- c(78, 82, 91, 64, 87, 76, 93, 80)
```

We can use the which() function to return the indices of the scores that are greater than 80:

```r
which(scores > 80)
```

```
[1] 2 3 5 7
```

???

Another useful function for working with vectors in R is the which() function. This function returns the indices of the elements in a vector that satisfy a certain condition.

For example, let's say we have a vector of test scores: `scores <- c(78, 82, 91, 64, 87, 76, 93, 80)`. We can use the which() function to return the indices of the scores that are greater than 80: `which(scores > 80)`. This will return a new vector containing the indices 2, 3, 5, and 7, which correspond to the scores that are greater than 80.

---

The related functions which.min() and which.max() return the index of the minimum or maximum value in a vector.

```r
which.min(scores)
```

```
[1] 4
```

With these indices we can easily subset the vector.

```r
scores[which(scores > 80)] # subset by index
```

```
[1] 82 91 87 93
```

```r
scores[scores > 80] # subset by Boolean values
```

```
[1] 82 91 87 93
```

```r
# Why are the two results above the same?
scores[which.max(scores)]
```

```
[1] 93
```

???

In addition to which(), which returns the indices of the elements that satisfy a certain condition, the related functions which.min() and which.max() return the index of the minimum or maximum value in a vector.

In the example code, we have a vector of test scores called scores, and we use the which.min() function to return the index of the minimum score. We can then use this index to subset the scores vector and return the value of the minimum score.

We can also use the which() function together with logical or comparison operators to return the indices of elements that meet certain conditions. In the example code, we use which(scores > 80) to return the indices of the scores that are greater than 80.
Once we have the indices of the elements that meet a certain condition, or of the minimum or maximum value, we can use them to subset the vector. In the example code, we use the index vector returned by which(scores > 80) to subset the scores vector and return the scores that are greater than 80.

We can also use Boolean values to subset the vector directly, as shown in the example code. When we use Boolean values to subset the vector, R automatically returns the values that meet the condition and discards the ones that do not.

Finally, we use the which.max() function to return the index of the maximum value of the vector, and then use that index to return the maximum value of the scores vector.

---

## The %in% Operator

Another useful operator for working with vectors in R is the %in% operator. This operator allows us to test whether the elements of one vector are present in another vector.

For example, let's say we have a vector of names:

```r
names <- c("Alice", "Bob", "Charlie", "David", "Eve")
```

We can use the %in% operator to test whether each element of a vector of search terms is present in the names vector:

```r
search_terms <- c("Bob", "Eve", "Frank")
search_terms %in% names
```

```
[1]  TRUE  TRUE FALSE
```

```r
names[names %in% c("Bob", "Charlie", "Frank")]
```

```
[1] "Bob"     "Charlie"
```

???

Another useful operator for working with vectors in R is the %in% operator. This operator allows us to test whether the elements of one vector are present in another vector.

For example, let's say we have a vector of names: `names <- c("Alice", "Bob", "Charlie", "David", "Eve")`. We can use the %in% operator to test whether each element of a vector of search terms is present in the names vector: `search_terms <- c("Bob", "Eve", "Frank")` followed by `search_terms %in% names`. This will return a Boolean vector that is TRUE for the search terms that are present in the names vector and FALSE for the others.
In this example, the output is TRUE TRUE FALSE, indicating that "Bob" and "Eve" are present in the names vector, but "Frank" is not.

We can also use the %in% operator to subset a vector based on whether its elements are present in another vector. For example, we can use the names[names %in% c("Bob", "Charlie", "Frank")] expression to subset the names vector and return the names that are present in the search terms vector.

In summary, the %in% operator is a useful tool for working with vectors in R, and it can be used to test whether the elements of one vector are present in another vector or to subset a vector based on whether its elements are present in another vector.

---

## Comparison operators

When we want to compare two vectors element-wise, we can use comparison operators like `==`.

However, to compare **whole** vectors, it is best to use `identical()`:

```r
x; y
```

```
[1]  7  8 10 45
```

```
[1]  -7  -8 -10 -45
```

```r
x == -y
```

```
[1] TRUE TRUE TRUE TRUE
```

```r
identical(x, -y)
```

```
[1] TRUE
```

???

When we want to compare two vectors element-wise, we can use comparison operators like ==. However, when we want to compare two whole vectors, it's best to use functions like identical() or all.equal().

In the example code shown on the slide, we have two vectors called x and y. We use the operator == to compare x and -y element-wise. This does not give a single answer: instead, we get a Boolean vector that is TRUE for the elements that satisfy the condition x[i] == -y[i] and FALSE for the others.

To compare the two vectors as a whole, we can use the identical() function. This function returns TRUE if the two vectors are exactly the same, and FALSE otherwise. In the example code, we use identical(x, -y) to check whether the two vectors are equal, and we get a single Boolean value of TRUE, because every element of x equals the corresponding element of -y.

---

## Example 1: counting the number of missing values

It's common to encounter missing values (NA) in vectors when working with real-world data.
In R, we can use the is.na() function to check for missing values in a vector, and the sum() function to count the number of missing values.

```r
myNAvec <- c(1, 2, NA, 4, NA)
is.na(myNAvec)
```

```
[1] FALSE FALSE  TRUE FALSE  TRUE
```

```r
sum(is.na(myNAvec))
```

```
[1] 2
```

???

It's common to encounter missing values (NA) in vectors when working with real-world data. In R, we can use the is.na() function to check for missing values in a vector, and the sum() function to count the number of missing values.

In the example code shown on the slide, we have a vector called myNAvec that contains the values 1, 2, NA, 4, NA. We use the is.na() function to create a Boolean vector that is TRUE for the missing values and FALSE for the other values. We can see that this returns a vector that has the same length as the myNAvec vector and is TRUE for the third and fifth elements.

To count the number of missing values in the myNAvec vector, we can use the sum() function on the Boolean vector returned by is.na(). This works because R coerces the TRUE and FALSE values to 1 and 0, respectively.

---

## Example 2: Calculate the summation

Suppose we want to compute: `\(1+\frac{1}{2^2}+\frac{1}{3^2}+...+\frac{1}{1000^2}\)`

We first define a vector containing the numbers from `\(1\)` to `\(1000\)`, then square that vector, and then divide the single number `\(1\)` by the squared vector.

By vector arithmetic, the result will be `\(1\)` divided by each element of the squared vector, which is the vector `\(1,\frac{1}{2^2},...,\frac{1}{1000^2}\)`.

Then, using the function `sum`, we obtain the sum. In one line, it is:

```r
sum(1 / (1:1000)^2 )
```

```
[1] 1.643935
```

---

## Installing R packages

* What you get after your first install is base R
* Extra functionality comes from add-ons available from developers
* R makes it very easy to install packages from within R.
For example, type this in the console:

```r
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("dslabs")
```

After we install a package, we can load it into our R session using the `library` function:

```r
library(tidyverse)
library(dslabs)
```

If you want to use the add-on functions in a package, you need to load the package with `library()` first.

???

When we first install R, we get a base set of functionality that includes the core functions and data types. However, additional functionality can be added to R through packages developed by third-party developers.

R makes it very easy to install packages from within R using the install.packages() function. In the example code shown on the slide, we install three popular packages called tidyverse, ggplot2, and dslabs.

Once we have installed a package, we need to load it into our R session using the library() function. This makes the functions and data types in the package available for use. In the example code, we load the tidyverse and dslabs packages using the library() function.

It's important to note that if we want to use the functions and data types in a package, we need to load the package using the library() function first. Otherwise, we will get an error message indicating that the functions and data types are not found.

---

## Functions introduction

R has a large collection of built-in functions that are called like this:

```r
function_name(arg1 = val1, arg2 = val2, ...)
```

.panelset[
.panel[.panel-name[1]

* The data analysis process is a series of **functions** applied to the data.
* We also used the function `sqrt` to solve the quadratic equation.
* Prebuilt R functions **do not** appear in the workspace because you did not define them, but they are available for immediate use.

]
.panel[.panel-name[2]

In general, we need to use **parentheses** to evaluate a function. If you type `ls`, the function is not evaluated and instead R shows you the code that defines the function.

```r
ls
```

]
.panel[.panel-name[3]

If you type `ls()`, the function is evaluated and we see the objects in the workspace.

```r
ls()
```

```
[1] "ages"         "inches"       "myNAvec"      "names"        "scores"      
[6] "search_terms" "x"            "y"           
```

]
.panel[.panel-name[4]

Unlike `ls`, most functions require one or more __arguments__. Below is an example of how we assign an object to the argument of the function `log`. Remember that we earlier defined `a` to be 1:

```r
a <- 1
log(8)
```

```
[1] 2.079442
```

```r
log(a)
```

```
[1] 0
```

]
.panel[.panel-name[5]

Some arguments are required and others are optional. You can determine which arguments are optional by noting in the help document that a default value is assigned with `=`. Defining these is optional.

For example, the base of the function `log` defaults to `base = exp(1)`, making `log` the natural log by default.

You can override a default value by passing a different one:

```r
log(8, base = 2)
```

```
[1] 3
```

]
.panel[.panel-name[6]

Note that we have not been specifying the argument `x` as such:

```r
log(x = 8, base = 2)
```

```
[1] 3
```

The above code works, but we can save ourselves some typing: if no argument name is used, R assumes you are entering arguments in the order shown in the help file.

]
.panel[.panel-name[7]

So when the names are omitted, R assumes the arguments are `x` followed by `base`:

```r
log(8,2)
```

```
[1] 3
```

If we use the arguments' names, then we can include them in whatever order we want:

```r
log(base = 2, x = 8)
```

```
[1] 3
```

To specify arguments, we must use `=`, and cannot use `<-`.

]
]

---

## R Markdown Revisit

In RStudio, you can start an R Markdown document by clicking on **File, New File, then R Markdown**.

You will then be asked to enter a title and author for your document. You can also decide what format you would like the final report to be in: HTML, PDF, or Microsoft Word.

RStudio will then generate a template file.

As a convention, we use the **Rmd suffix** for these files.
In the template, you will see several things to note.

---

# The YAML header

At the top you see:

```
---
title: "Untitled"
author: 'Jingwei Xiong'
date: "2023/1/10"
output: html_document
---
```

Everything between the `---` lines is the header. We don't actually need a header, but it is often useful. You can define many more things in the header than what is included in the template. We don't discuss those here, but much information is available online.

The one parameter that we will highlight is `output`. By changing this to, say, `pdf_document`, we can control the type of output that is produced when we compile.

---

# R code chunks

In various places in the document, we see something like this:

````
```{r}
summary(pressure)
```
````

These are the code chunks. When you compile the document, the R code inside the chunk, in this case `summary(pressure)`, will be evaluated and the result included at that position in the final document.

This applies to plots as well; the plot will be placed at that position. We can write something like this:

````
```{r}
plot(pressure)
```
````

---

By default, the code will show up as well. To avoid having the code show up, you can use the chunk argument `echo=FALSE`. For example:

````
```{r, echo=FALSE}
summary(pressure)
```
````

If you want to show the code but not run it, you can use the argument `eval=FALSE`.

````
```{r, eval=FALSE}
summary(pressure)
```
````

By default, the code will run and the output will be shown.

---

# Knit your first rmd file

After you get the template Rmd file, click the **Knit** button:

<img src="1.png" width="70%" height="70%">

This button will process your source code into the final document, provided your code has no errors.

---

# Insert a new code chunk

To insert a new code chunk, click this:

<img src="2.png" width="70%" height="70%">

This button will insert a new code chunk at the current cursor line.

---

# Run scripts in the code chunk

To run the code in a code chunk, click this:

<img src="4.png" width="70%" height="70%">

This button will copy all of the code inside that code chunk into the console and run it.

---

# Knit settings

You can find the knit settings here:

<img src="3.png" width="50%" height="50%">

You can switch to Word output by changing the output format.

---

## Revisit: R markdown and console environment

The environment of your R Markdown document is separate from the Console!

.pull-left[
First, run the following in the console

.small[
```r
x <- 2
x * 3
```
]
]

--

.pull-right[
Then, add the following in an R chunk in your R Markdown document and try to knit it.

.small[
```r
x * 3
```
]

.question[
What happens? Why the error?
]
]

???

When adding the x * 3 command to an R chunk in the R Markdown document, an error will occur because the variable x was not defined within the R Markdown environment. The R Markdown environment is separate from the console environment, so any variables created or functions defined in the console will not carry over to the R Markdown document unless they are specifically included or imported.

---

### Explanation:

> When adding the `x * 3` command to an R chunk in the R Markdown document, an error will occur because the variable x was not defined within the R Markdown environment. The R Markdown environment is separate from the console environment, so any variables created or functions defined in the console will **not carry over to the R Markdown document** unless they are specifically included or imported.

---

## Example on Rmarkdown: How to do the homework

Create new code chunks; run the code; write your own response.

> Remember: once you have confirmed that the code runs correctly in the console, put it into the code chunk for that problem.

> If you run into an "object not found" error when knitting the document, it is because you defined the variable in your console but did not include its definition in the Rmd file.
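
The "object not found" situation can be imitated inside a single R session. The sketch below is only a rough analogy, not how knitting is actually implemented: it assumes that a fresh environment (the hypothetical name `fresh_env`) stands in for the clean setting used when knitting, while `console_env` (also hypothetical) stands in for the console.

```r
# Stand-in for the console: x is defined here
console_env <- new.env()
assign("x", 2, envir = console_env)
console_result <- eval(quote(x * 3), envir = console_env)
console_result

# Stand-in for knitting: a fresh environment that has never seen x.
# Its parent is baseenv(), so operators such as `*` are still found.
fresh_env <- new.env(parent = baseenv())
knit_result <- tryCatch(
  eval(quote(x * 3), envir = fresh_env),
  error = function(e) e
)
conditionMessage(knit_result) # reports that x cannot be found

# The fix: define x in the same environment before using it
assign("x", 2, envir = fresh_env)
fixed_result <- eval(quote(x * 3), envir = fresh_env)
fixed_result
```

In a homework file, the corresponding fix is to put `x <- 2` in a chunk that appears before the chunk that uses `x`.
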

---

# Readings

- R for Data Science Chapter 20, 27
- Additional reading: Matloff Chapter 2
- [Chapter 2: R basics](http://rafalab.dfci.harvard.edu/dsbook/r-basics.html)
- [R markdown tutorial](https://rmarkdown.rstudio.com/lesson-1.html)