## 1.2 Introduction to R

R is a powerful language that is widely used for data analysis and statistical computing. It was developed in the early 1990s by Ross Ihaka and Robert Gentleman, as a successor to S, a statistical programming language.

The inclusion of sophisticated packages (such as dplyr, tidyr, readr, data.table, SparkR, ggplot2, etc.) has made R both more powerful and more useful, allowing for smart data manipulation, visualization, and computation, using its built-in data structures and functionality.

Notably, it has gained prominence as a free and open source alternative to expensive statistical software.

### 1.2.1 Why Use R?

Here are some benefits that potential users might note:

• the style of coding is intuitive;

• R is open source and free;

• more than 18,500 packages, customized for various computation tasks, are available (as of February 2022);

• the R community is overwhelmingly welcoming and helpful to new users and experienced users alike (you can browse and ask questions at StackOverflow, and consult worked-out examples on R-bloggers, for instance);

• high-performance computing is possible (with the appropriate packages), and

• it is one of the most highly sought skills by analytics and data science companies.

### 1.2.2 Installing R / RStudio

Note: If you have a pre-existing installation of R and/or RStudio, you may skip this part. However, we highly recommend that both of these applications be upgraded to the most recent version, if they have not been upgraded for a while.

Consult Section 1.2.5 for details.

Data analysis can be conducted using only the vanilla (base) version of R, but the addition of RStudio provides a much better coding experience, in our opinion.

The following steps will allow you to install R and RStudio.

1. Download and install R at https://cloud.r-project.org/.

• Windows users should click on Download R for Windows, then click on base, then click on the Download R X.X.X for Windows link, where R X.X.X is the version number. For example, the latest version of R as of 2022-02-07 was R 4.1.2;

• macOS users should click on Download R for macOS, then on R-X.X.X.pkg (under “Latest release:”), where R-X.X.X is the version number. If the Mac has an Arm-based M1 chip, choose R-X.X.X-arm64.pkg instead;

• Linux users should click on Download R for Linux and choose the specific distribution for more information on installing R for their actual setup.

2. Download and install RStudio:

• look for the big blue button that says DOWNLOAD RSTUDIO FOR …, where … represents the desired operating system;

• save the .dmg file, double-click it to open, and follow the installation instructions (you may need to restart your computer).

• Reminder: you will need to re-install XQuartz when upgrading your macOS to a new major version.

4. Even with both R and RStudio installed, we will refrain from working directly with the R interface, given that RStudio provides such a “nice” shell over the engine that is R.

Once RStudio is opened, the graphical user interface (GUI) displays four panes, as in Figure 1.5.

• Console: bottom left; this area shows the output of code that has been run (either from the command line in the console or from the script window);

• Script: top left; as the name suggests, this is the area one would typically use to write code. Lines can be run by first selecting them (highlighting them with the cursor) and then pressing ctrl + enter (win) or cmd + enter (mac). Alternatively, you can click on the little ‘Run’ button located at the top right corner of the script window;

• Environment: top right; this space displays the set of external elements that have been added. This includes data sets, variables, vectors, functions, etc. This area allows the user to verify that data has been loaded properly;

• Graphical Output: bottom right; this space displays the graphs created during exploratory data analysis, or embedded help on package functions from R’s official documentation.

### 1.2.3 Test, Test, Test!

To make sure you have installed both R and RStudio properly, type a simple command in the console. For example, place your cursor in the pane labelled Console, type x <- 2 + 2 at the prompt, followed by enter or return, then type x, again followed by enter or return.

x <- 2 + 2
x
[1] 4

You should see the value 4 printed to the screen.

### 1.2.4 Customizing RStudio

We would like to suggest the following settings for your R/RStudio installation.

• In RStudio, go to Tools >> Global Options, and make the changes described in Figure 1.6.

[These settings] will cause you some short-term pain, because now when you restart RStudio it will not remember the results of the code that you ran last time. But this short-term pain will save you long-term agony because it forces you to capture all important interactions in your source code. There’s nothing worse than discovering three months after the fact that you’ve only stored the results of an important calculation in your workspace, not the calculation itself in your source code. [1]

Optionally, you could also adjust the font size via Tools >> Global Options >> Appearance >> Editor font size. By default, it is set at 12, but a larger font size may be easier on the eyes.

### 1.2.5 Upgrading R and/or RStudio

We suggest always working with the latest version of R and RStudio.

• To upgrade R, find out the current version of R running on your computer. You can do so from within RStudio, by typing R.version.string in the console. The output should look like this:

[1] "R version 4.1.2 (2021-11-01)"

As of November 2021, the latest version of R is 4.1.2.

If you have an older version installed on your computer, go to https://cloud.r-project.org and follow the steps described in Section 1.2.2 to install the latest version of R.

You can confirm that the upgrade was successful by restarting RStudio and typing R.version.string in the console again.

• To upgrade RStudio from within RStudio, go to Help >> Check for Updates to install a newer version of RStudio (if available). Once both R and RStudio have been upgraded, test the installation by typing a simple command in the console (e.g., as in Section 1.2.3).

### 1.2.6 Basics of R

How are the elements of code (introduced in Code Components) implemented in R? How do they mesh with one another to form interpretable code?

First, we should mention that while R is technically object-oriented, this tends to be hidden in practice; the language is thus especially well-suited for quick, interactive, and intuitive scripting and data exploration.

Note as well that it uses special built-in notation for statistical models, which would not usually be found in other languages (hence the “statistical programming” moniker). Some of the examples and explanations provided below are modified from [10], [15].

The rest of this section contains information on the basic use of R; more examples are available in Section 1.3 and throughout the course notes.

#### Simple Computations in R

To get familiar with the R coding environment, we start by showing how the console can be used as an interactive calculator.

Type the first line of each group in your console, followed by a carriage return to confirm that R works as we would expect of a calculator:

2 + 3
[1] 5
(3*8)/(2*3)
[1] 4
log(12)
[1] 2.484907
sqrt(121)
[1] 11

You can experiment with various combinations of calculations.
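As a starting point for such experiments, here is a short sketch of a few other calculator-style operations available in base R (the particular numbers are arbitrary). Note that log() computes the natural logarithm unless a base is supplied.

```r
# Logarithms: natural by default, other bases on request
nat_log   <- log(12)              # natural logarithm
log_base2 <- log(8, base = 2)     # logarithm in base 2
log_ten   <- log10(100)           # shortcut for base 10

# Powers, integer division, and remainders
power     <- 2^10
quotient  <- 5 %/% 2              # integer division
remainder <- 5 %% 2               # remainder of the division

print(c(nat_log, log_base2, log_ten, power, quotient, remainder))
```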

Should you want to modify or repeat a prior calculation, press the Up Arrow when the cursor is in the console to cycle through previously executed commands; pressing Enter re-runs the selected computation.

On the other hand, you can avoid scrolling through a wall of computations by creating a variable. In R, this is done via the variable assignment symbols <- or =. An assignment does not print anything to the console; the value of a variable is displayed only when the variable is called at the prompt, or when the assignment is surrounded by a pair of parentheses.

x <- 8 + 17
x
[1] 25
(y <- 8 + 17)
[1] 25

Variables can be named using any combination of alphanumeric symbols, but the name has to start with a letter (a-z, A-Z) and cannot contain spaces or punctuation marks other than periods and underscores.
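As a brief illustration of assignment and naming, the sketch below creates two variables (their names are made up for the example) and then uses the base R function make.names() to show how R would repair strings that are not valid variable names.

```r
# Valid names: letters, digits, periods, and underscores, starting with a letter
total_2022 <- 8 + 17
total.doubled <- total_2022 * 2
print(total.doubled)

# make.names() converts arbitrary strings into syntactically valid names
candidates <- c("my var", "2nd_try", "rate-of-change", "ok_name")
fixed <- make.names(candidates)
print(fixed)
```

Spaces and dashes are replaced by periods, and a name starting with a digit receives a leading letter.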

#### R Packages

Packages (or libraries) contain pre-compiled functions and objects that could be useful in specific settings.

To install a package, simply type:

install.packages("package_name")

Take note of the quotation marks. You can type this code directly in the console, followed by a carriage return, or enter it in the script window and click Run in the menu at the top.

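The standard R installation already includes, among others, the following packages: KernSmooth, MASS, boot, class, foreign, lattice, mgcv, nlme, rpart, spatial, survival, base, grDevices, graphics, grid, methods, stats, stats4, tcltk, tools, cluster, nnet, datasets, and splines.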

These packages implement standard statistical functionality, for example linear models, classical tests, a huge collection of high-level plotting functions, and tools for survival analysis.

Once a package is installed, it needs to be loaded before its objects (datasets, functions) can be used. This can be done by typing

library(package_name)

as above (as entering instructions is always done in one of the ways described above, we will stop specifying where and how it must be done). Note the absence of the quotation marks.
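The install-then-load sequence can be sketched as follows. The example assumes the splines package, which ships with R, so that nothing should need to be downloaded; the variable names pkg and loaded are only illustrative.

```r
# "splines" is part of the standard R distribution
pkg <- "splines"

# requireNamespace() returns TRUE if the package is already installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)   # install.packages() expects a character string
}

library(splines)          # library() accepts the bare package name

# Confirm that the package is now attached to the search path
loaded <- "package:splines" %in% search()
print(loaded)
```

Checking with requireNamespace() first avoids re-installing a package that is already available.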

For instance, in Section 1.1.3, we loaded the igraph package to take advantage of the pre-compiled functions sample_gnp(), ecount(), V(), and plot(). The first three functions are not available in the base distribution; the last function plot() does exist, but it would not know how to handle graph objects without the special instructions provided by igraph.

The help file for compiled functions can be displayed in the graphical output window by using the reserved character ?, as below.

?igraph::sample_gnp

In more sophisticated code, it is conceivable that we would want to load multiple libraries. Because we might forget which function is associated with which library, and because different libraries sometimes use the same name for different functions, it is becoming good practice to forgo explicitly loading a library in favour of directly fetching the required functionality (the package must be installed first, however). In R, this is done as follows:

package_name::function_name(function_parameters)

For instance, the graph code from above can be replaced by the following chunk:

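my_graph_function <- function(my_number_nodes, my_colour, my_density) {
  # generate a random G(n, p) graph with the requested number of nodes and edge density
  my_graph <- igraph::sample_gnp(my_number_nodes, my_density, directed = FALSE, loops = FALSE)
  # colour the vertices only if the graph has at least as many edges as nodes
  if (igraph::ecount(my_graph) >= my_number_nodes) {
    igraph::V(my_graph)$color <- my_colour
  }
  plot(my_graph, vertex.color = igraph::V(my_graph)$color)
}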

my_graph_function(30,"green",0.3)  

Note, however, that this is not always a good strategy (in particular, when using the pipeline operator).
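To see why the package_name::function_name() form can be useful, consider the following sketch, in which a user-defined function (hypothetical, created only for this illustration) shares its name with median() from the stats package.

```r
# A user-defined function that shares its name with stats::median()
median <- function(x) "user-defined median() was called"

res_masked   <- median(c(1, 5, 9))         # R finds the user-defined version first
res_explicit <- stats::median(c(1, 5, 9))  # explicitly uses the stats package

print(res_masked)
print(res_explicit)

rm(median)  # remove the user-defined version to restore the usual behaviour
```

The explicit call is unaffected by whatever else happens to be defined or loaded in the session.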

#### R Essentials

Everything you see or create in R is an object: vectors, matrices, data frames, even variables (and functions) are objects.

R allows 5 basic classes of objects:

• Character

• Numeric (real numbers)

• Integer (whole numbers)

• Complex

• Logical (TRUE / FALSE)

Each of these classes has attributes. An object can have the following attributes:

• names, dimension names

• dimensions

• class

• length

• etc.

An object’s various attributes can be accessed using the attributes() function. We will have more to say on this topic.
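As a small illustration (the objects v and m are invented for the example), the sketch below inspects the attributes of a named vector and of a matrix with row and column names.

```r
# A named numeric vector: its only attribute is the set of names
v <- c(a = 1.5, b = 2.5, c = 3.5)
print(attributes(v))
print(length(v))
print(class(v))

# A matrix with dimension names: it carries both dim and dimnames
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("x", "y", "z")))
attr_names <- names(attributes(m))
print(attr_names)
```

Note that length() and class() are queried with their own functions; they do not necessarily appear in the list returned by attributes().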

The most basic R object is the vector. An empty vector can be created using vector(). A vector contains various objects, but all must be of the same class.

Vectors are also often created using the combine (or concatenate) function c() (which makes it a singularly bad idea to use c as a variable name).

(a <- c(1.8, 4.5))                   # numeric
(b <- c(1 + 2i, 3 - 6i))             # complex
(d <- c(23L, 44L))                   # integer (note the L suffix)
(e <- vector("logical", length = 5)) # logical
(f <- c("abc","def"))                # character
[1] 1.8 4.5
[1] 1+2i 3-6i
[1] 23 44
[1] FALSE FALSE FALSE FALSE FALSE
[1] "abc" "def"

Comments can be introduced in R code via the # symbol: all characters following a pound (or sharp) symbol are ignored by R until the end of the line (so the class names above are not part of the code proper).
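The distinction between the numeric and integer classes is easy to miss, since both print the same way. The following sketch (with made-up variable names) shows how to check which one R is actually using.

```r
x_num <- c(23, 44)      # numbers typed without a suffix are stored as doubles
x_int <- c(23L, 44L)    # the L suffix requests integers
x_seq <- 10:16          # the colon operator produces integers here

print(class(x_num))
print(class(x_int))
print(class(x_seq))

# typeof() reports the internal storage type
print(typeof(x_num))
print(is.integer(x_int))
```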

#### R Data Types and Objects

Let us look at some of the various types of R objects.

##### Vectors

As mentioned above, a vector contains objects of the same class. If we attempt to mix objects of different classes in a vector, R resorts to coercion, which has the effect of ‘converting’ the objects to a common class. For instance:

(vec <- c("Time", 25,TRUE,"retro", 2.22))     # coercion to character
(bbb <- c(FALSE, 11))                         # coercion to numeric
(i.a <- c(215,"October"))                     # coercion to character
[1] "Time"  "25"    "TRUE"  "retro" "2.22"
[1]  0 11
[1] "215"     "October"

We can verify the class of these objects using the class() function.

class(vec)
class(bbb)
class(i.a)
[1] "character"
[1] "numeric"
[1] "character"

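To convert the class of a vector, we can use the as.*() family of functions (as.numeric(), as.character(), etc.). Note that the conversion is not stored unless its result is assigned to a variable.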

g <- 10:16      # create a vector of 7 integers
class(g)        # find g's class
as.numeric(g)   # convert to numeric
class(g)
as.character(g) # convert to character
class(g)
[1] "integer"
[1] 10 11 12 13 14 15 16
[1] "integer"
[1] "10" "11" "12" "13" "14" "15" "16"
[1] "integer"

We can change the class of any vector using a similar approach. But be careful: while we can always convert a numeric vector into a character one, going the other way introduces NAs wherever a character string cannot be interpreted as a number (conversion is subject to R’s internal class rules).
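The following sketch illustrates both points: a conversion only persists if it is assigned, and converting character strings to numbers yields NA for entries that do not look like numbers. The vector mixed is invented for the example, and suppressWarnings() is used only to keep the output tidy.

```r
g <- 10:16
g_chr <- as.character(g)        # assign the result to keep the conversion
print(class(g_chr))

back <- as.numeric(g_chr)       # these strings all represent numbers
print(back)

# "abc" cannot be interpreted as a number, so it becomes NA
mixed <- c("10", "3.5", "abc")
converted <- suppressWarnings(as.numeric(mixed))
print(converted)
```

Without suppressWarnings(), R would also issue a warning that NAs were introduced by coercion.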

##### Lists

A list is a special type of object which can contain elements of different data types.

my.list <- list(254,"abab", TRUE, 0 - 3i)
my.list
[[1]]
[1] 254

[[2]]
[1] "abab"

[[3]]
[1] TRUE

[[4]]
[1] 0-3i

The output of a list differs from that of a vector, since its elements can be of different types. The double bracket [[1]] shows the index of the first element, and so on. The elements of a list can be extracted by using the appropriate index:

my.list[[3]]
[1] TRUE

The single bracket [ ] also has a role: it returns a list containing the selected element (a sub-list), rather than the element itself.

my.list[3]
[[1]]
[1] TRUE
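The difference between the two kinds of brackets can be checked with class(), as in the sketch below; the named list named.list is a made-up example showing that elements can also be retrieved by name.

```r
my.list <- list(254, "abab", TRUE, 0 - 3i)

# Double brackets return the element itself; single brackets return a sub-list
print(class(my.list[[3]]))
print(class(my.list[3]))

# List elements can also be named, and then extracted with $ or [[ ]]
named.list <- list(id = 254, label = "abab")
print(named.list$label)
print(named.list[["id"]])
```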

##### Matrices

A vector for which rows and columns are explicitly identified is a matrix, a 2-dimensional data structure. All the entries of a matrix have to be of the same class. The following code produces a 6 by 3 matrix consisting of the first 18 integers.

my.matrix <- matrix(1:18, nrow=6, ncol=3)
my.matrix
     [,1] [,2] [,3]
[1,]    1    7   13
[2,]    2    8   14
[3,]    3    9   15
[4,]    4   10   16
[5,]    5   11   17
[6,]    6   12   18

The dimensions of a matrix can be obtained using either the dim() or attributes() commands (for a matrix created as above, the dimensions are its only attribute; row and column names, if present, would appear as dimnames).

dim(my.matrix)
attributes(my.matrix)
[1] 6 3
$dim
[1] 6 3
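By default, matrix() fills its entries column by column, as seen above. The sketch below shows the row-wise alternative and how individual entries, rows, and columns can be extracted; the object by.row is introduced only for the illustration.

```r
my.matrix <- matrix(1:18, nrow = 6, ncol = 3)                 # filled by column
by.row    <- matrix(1:18, nrow = 6, ncol = 3, byrow = TRUE)   # filled by row

# Entries are accessed as [row, column]
entry_col_fill <- my.matrix[2, 3]
entry_row_fill <- by.row[2, 3]
print(c(entry_col_fill, entry_row_fill))

# Leaving an index blank selects an entire row or column
first_column <- my.matrix[, 1]
second_row   <- my.matrix[2, ]
print(first_column)
print(second_row)

# Transposing swaps the dimensions
print(dim(t(my.matrix)))
```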
age : num 45 41 19 8 5
[1] 5
[1] 2

In the code above, df is the name of the data frame, dim() returns its dimensions, str() its structure (i.e., the list of variables stored in the data frame), and nrow() and ncol() the number of rows and number of columns in the data frame, respectively.

##### Exercises

1. Calculate the following quantities:

• the sum of 1.001, 22.9, and -73.78;

• the square root of 64;

• the base 10 logarithm of 90, multiplied by the cosine of $$\pi$$. Hint: see ?log and ?pi for usage information.

2. Type the following code, which assigns numbers to objects x and y.

x <- 252
y <- 5.5

• Calculate the product of x and y.

• Store the result in a new object called z.

• Inspect your workspace by typing ls(), and by clicking the Environment tab in RStudio, and find the three objects you created.

• Make a vector of the objects x, y, and z.

3. You have measured seven cylinders. Their lengths are: 2.1, 10.8, 5.5, 6.6, 9.7, 8.2, 8.1, and the diameters are: 0.4, 0.3, 1.2, 0.9, 0.3, 0.2, 0.1. Read these data points into two vectors (give the vectors appropriate names). Calculate the volume of each cylinder ($$V=\text{length}\times \pi \times (\text{diameter}/2)^2$$).

4. Input the following data, related to space shuttle launch damage prior to the Challenger explosion. The set covers 6 launches out of 24 that were included in the pre-launch charts used to decide whether to proceed with the launch or not.

| Temp | Erosion | Blowby | Total |
|------|---------|--------|-------|
| 53 | 3 | 2 | 5 |
| 57 | 1 | 0 | 1 |
| 63 | 1 | 0 | 1 |
| 70 | 1 | 0 | 1 |
| 70 | 1 | 0 | 1 |
| 75 | 0 | 2 | 1 |

Enter these data into a data frame, with (for example) column names temperature, erosion, blowby, and total.

#### Reading Data and Writing

Reading data into a statistical system for analysis, and exporting the results to some other system for report writing, can be frustrating tasks that take far more time than the statistical/data analysis itself, but the former task is required if the latter is to be undertaken in earnest.
This section describes the import and export facilities available either in R itself or via packages available from the Comprehensive R Archive Network (CRAN).

R comes with a few data reading functions:

• read.table(), read.csv() for tabular data;

• readLines() for lines of a text file;

• source(), dget() to read R code files (inverse of dump() and dput(), respectively);

• load() to read in saved workspaces;

• unserialize() to read single R objects in binary form.

There are, of course, numerous R packages that have been developed to read in all kinds of other datasets, and you may need to resort to one of these packages if you are working in a specific area.

##### read.table()

The read.table() function is one of the most commonly used functions for reading data. The help file is worth reading in its entirety (run ?read.table in the console), if only because the function gets so much use. Its main arguments are:

• file, the name of a file, or a connection;

• header, logical indicating if the file has a header line;

• sep, string indicating how the columns are separated;

• colClasses, character vector indicating the class of each column in the dataset;

• nrows, number of rows in the dataset (by default read.table() will read the entire file);

• comment.char, character string indicating the comment character (this defaults to “#”);

• skip, the number of lines to skip from the beginning of the file, and

• stringsAsFactors, should character variables be coded as factors? (this defaulted to TRUE prior to R 4.0.0 because, back in the old days, strings usually represented levels of a categorical variable; now that text mining is an everyday occurrence, that is not always the case, and the default has been FALSE since R 4.0.0).
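To make these arguments concrete, here is a minimal sketch that writes a small, made-up data file to a temporary location and reads it back with read.table(). The file contents, the semicolon separator, and the object names (tmp, dat) are assumptions chosen for the illustration.

```r
# Write a small, made-up data file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "# survey extract (this comment line is skipped)",
  "name;age;member",
  "Ana;45;TRUE",
  "Ben;41;FALSE",
  "Cy;19;TRUE"
), tmp)

# Read it back, spelling out the main arguments explicitly
dat <- read.table(
  file = tmp,
  header = TRUE,                  # the first non-comment line holds the names
  sep = ";",                      # columns are separated by semicolons
  colClasses = c("character", "numeric", "logical"),
  comment.char = "#",             # lines starting with # are ignored
  stringsAsFactors = FALSE        # keep strings as character vectors
)
str(dat)

unlink(tmp)  # remove the temporary file
```

Supplying colClasses up front means that R does not have to guess the type of each column.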
For small to moderately sized datasets, you can usually call read.table() without specifying any other arguments:

data <- read.table("foo.txt")

In this case, R will read in the file foo.txt and automatically:

• skip lines that begin with a #;

• figure out how many rows there are (and how much memory needs to be allocated), and

• figure out what type of variable is in each column of the table.

Telling R all these things directly makes R run faster and more efficiently. The read.csv() function is identical to read.table() except that some of the defaults are set differently (such as the sep argument).

With much larger datasets, there are a few things that can be done to prevent R from choking on the data (always a risk as R stores everything in RAM):

• read the help page for read.table(), which contains many hints;

• make a rough calculation of the memory required to store the dataset (an example of how to do this is given below); if the dataset is larger than the amount of RAM on your computer, best to stop here;

• set comment.char = "" if there are no commented lines in your file;

• use the colClasses argument; specifying this option can make read.table() run MUCH faster, often twice as fast (in order to use this option, we must know the class of each column in the data frame; if all of the columns are “numeric”, for example, then we would simply set colClasses = "numeric"). A quick way to figure out the classes of each column is to use the following code:

initial <- read.table("datatable.txt", nrows = 100)
classes <- sapply(initial, class)
tabAll <- read.table("datatable.txt", colClasses = classes)

• set nrows; this doesn’t make R run faster but it helps with memory usage (a mild overestimate is okay; the Unix tool wc can be used to calculate the number of lines in the file).

In general, when using R with larger datasets, it is also useful to know a few things about the operating system:

• how much memory is available on the system?
• what other applications are in use? (close everything that is unnecessary)

• are other users logged into the same system?

• what is the operating system? (some operating systems can limit the amount of memory a single process can access).

For example, suppose we have a data frame with 2,000,000 rows and 100 columns, all of which are numeric data. Roughly speaking, how much memory is required to store this data frame? On most modern computers, numeric data is stored using 64 bits of memory (8 bytes). Given that information, you can perform the following calculation:

$$\begin{aligned} 2{,}000{,}000 \times 100 \times 8 \text{ bytes} &= 1{,}600{,}000{,}000 \text{ bytes} \\ &\approx 1{,}600 \text{ MB} \\ &= 1.6 \text{ GB.} \end{aligned}$$

Reading in a large dataset for which one does not have enough RAM is an easy way to get the computer (or at the very least, the R session) to freeze. This is usually an unpleasant experience that requires killing the R process, in the best case scenario, or rebooting the computer, in the worst case. It is always a good idea to do a rough memory requirements calculation before reading in a large dataset.

##### txt, csv, and Other Formats

• Fixed format text files

df = read.table("dir_location\\file.txt", header=TRUE) # Windows only
df = read.table("dir_location/file.txt", header=TRUE)  # all OS (including Windows)

The forward slash / is supported as a directory delimiter on all operating systems; the double backslash \\ is only supported under Windows. If the first row of the file includes the names of the variables, these entries will be used to create appropriate names for each of the columns in the dataset (reserved characters are changed to ‘.’).

If the first row does not include the names, the header option can be left off (or set to FALSE), and the variables will be named V1, V2, …, Vn.

A limit on the number of lines to be read can be specified through the nrows option. The read.table() function also supports reading from a URL given as the filename, or browsing files interactively using read.table(file.choose()).
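The class-detection shortcut and the rough memory estimate described earlier in this section can be combined as in the following sketch. It relies on a simulated file of 1,000 rows and 3 numeric columns written to a temporary location; the data and the object names (sim, est_bytes, actual_bytes) are assumptions made for the illustration.

```r
# Create a made-up file with 1,000 rows and 3 numeric columns
set.seed(1)
tmp <- tempfile(fileext = ".txt")
sim <- data.frame(x = rnorm(1000), y = runif(1000), z = rnorm(1000))
write.table(sim, tmp, row.names = FALSE)

# Step 1: read only a few rows to discover the column classes
initial <- read.table(tmp, header = TRUE, nrows = 100)
classes <- sapply(initial, class)
print(classes)

# Step 2: read the whole file, supplying the classes and a mild overestimate of nrows
tabAll <- read.table(tmp, header = TRUE, colClasses = classes, nrows = 1100)

# Rough memory estimate: rows x columns x 8 bytes per numeric value
est_bytes    <- nrow(tabAll) * ncol(tabAll) * 8
actual_bytes <- as.numeric(object.size(tabAll))
print(c(estimate = est_bytes, actual = actual_bytes))

unlink(tmp)  # remove the temporary file
```

The size reported by object.size() should be slightly larger than the estimate, since R stores some overhead (names, attributes) on top of the raw values.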

Sometimes data arrives in irregularly-shaped data files (there may be a variable number of fields per line, or some data in the line may describe the remainder of the line). In such cases, a useful generic approach is to read each line into a single character variable, then use character variable functions to extract the contents.

df = readLines("file.txt")
df = scan("file.txt")

The readLines() function returns a character vector with length equal to the number of lines read. A limit on the number of lines to be read can be specified through the n argument. The scan() function returns a vector, with entries separated by white space by default. These functions read by default from standard input, but can also read from a file or URL.
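The "read each line, then extract the contents" approach might look like the sketch below. The file layout is invented for the example: each line holds an identifier, a count, and then a variable number of measurements.

```r
# A made-up irregular file: an ID, a count, then a variable number of values
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID1 3 4.2 5.1 6.0",
             "ID2 1 7.7",
             "ID3 2 0.5 0.9"), tmp)

raw <- readLines(tmp)            # one character string per line
fields <- strsplit(raw, " ")     # split each line on single spaces

# The first field is the identifier; everything after the second is a measurement
ids <- sapply(fields, `[`, 1)
values <- lapply(fields, function(f) as.numeric(f[-(1:2)]))
names(values) <- ids
print(values)

unlink(tmp)  # remove the temporary file
```

The result is a named list, which is a natural container here since each line contributes a different number of values.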

• Comma-separated value (CSV) files

The read.csv() function takes on much the same parameters as read.table().

df = read.csv("dir_location/file.csv")

• Read sheets from an Excel file

If the data is available in an Excel file, various possibilities exist, depending on the spreadsheet format.

df.xls = gdata::read.xls("file.xls", sheet=1)
df.xlsx = xlsx::read.xlsx("file.xlsx", sheetIndex=1)

The sheet can be provided as either a number or a name (make sure that the appropriate packages have been installed beforehand, however).

• Reading datasets in other formats

The dataset of interest sometimes comes from other software. The foreign package is able to do a native import for some of the most common formats: Stata, Epi Info, Minitab, Octave, SPSS, Systat, and SAS files.

df = foreign::read.dbf("filename.dbf")         # DBase
df = foreign::read.epiinfo("filename.epiinfo") # Epi Info
df = foreign::read.mtp("filename.mtp")         # Minitab portable worksheet
df = foreign::read.xport("filename.xport")     # SAS XPORT file
df = foreign::read.systat("filename.sys")      # Systat

There are analogous functions for writing data to files:

• write.table() writes tabular data to text files (e.g., CSV);

• writeLines(), to write character data line-by-line to a file;

• dump(), for dumping a textual representation of multiple R objects;

• dput(), for outputting a textual representation of an R object;

• save(), for saving an arbitrary number of R objects in binary format (possibly compressed) to a file, and

• serialize(), for converting an R object into a binary format for outputting to a file.

There are numerous ways to store data, including structured text file formats like CSV or tab-delimited, or more complex binary formats. Take the time to explore the functionality so that you can achieve your specific aims.
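As a starting point for that exploration, the following sketch performs two round trips on a small, made-up data frame: one through a CSV file with write.table() and read.csv(), and one through a textual representation with dput() and dget(). All file paths are temporary and the object names are illustrative.

```r
# A small, made-up data frame
df <- data.frame(day = c("Mon", "Tue"), count = c(3L, 5L))

# Round trip 1: write a CSV file with write.table(), read it back with read.csv()
csv_path <- tempfile(fileext = ".csv")
write.table(df, csv_path, sep = ",", row.names = FALSE)
df_back <- read.csv(csv_path)
print(df_back)

# Round trip 2: dput() writes R code that recreates the object, dget() evaluates it
txt_path <- tempfile(fileext = ".R")
dput(df, file = txt_path)
df_dget <- dget(txt_path)
print(identical(df, df_dget))

unlink(c(csv_path, txt_path))  # remove the temporary files
```

The dput()/dget() pair preserves the classes of the columns exactly, whereas a CSV round trip relies on R guessing them again when the file is read.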

##### Exercises

1. Read the following data into R (number of honeyeaters seen at the EucFACE site in a week). Give the resulting data frame a reasonable name. Type the data into Excel or a text file, and save it as a CSV or txt file.

| Day | nrbirds |
|-----|---------|
| Sunday | 3 |
| Monday | 2 |
| Tuesday | 5 |
| Wednesday | 0 |
| Thursday | 8 |
| Friday | 1 |
| Saturday | 2 |

Enter the following data as new observations of a different week starting on Sunday: 4, 3, 6, 1, 9, 2, 0.

2. Read the space shuttle launch data (from the previous section) into R.

3. Read the following data set (various Australian populations since 1917) into an R object. Write the object into a text file, from R.

| Year | NSW | Vic. | Qld | SA | WA | Tas. | NT | ACT | Aust. |
|------|-----|------|-----|----|----|------|----|-----|-------|
| 1917 | 1904 | 1409 | 683 | 440 | 306 | 193 | 5 | 3 | 4941 |
| 1927 | 2402 | 1727 | 873 | 565 | 392 | 211 | 4 | 8 | 6182 |
| 1937 | 2693 | 1853 | 993 | 589 | 457 | 233 | 6 | 11 | 6836 |
| 1947 | 2985 | 2055 | 1106 | 646 | 502 | 257 | 11 | 17 | 7579 |
| 1957 | 3625 | 2656 | 1413 | 873 | 688 | 326 | 21 | 38 | 9640 |
| 1967 | 4295 | 3274 | 1700 | 1110 | 879 | 375 | 62 | 103 | 11799 |
| 1977 | 5002 | 3837 | 2130 | 1286 | 1204 | 415 | 104 | 214 | 14192 |
| 1987 | 5617 | 4210 | 2675 | 1393 | 1496 | 449 | 158 | 265 | 16264 |
| 1997 | 6274 | 4605 | 3401 | 1480 | 1798 | 474 | 187 | 310 | 18532 |

### References

[1] H. Wickham and G. Grolemund, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, 2017.

[10] R. Kabacoff, R in Action, 2nd ed. Manning, 2015.

[15] Karthe, Analytics Vidhya, 2016.