1.3 More About Programming in R

Many software packages and libraries are available to the data analyst. R not only has the advantage that we can easily use its available packages, but it provides enough flexibility for the analyst who wants to get dirty with the data.

In this section, you will find examples and tips that highlight R’s data manipulation features. It is not meant to be a complete introduction, or even necessarily a showcase of good programming practices.

1.3.1 Help and Documentation

R’s various help files and demos can be accessed using the following commands (where function_name and search_term correspond to the desired function and/or term):

  • ?function_name

  • example(function_name)

  • args(function_name)

  • ??search_term

For instance, the following code would display the help file for the function glm() in the bottom graphical output window of RStudio:

?glm 

Most help files contain examples showcasing the use of the function. These can be accessed via example().

example(glm)

We can thus copy the code from the example file, and run it directly at the console.

counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
anova(glm.D93)
summary(glm.D93)
  treatment outcome counts
1         1       1     18
2         1       2     17
3         1       3     15
4         2       1     20
5         2       2     10
6         2       3     20
7         3       1     25
8         3       2     13
9         3       3     12
Analysis of Deviance Table

Model: poisson, link: log

Response: counts

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev
NULL                          8    10.5814
outcome    2   5.4523         6     5.1291
treatment  2   0.0000         4     5.1291

Call:
glm(formula = counts ~ outcome + treatment, family = poisson())

Deviance Residuals: 
       1         2         3         4         5         6         7         8  
-0.67125   0.96272  -0.16965  -0.21999  -0.95552   1.04939   0.84715  -0.09167  
       9  
-0.96656  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.045e+00  1.709e-01  17.815   <2e-16 ***
outcome2    -4.543e-01  2.022e-01  -2.247   0.0246 *  
outcome3    -2.930e-01  1.927e-01  -1.520   0.1285    
treatment2   1.338e-15  2.000e-01   0.000   1.0000    
treatment3   1.421e-15  2.000e-01   0.000   1.0000    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 10.5814  on 8  degrees of freedom
Residual deviance:  5.1291  on 4  degrees of freedom
AIC: 56.761

Number of Fisher Scoring iterations: 4

Similarly, the function’s arguments can be accessed via args().

args(glm)
function (formula, family = gaussian, data, weights, subset, 
    na.action, start = NULL, etastart, mustart, offset, control = list(...), 
    model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE, 
    contrasts = NULL, ...) 
NULL

1.3.2 Simple Data Manipulation

So what can we actually do with R?

Loading a Built-In Dataset

We can obtain a list of such datasets in the datsets package by calling the following function:

data()

Or those available in all installed packages via:

data(package = .packages(all.available = TRUE)) 

Let us take a look at the swiss built in dataset (type ?swiss to see the help file). We can display the dataset by simply calling it at the prompt, like so:

swiss
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
Moutier           85.8        36.5          12         7    33.77
Neuveville        76.9        43.5          17        15     5.16
Porrentruy        76.1        35.3           9         7    90.57
Broye             83.8        70.2          16         7    92.85
Glane             92.4        67.8          14         8    97.16
Gruyere           82.4        53.3          12         7    97.67
Sarine            82.9        45.2          16        13    91.38
Veveyse           87.1        64.5          14         6    98.61
Aigle             64.1        62.0          21        12     8.52
Aubonne           66.9        67.5          14         7     2.27
Avenches          68.9        60.7          19        12     4.43
Cossonay          61.7        69.3          22         5     2.82
Echallens         68.3        72.6          18         2    24.20
Grandson          71.7        34.0          17         8     3.30
Lausanne          55.7        19.4          26        28    12.11
La Vallee         54.3        15.2          31        20     2.15
Lavaux            65.1        73.0          19         9     2.84
Morges            65.5        59.8          22        10     5.23
Moudon            65.0        55.1          14         3     4.52
Nyone             56.6        50.9          22        12    15.14
Orbe              57.4        54.1          20         6     4.20
Oron              72.5        71.2          12         1     2.40
Payerne           74.2        58.1          14         8     5.23
Paysd'enhaut      72.0        63.5           6         3     2.56
Rolle             60.5        60.8          16        10     7.72
Vevey             58.3        26.8          25        19    18.46
Yverdon           65.4        49.5          15         8     6.10
Conthey           75.5        85.9           3         2    99.71
Entremont         69.3        84.9           7         6    99.68
Herens            77.3        89.7           5         2   100.00
Martigwy          70.5        78.2          12         6    98.96
Monthey           79.4        64.9           7         3    98.22
St Maurice        65.0        75.9           9         9    99.06
Sierre            92.2        84.6           3         3    99.46
Sion              79.3        63.1          13        13    96.83
Boudry            70.4        38.4          26        12     5.62
La Chauxdfnd      65.7         7.7          29        11    13.79
Le Locle          72.7        16.7          22        13    11.22
Neuchatel         64.4        17.6          35        32    16.92
Val de Ruz        77.6        37.6          15         7     4.97
ValdeTravers      67.6        18.7          25         7     8.65
V. De Geneve      35.0         1.2          37        53    42.34
Rive Droite       44.7        46.6          16        29    50.43
Rive Gauche       42.8        27.7          22        29    58.33
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2
Moutier                  20.3
Neuveville               20.6
Porrentruy               26.6
Broye                    23.6
Glane                    24.9
Gruyere                  21.0
Sarine                   24.4
Veveyse                  24.5
Aigle                    16.5
Aubonne                  19.1
Avenches                 22.7
Cossonay                 18.7
Echallens                21.2
Grandson                 20.0
Lausanne                 20.2
La Vallee                10.8
Lavaux                   20.0
Morges                   18.0
Moudon                   22.4
Nyone                    16.7
Orbe                     15.3
Oron                     21.0
Payerne                  23.8
Paysd'enhaut             18.0
Rolle                    16.3
Vevey                    20.9
Yverdon                  22.5
Conthey                  15.1
Entremont                19.8
Herens                   18.3
Martigwy                 19.4
Monthey                  20.2
St Maurice               17.8
Sierre                   16.3
Sion                     18.1
Boudry                   20.3
La Chauxdfnd             20.5
Le Locle                 18.9
Neuchatel                23.0
Val de Ruz               20.0
ValdeTravers             19.5
V. De Geneve             18.0
Rive Droite              18.2
Rive Gauche              19.3

Or we can take a look at its first or last n entries using the functions head() or tail().

head(swiss,6)
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
Moutier           85.8        36.5          12         7    33.77
Neuveville        76.9        43.5          17        15     5.16
Porrentruy        76.1        35.3           9         7    90.57
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2
Moutier                  20.3
Neuveville               20.6
Porrentruy               26.6

Assigning Data

We can create, assign, and display a vector consisting of a sequence of numbers like this:

(x<- c(1:3))  
[1] 1 2 3

We can also assign non-sequential numbers:

(w <- c(12,-9))   
[1] 12 -9

or mixed objects:

(v = c(w,"pamplemousse"))   
[1] "12"           "-9"           "pamplemousse"

or matrices:

(u = t(matrix(1:10,ncol=5)))
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10

Data Types and Conversion

We can test whether objects are of a certain type or class:

is.numeric(x)
[1] TRUE
is.character(x)
[1] FALSE
is.vector(x)
[1] TRUE
is.matrix(x)
[1] FALSE
is.data.frame(x)
[1] FALSE
is.character(w)
[1] FALSE
is.character(v)
[1] TRUE
is.data.frame(swiss)
[1] TRUE

We can also set an object to be of a specific type:

as.numeric(x)
[1] 1 2 3
as.character(x)
[1] "1" "2" "3"
as.vector(x)
[1] 1 2 3
as.matrix(x)
     [,1]
[1,]    1
[2,]    2
[3,]    3
as.data.frame(x)
  x
1 1
2 2
3 3

Or combine two vectors into a single vector:

c(y,w)
[1] 200 300 400 500 600 700  12  -9

Or convert vectors to matrices or data frames:

cbind(x,y)
     x   y
[1,] 1 200
[2,] 2 300
[3,] 3 400
[4,] 1 500
[5,] 2 600
[6,] 3 700
rbind(x,y)
  [,1] [,2] [,3] [,4] [,5] [,6]
x    1    2    3    1    2    3
y  200  300  400  500  600  700
data.frame(x,y)
  x   y
1 1 200
2 2 300
3 3 400
4 1 500
5 2 600
6 3 700

Conversely, we can convert a matrix to a vector:

as.vector(u)
 [1]  1  3  5  7  9  2  4  6  8 10

or a matrix to a data frame:

as.data.frame(u)
  V1 V2
1  1  2
2  3  4
3  5  6
4  7  8
5  9 10

or a data frame to a matrix:

             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
Moutier           85.8        36.5          12         7    33.77
Neuveville        76.9        43.5          17        15     5.16
Porrentruy        76.1        35.3           9         7    90.57
Broye             83.8        70.2          16         7    92.85
Glane             92.4        67.8          14         8    97.16
Gruyere           82.4        53.3          12         7    97.67
Sarine            82.9        45.2          16        13    91.38
Veveyse           87.1        64.5          14         6    98.61
Aigle             64.1        62.0          21        12     8.52
Aubonne           66.9        67.5          14         7     2.27
Avenches          68.9        60.7          19        12     4.43
Cossonay          61.7        69.3          22         5     2.82
Echallens         68.3        72.6          18         2    24.20
Grandson          71.7        34.0          17         8     3.30
Lausanne          55.7        19.4          26        28    12.11
La Vallee         54.3        15.2          31        20     2.15
Lavaux            65.1        73.0          19         9     2.84
Morges            65.5        59.8          22        10     5.23
Moudon            65.0        55.1          14         3     4.52
Nyone             56.6        50.9          22        12    15.14
Orbe              57.4        54.1          20         6     4.20
Oron              72.5        71.2          12         1     2.40
Payerne           74.2        58.1          14         8     5.23
Paysd'enhaut      72.0        63.5           6         3     2.56
Rolle             60.5        60.8          16        10     7.72
Vevey             58.3        26.8          25        19    18.46
Yverdon           65.4        49.5          15         8     6.10
Conthey           75.5        85.9           3         2    99.71
Entremont         69.3        84.9           7         6    99.68
Herens            77.3        89.7           5         2   100.00
Martigwy          70.5        78.2          12         6    98.96
Monthey           79.4        64.9           7         3    98.22
St Maurice        65.0        75.9           9         9    99.06
Sierre            92.2        84.6           3         3    99.46
Sion              79.3        63.1          13        13    96.83
Boudry            70.4        38.4          26        12     5.62
La Chauxdfnd      65.7         7.7          29        11    13.79
Le Locle          72.7        16.7          22        13    11.22
Neuchatel         64.4        17.6          35        32    16.92
Val de Ruz        77.6        37.6          15         7     4.97
ValdeTravers      67.6        18.7          25         7     8.65
V. De Geneve      35.0         1.2          37        53    42.34
Rive Droite       44.7        46.6          16        29    50.43
Rive Gauche       42.8        27.7          22        29    58.33
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2
Moutier                  20.3
Neuveville               20.6
Porrentruy               26.6
Broye                    23.6
Glane                    24.9
Gruyere                  21.0
Sarine                   24.4
Veveyse                  24.5
Aigle                    16.5
Aubonne                  19.1
Avenches                 22.7
Cossonay                 18.7
Echallens                21.2
Grandson                 20.0
Lausanne                 20.2
La Vallee                10.8
Lavaux                   20.0
Morges                   18.0
Moudon                   22.4
Nyone                    16.7
Orbe                     15.3
Oron                     21.0
Payerne                  23.8
Paysd'enhaut             18.0
Rolle                    16.3
Vevey                    20.9
Yverdon                  22.5
Conthey                  15.1
Entremont                19.8
Herens                   18.3
Martigwy                 19.4
Monthey                  20.2
St Maurice               17.8
Sierre                   16.3
Sion                     18.1
Boudry                   20.3
La Chauxdfnd             20.5
Le Locle                 18.9
Neuchatel                23.0
Val de Ruz               20.0
ValdeTravers             19.5
V. De Geneve             18.0
Rive Droite              18.2
Rive Gauche              19.3

Writing Function

One of R’s most advantageous feature is its flexibility: what if we want to write our own functions? The template for all functions is a block of code that looks like:

my.function <- function(arg1,arg2, ..., argn) {
     # what my.function does, typically involving the arguments
}

Here are some (truly) simple examples: here is a function, my.product(), that computes the product of two arguments \(x\) and \(y\).12

my.product <- function (x,y) {
    x*y
}

Note that the function definition must be compiled (the code must be run) before it can be called in the code.

There are multiple ways to call my.product() for arguments x=12 and y=-2.

my.product(x=12,y=-2)
my.product(y=-2,x=12)
my.product(12,-2)
my.product(-2,12) 
[1] -24
[1] -24
[1] -24
[1] -24

The first two calls reflect better programming practices. The last of those is acceptable because multiplication is commutative, but it is risky to play with the arguments this way.

For instance, consider another simple function my.quotient():

my.quotient <- function (x,y) {
    x/y
}

We call my.quotient() on x=12 and y=-2.

my.quotient(x=12,y=-2)
my.quotient(y=-2,x=12)
my.quotient(12,-2)
[1] -6
[1] -6
[1] -6

but

my.quotient(-2,12)
[1] -0.1666667

When the parameters are not specified in the function call, their implied order reverts to the declared order in the definition (1st = \(x\), 2nd = \(y\)).

And what might we expect to happen with this call?

my.quotient(12,0)
[1] Inf

1.3.3 Exploring Data

R is good tool for data exploration. Let us examine the swiss dataset in detail.

We start by displaying the first few rows of the dataset (10, in this case):

head(swiss,10)
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
Moutier           85.8        36.5          12         7    33.77
Neuveville        76.9        43.5          17        15     5.16
Porrentruy        76.1        35.3           9         7    90.57
Broye             83.8        70.2          16         7    92.85
Glane             92.4        67.8          14         8    97.16
Gruyere           82.4        53.3          12         7    97.67
Sarine            82.9        45.2          16        13    91.38
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2
Moutier                  20.3
Neuveville               20.6
Porrentruy               26.6
Broye                    23.6
Glane                    24.9
Gruyere                  21.0
Sarine                   24.4

We could also display the last few entries (6, here):

tail(swiss,6)
             Fertility Agriculture Examination Education Catholic
Neuchatel         64.4        17.6          35        32    16.92
Val de Ruz        77.6        37.6          15         7     4.97
ValdeTravers      67.6        18.7          25         7     8.65
V. De Geneve      35.0         1.2          37        53    42.34
Rive Droite       44.7        46.6          16        29    50.43
Rive Gauche       42.8        27.7          22        29    58.33
             Infant.Mortality
Neuchatel                23.0
Val de Ruz               20.0
ValdeTravers             19.5
V. De Geneve             18.0
Rive Droite              18.2
Rive Gauche              19.3

We can also get an idea as to the dataset’s structure with str():

str(swiss)
'data.frame':   47 obs. of  6 variables:
 $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
 $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
 $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
 $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
 $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
 $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

We can extract the column names with the function colnames():

colnames(swiss)
[1] "Fertility"        "Agriculture"      "Examination"      "Education"       
[5] "Catholic"         "Infant.Mortality"

or display a specific column of the data frame, say Education, with the $ operator:

swiss$Education   
 [1] 12  9  5  7 15  7  7  8  7 13  6 12  7 12  5  2  8 28 20  9 10  3 12  6  1
[26]  8  3 10 19  8  2  6  2  6  3  9  3 13 12 11 13 32  7  7 53 29 29

This cannot be done with a matrix, however:

swiss_matrix$Education

To extract the Education column from a matrix, identify its column index and use this, instead:

swiss_matrix[,4]
  Courtelary     Delemont Franches-Mnt      Moutier   Neuveville   Porrentruy 
          12            9            5            7           15            7 
       Broye        Glane      Gruyere       Sarine      Veveyse        Aigle 
           7            8            7           13            6           12 
     Aubonne     Avenches     Cossonay    Echallens     Grandson     Lausanne 
           7           12            5            2            8           28 
   La Vallee       Lavaux       Morges       Moudon        Nyone         Orbe 
          20            9           10            3           12            6 
        Oron      Payerne Paysd'enhaut        Rolle        Vevey      Yverdon 
           1            8            3           10           19            8 
     Conthey    Entremont       Herens     Martigwy      Monthey   St Maurice 
           2            6            2            6            3            9 
      Sierre         Sion       Boudry La Chauxdfnd     Le Locle    Neuchatel 
           3           13           12           11           13           32 
  Val de Ruz ValdeTravers V. De Geneve  Rive Droite  Rive Gauche 
           7            7           53           29           29 

Just as one would expect from the behaviour of colnames(), rownames() extracts the data frame’s row names:

rownames(swiss)   
 [1] "Courtelary"   "Delemont"     "Franches-Mnt" "Moutier"      "Neuveville"  
 [6] "Porrentruy"   "Broye"        "Glane"        "Gruyere"      "Sarine"      
[11] "Veveyse"      "Aigle"        "Aubonne"      "Avenches"     "Cossonay"    
[16] "Echallens"    "Grandson"     "Lausanne"     "La Vallee"    "Lavaux"      
[21] "Morges"       "Moudon"       "Nyone"        "Orbe"         "Oron"        
[26] "Payerne"      "Paysd'enhaut" "Rolle"        "Vevey"        "Yverdon"     
[31] "Conthey"      "Entremont"    "Herens"       "Martigwy"     "Monthey"     
[36] "St Maurice"   "Sierre"       "Sion"         "Boudry"       "La Chauxdfnd"
[41] "Le Locle"     "Neuchatel"    "Val de Ruz"   "ValdeTravers" "V. De Geneve"
[46] "Rive Droite"  "Rive Gauche" 

The summary statistics (5-pt summary + mean + number of missing variables for numerical variables; frequency table for others) can be obtained for all data frame’s variables simultaneously:

summary(swiss)   
   Fertility      Agriculture     Examination      Education    
 Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
 1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
 Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
 Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
 3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
 Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
    Catholic       Infant.Mortality
 Min.   :  2.150   Min.   :10.80   
 1st Qu.:  5.195   1st Qu.:18.15   
 Median : 15.140   Median :20.00   
 Mean   : 41.144   Mean   :19.94   
 3rd Qu.: 93.125   3rd Qu.:21.70   
 Max.   :100.000   Max.   :26.60   

More in-depth statistics are available with psych’s describe():

psych::describe(swiss)
                 vars  n  mean    sd median trimmed   mad   min   max range
Fertility           1 47 70.14 12.49  70.40   70.66 10.23 35.00  92.5 57.50
Agriculture         2 47 50.66 22.71  54.10   51.16 23.87  1.20  89.7 88.50
Examination         3 47 16.49  7.98  16.00   16.08  7.41  3.00  37.0 34.00
Education           4 47 10.98  9.62   8.00    9.38  5.93  1.00  53.0 52.00
Catholic            5 47 41.14 41.70  15.14   39.12 18.65  2.15 100.0 97.85
Infant.Mortality    6 47 19.94  2.91  20.00   19.98  2.82 10.80  26.6 15.80
                  skew kurtosis   se
Fertility        -0.46     0.26 1.82
Agriculture      -0.32    -0.89 3.31
Examination       0.45    -0.14 1.16
Education         2.27     6.14 1.40
Catholic          0.48    -1.67 6.08
Infant.Mortality -0.33     0.78 0.42

The correlation matrix is obtained pretty much as one would expect:

cor(swiss)  
                  Fertility Agriculture Examination   Education   Catholic
Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847
Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951
Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418
Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589
Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000
Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959
                 Infant.Mortality
Fertility              0.41655603
Agriculture           -0.06085861
Examination           -0.11402160
Education             -0.09932185
Catholic               0.17549591
Infant.Mortality       1.00000000

We can obtain the data frame’s number of rows:

nrow(swiss)
[1] 47

or the summary of a single variable:

summary(swiss$Fertility)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  35.00   64.70   70.40   70.14   78.45   92.50 

We can also find all observations for which a feature takes on a value greater than a certain threshold, say:

swiss$Fertility>50
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

or provide summary information for the logical vector:

summary(swiss$Fertility>50)
   Mode   FALSE    TRUE 
logical       3      44 
table(swiss$Fertility>50)

FALSE  TRUE 
    3    44 

The logical vector can be used as an index: for instance, here is the dataset only for those observations where Fertility was greater than 50.

swiss[swiss$Fertility>50,]

with

nrow(swiss[swiss$Fertility>50,])
[1] 44

We could also replace the threshold; for instance, here is the dataset for observations data where Fertility is in the top 50%:

swiss[swiss$Fertility>median(swiss$Fertility),]
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
Moutier           85.8        36.5          12         7    33.77
Neuveville        76.9        43.5          17        15     5.16
Porrentruy        76.1        35.3           9         7    90.57
Broye             83.8        70.2          16         7    92.85
Glane             92.4        67.8          14         8    97.16
Gruyere           82.4        53.3          12         7    97.67
Sarine            82.9        45.2          16        13    91.38
Veveyse           87.1        64.5          14         6    98.61
Grandson          71.7        34.0          17         8     3.30
Oron              72.5        71.2          12         1     2.40
Payerne           74.2        58.1          14         8     5.23
Paysd'enhaut      72.0        63.5           6         3     2.56
Conthey           75.5        85.9           3         2    99.71
Herens            77.3        89.7           5         2   100.00
Martigwy          70.5        78.2          12         6    98.96
Monthey           79.4        64.9           7         3    98.22
Sierre            92.2        84.6           3         3    99.46
Sion              79.3        63.1          13        13    96.83
Le Locle          72.7        16.7          22        13    11.22
Val de Ruz        77.6        37.6          15         7     4.97
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2
Moutier                  20.3
Neuveville               20.6
Porrentruy               26.6
Broye                    23.6
Glane                    24.9
Gruyere                  21.0
Sarine                   24.4
Veveyse                  24.5
Grandson                 20.0
Oron                     21.0
Payerne                  23.8
Paysd'enhaut             18.0
Conthey                  15.1
Herens                   18.3
Martigwy                 19.4
Monthey                  20.2
Sierre                   16.3
Sion                     18.1
Le Locle                 18.9
Val de Ruz               20.0

or, solely the Fertility and Education variables for observations where Fertility is in the top 50%:

swiss[swiss$Fertility>median(swiss$Fertility),c("Fertility","Education")]
             Fertility Education
Courtelary        80.2        12
Delemont          83.1         9
Franches-Mnt      92.5         5
Moutier           85.8         7
Neuveville        76.9        15
Porrentruy        76.1         7
Broye             83.8         7
Glane             92.4         8
Gruyere           82.4         7
Sarine            82.9        13
Veveyse           87.1         6
Grandson          71.7         8
Oron              72.5         1
Payerne           74.2         8
Paysd'enhaut      72.0         3
Conthey           75.5         2
Herens            77.3         2
Martigwy          70.5         6
Monthey           79.4         3
Sierre            92.2         3
Sion              79.3        13
Le Locle          72.7        13
Val de Ruz        77.6         7

or those observations for which Fertility was maximal:

swiss[swiss$Fertility == max(swiss$Fertility),]
             Fertility Agriculture Examination Education Catholic
Franches-Mnt      92.5        39.7           5         5     93.4
             Infant.Mortality
Franches-Mnt             20.2
Exercises
  1. What do you think the following calls do?
swiss$var1 <- swiss[,1]>median(swiss[,1])  
swiss$var4 <- swiss[,4]>median(swiss[,4])   
table(swiss$var1)   
table(swiss$var4)   
table(swiss$var1,swiss$var4)   

1.3.4 A Word About NAs

NA values in R can create some havoc. Be careful!

To illustrate some of the issues, create a dataset by sampling 100 values (with replacement) among the values \(\{1,2,3,4,\text{NA}\}\).

test = sample(c(1:4,NA),100, replace=TRUE)   

We can summarize test as follows:

summary(test)   # 5pt summary + mean + number of NAs
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.500   3.000   2.549   3.500   4.000      29 

We can read the mean from the output, or we could try to compute it directly, using mean():

mean(test)   
[1] NA

What is happening? The function mean() does not know how it should handle the NA values; without further guidance, it elects to throw everything akimbo.

Compare with:

mean(test, na.rm=TRUE)   
[1] 2.549296
Exercises
  1. What do you think the following calls do?
median(test, na.rm=TRUE)   
min(test, na.rm=TRUE)   
max(test, na.rm=TRUE)  
quantile(test, na.rm=TRUE)   

1.3.5 Loops and Conditional Statements

R allows for flow control through loops and conditional statements:

  • if() and ifelse() – when a condition holds, do thing 1, when it does not, do thing 2;

  • for() – iterate a procedure for a fixed number of steps;

  • while() – repeat steps as long as some condition holds.

High-level interpreted languages (like R) are slower than low-level/compiled languages.

To get around this issue, interpreted languages will sometimes hand off (“behind the scenes”, so to speak) some operations to functions written in lower-level languages (like C).

In order to take advantage of this, certain programming strategies are recommended when working with list, vectors, arrays, data frames, and so on.

In particular, we try to avoid cycling through each item of a list, and instead use special functions that map a chosen function or operation to every item in the list (in R, this can be done with the apply family of functions, among others).

This can run counter to habits gained when learning other languages, in which for and while loops, for instance, might have been emphasized.

As such, we elect NOT to introduce for and while loops at this stage. The syntax is rather intuitive and will be easy to understand when we encounter it in examples.

The ifelse() statement is quite powerful and can speed-up and simplify data frame operations, however, and we take the time to illustrate how it can be used.

We can easily create a new swiss column determining whether the Fertility variable, say, is above a certain threshold (in which case it should take the value 1) or not (0):

swiss$threshold <-ifelse(swiss$Fertility>50,1,0)

There will be other opportunities to use these functions; the best way to get the hang of R is to practice and debug.