8.7 Example: Algae Blooms

This example is based on a Case Study by L.Torgo [74]. It provides a concrete illustration of the data preparation process on a realistic dataset: algae_blooms.csv, which is also available at the UCI Machine Learning Repository.

The ultimate problem is to predict the occurrence of harmful algae in water samples. Torgo also uses it to highlight various aspects of data exploration, data cleaning, and R syntax.

Readers who would prefer to try this example on their own are invited to skip this section and head to the first exercise of Section 8.8.

8.7.1 Problem Description

The ability to monitor and perform early forecasts of various river algae blooms is crucial to control the ecological harm they can cause.

The dataset which is used to train the learning model consists of:

  • chemical properties of various water samples of European rivers

  • the quantity of seven algae in each of the samples, and

  • the characteristics of the collection process for each sample.

What is the data science motivation for such a model? After all, we can analyze water samples to determine if various harmful algae are present or absent.

The answer is simple: chemical monitoring is cheap and easy to automate, whereas biological analysis of samples is expensive and slow.

Another answer is that analyzing the samples for harmful content does not provide a better understanding of algae drivers: it just tells us which samples contain algae.

8.7.2 Loading the Data

Before we can take a look at the data and begin the process in earnest, we need to load it in the in the R workspace. The dataset was downloaded from the UCI ML repository and stored in the CSV file algae_blooms.csv (in the data directory).

algae_blooms<-read.csv("data/algae_blooms.csv", sep=",", header=TRUE)

It is also available in Torgo’s DMwR package. As we will use some of its functions in this example, we will have to install and load that package itself. Unfortunately, it cannot be installed directly from CRAN with the install.packages() function.

Instead, we suggest doing the following:

  1. Download the package source DMwR_0.4.1.tar.gz from the DMwR CRAN archive page (additional information about the package is also available there) and save it locally to some path (in this example, the file was saved to the folder docs/code/).

  2. Install the following dependencies directly from CRAN:

install.packages(c("xts","quantmod","ROCR"))

The dependencies list might be different, based on the packages already installed locally; any eventual error message in the next step will inform you of the exact dependencies to install.

  1. Install DMwR from the package source:
install.packages("docs/code/DMwR_0.4.1.tar.gz", repos=NULL, type="source")
  1. Load the package:
algae_blooms <- as.data.frame(rbind(DMwR::algae,DMwR::algae.sols))

We can get a sense for the data frame’s structure by calling the str function.

str(algae_blooms)
'data.frame':   340 obs. of  18 variables:
 $ season: chr  "winter" "spring" "autumn" "spring" ...
 $ size  : chr  "small" "small" "small" "small" ...
 $ speed : chr  "medium" "medium" "medium" "medium" ...
 $ mxPH  : num  8 8.35 8.1 8.07 8.06 8.25 8.15 8.05 8.7 7.93 ...
 $ mnO2  : num  9.8 8 11.4 4.8 9 13.1 10.3 10.6 3.4 9.9 ...
 $ Cl    : num  60.8 57.8 40 77.4 55.4 ...
 $ NO3   : num  6.24 1.29 5.33 2.3 10.42 ...
 $ NH4   : num  578 370 346.7 98.2 233.7 ...
 $ oPO4  : num  105 428.8 125.7 61.2 58.2 ...
 $ PO4   : num  170 558.8 187.1 138.7 97.6 ...
 $ Chla  : num  50 1.3 15.6 1.4 10.5 ...
 $ a1    : num  0 1.4 3.3 3.1 9.2 15.1 2.4 18.2 25.4 17 ...
 $ a2    : num  0 7.6 53.6 41 2.9 14.6 1.2 1.6 5.4 0 ...
 $ a3    : num  0 4.8 1.9 18.9 7.5 1.4 3.2 0 2.5 0 ...
 $ a4    : num  0 1.9 0 0 0 0 3.9 0 0 2.9 ...
 $ a5    : num  34.2 6.7 0 1.4 7.5 22.5 5.8 5.5 0 0 ...
 $ a6    : num  8.3 0 0 0 4.1 12.6 6.8 8.7 0 0 ...
 $ a7    : num  0 2.1 9.7 1.4 1 2.9 0 0 0 1.7 ...

Evidently, algae_blooms is a data frame with 340 observations of 18 variables each.

Notes:

  • 3 of the fields are categorical (season, size, speed, which refer to the data collection process)

  • of the numerical fields, 8 have names that sound vaguely “chemical”

  • presumably, the remaining fields refer to the algae blooms

We can get a better feel for the data frame by observing it in its natural habitat, so to speak, using the head() or tail() functions.

head(algae_blooms)
season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla a1 a2 a3 a4 a5 a6 a7
winter small medium 8.00 9.8 60.800 6.238 578.000 105.000 170.000 50.0 0.0 0.0 0.0 0.0 34.2 8.3 0.0
spring small medium 8.35 8.0 57.750 1.288 370.000 428.750 558.750 1.3 1.4 7.6 4.8 1.9 6.7 0.0 2.1
autumn small medium 8.10 11.4 40.020 5.330 346.667 125.667 187.057 15.6 3.3 53.6 1.9 0.0 0.0 0.0 9.7
spring small medium 8.07 4.8 77.364 2.302 98.182 61.182 138.700 1.4 3.1 41.0 18.9 0.0 1.4 0.0 1.4
autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.580 10.5 9.2 2.9 7.5 0.0 7.5 4.1 1.0
winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.667 28.4 15.1 14.6 1.4 0.0 22.5 12.6 2.9

8.7.3 Summary and Visualization

As it happens, we are not given an awful lot of information about the dataset’s domain (I, for one, remain woefully ill-prepared to deal with matters of a chemical nature, to my eternal shame).

Data exploration, in the form of summaries and visualization, can provide a better handle on the problem at hand.

A call to the summary function provides frequency counts for categorical variables, and 6-pt summaries for numerical variables. As a bonus, the number of missing values is also tabulated.145

IMPORTANT NOTE: we may have given you the impression (just now) that exploration is only really necessary when domain expertise escapes us. Domain expertise can help analysts frame the problem and the analysis results in the appropriate manner, but it often also gives them a false sense of security. Errors can creep anywhere – data exploration at an early stage may save you a lot of embarrassing back-tracking at a later stage.

summary(algae_blooms)
season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla a1 a2 a3 a4 a5 a6 a7
Length:340 Length:340 Length:340 Min. :5.600 Min. : 1.500 Min. : 0.222 Min. : 0.000 Min. : 5.00 Min. : 1.00 Min. : 1.0 Min. : 0.200 Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
Class :character Class :character Class :character 1st Qu.:7.750 1st Qu.: 7.925 1st Qu.: 10.994 1st Qu.: 1.147 1st Qu.: 37.86 1st Qu.: 13.00 1st Qu.: 40.0 1st Qu.: 2.133 1st Qu.: 1.50 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Mode :character Mode :character Mode :character Median :8.045 Median : 9.700 Median : 32.470 Median : 2.356 Median : 107.36 Median : 37.24 Median : 101.5 Median : 5.111 Median : 7.10 Median : 2.800 Median : 1.400 Median : 0.00 Median : 2.200 Median : 0.000 Median : 0.000
NA NA NA Mean :7.997 Mean : 9.157 Mean : 42.517 Mean : 3.121 Mean : 471.73 Mean : 73.09 Mean : 136.7 Mean : 12.796 Mean :16.70 Mean : 7.201 Mean : 3.904 Mean : 1.81 Mean : 5.516 Mean : 6.411 Mean : 2.206
NA NA NA 3rd Qu.:8.393 3rd Qu.:10.800 3rd Qu.: 57.750 3rd Qu.: 4.147 3rd Qu.: 244.90 3rd Qu.: 88.11 3rd Qu.: 200.2 3rd Qu.: 17.200 3rd Qu.:25.18 3rd Qu.:10.150 3rd Qu.: 4.625 3rd Qu.: 2.30 3rd Qu.: 8.000 3rd Qu.: 7.025 3rd Qu.: 2.200
NA NA NA Max. :9.700 Max. :13.400 Max. :391.500 Max. :45.650 Max. :24064.00 Max. :1435.00 Max. :1690.0 Max. :110.456 Max. :89.80 Max. :72.600 Max. :42.800 Max. :44.60 Max. :61.100 Max. :77.600 Max. :31.600
NA NA NA NA’s :2 NA’s :2 NA’s :16 NA’s :2 NA’s :2 NA’s :2 NA’s :7 NA’s :23 NA NA NA NA NA NA NA

Notes:

  • The chemical variables all have missing values, ranging from only 2 to 7, 16, and 23.

  • The observations seem fairly uniformly distributed in terms of the seasons, but large rivers and low speed rivers are not represented as often as their counterparts.

  • All numerical values are non-negative, which makes sense in the context of concentrations

  • We do not know what the range of the chemical values should take in a real-world context, but some of the maximum values seem … unrealistic (NH4!!, oPO4, a7, etc.)

  • Does anything else jump at you?

Of course, these summaries each apply to a single variable (1-way tables). Can we spot anything else using \(n\)-way tables (on the categorical variables)?

# 2-way tables
table(algae_blooms$speed,algae_blooms$size)
table(algae_blooms$speed,algae_blooms$season)
table(algae_blooms$season,algae_blooms$size)

# 3-way table
table(algae_blooms$season,algae_blooms$size,algae_blooms$speed)
        
         large medium small
  high      13     56    73
  low       32     24     2
  medium    38     56    46
        
         autumn spring summer winter
  high       32     34     38     38
  low        16     13     12     17
  medium     32     37     36     35
        
         large medium small
  autumn    19     33    28
  spring    21     34    29
  summer    19     36    31
  winter    24     33    33
, ,  = high

        
         large medium small
  autumn     3     13    16
  spring     3     14    17
  summer     3     16    19
  winter     4     13    21

, ,  = low

        
         large medium small
  autumn     9      6     1
  spring     7      6     0
  summer     6      6     0
  winter    10      6     1

, ,  = medium

        
         large medium small
  autumn     7     14    11
  spring    11     14    12
  summer    10     14    12
  winter    10     14    11

The 6-pt summary provides some information about the underlying distribution, but not much on the parametric front. A more traditional summary can be displayed using the psych library’s describe function.

psych::describe(algae_blooms)
vars n mean sd median trimmed mad min max range skew kurtosis se
season* 1 340 2.547059 1.1186895 3.0000 2.558823 1.482600 1.000 4.000 3.000 -0.0544729 -1.3652641 0.0606695
size* 2 340 2.111765 0.7676208 2.0000 2.139706 1.482600 1.000 3.000 2.000 -0.1915031 -1.2876572 0.0416301
speed* 3 340 1.994118 0.9120437 2.0000 1.992647 1.482600 1.000 3.000 2.000 0.0115387 -1.8012592 0.0494625
mxPH 4 338 7.997293 0.5783188 8.0450 8.035864 0.467019 5.600 9.700 4.100 -0.8402202 1.9865897 0.0314564
mnO2 5 338 9.156716 2.3130799 9.7000 9.388198 1.927380 1.500 13.400 11.900 -0.9392605 0.5497464 0.1258150
Cl 6 324 42.517246 44.4906037 32.4700 34.997591 33.478591 0.222 391.500 391.278 2.8485675 14.0485533 2.4717002
NO3 7 338 3.120784 3.2851622 2.3555 2.699504 2.181646 0.000 45.650 45.650 6.7874378 81.1290663 0.1786893
NH4 8 338 471.734411 1739.0774580 107.3570 156.746838 119.949753 5.000 24064.000 24059.000 9.1184171 105.4980942 94.5933434
oPO4 9 338 73.091882 114.1420517 37.2430 51.792802 43.460936 1.000 1435.000 1434.000 5.9906708 60.0469965 6.2085091
PO4 10 333 136.685699 149.4773125 101.4550 115.867652 114.999352 1.000 1690.000 1689.000 4.2262442 35.1788385 8.1913063
Chla 11 317 12.796196 18.0813363 5.1110 8.782608 6.094969 0.200 110.456 110.256 2.6186753 7.8575379 1.0155490
a1 12 340 16.701765 20.9987076 7.1000 12.593750 10.526460 0.000 89.800 89.800 1.5040883 1.4996118 1.1388148
a2 13 340 7.200882 10.7549412 2.8000 4.854412 4.151280 0.000 72.600 72.600 2.3280170 6.7216331 0.5832686
a3 14 340 3.904412 6.4205247 1.4000 2.358456 2.075640 0.000 42.800 42.800 2.4150758 6.7202844 0.3482018
a4 15 340 1.810000 3.8292948 0.0000 1.036765 0.000000 0.000 44.600 44.600 5.9516431 52.8944003 0.2076727
a5 16 340 5.515588 8.4186630 2.2000 3.692279 3.261720 0.000 61.100 61.100 2.7562479 10.4260075 0.4565661
a6 17 340 6.411471 12.3978237 0.0000 3.279779 0.000000 0.000 77.600 77.600 2.8805737 9.2050442 0.6723664
a7 18 340 2.206471 4.9472217 0.0000 1.040074 0.000000 0.000 31.600 31.600 4.0903606 18.5615144 0.2683008

Notes:

  • the categorical variables are marked by an asterisk *; the levels are coded with an integer, and treated as numerical variables for the purpose of the analysis, so the results for these fields are meaningless

  • the trimmed variable refers to the trimmed mean, the mean obtained when a certain percentage of the observations are removed from both end of the spectrum (what percentage, exactly?)

  • the mad variable refers to the median absolute deviation (from the median)

We personally find such a table hard to read and really grasp once there are more than a few variables in the dataset. Visualization comes in handy in such cases.

Basic histograms can be constructed with the hist() function.

hist(algae_blooms$mnO2)

Based on this histogram, we can conclude that the underlying distribution of mnO2 has a negative skew, say, which is confirmed by the table above.

While mnO2 clearly does not follow a normal distribution (no negative value, high skew), we see that such an approximation wouldn’t be so bad compared to a1.

hist(algae_blooms$a1)

\(qq-\)plots, another traditional statistical plot, can be produced with the car libary’s qqPlot() function.

car::qqPlot(algae_blooms$mnO2, main='Normal QQ plot for minimum values of O2', ylab="")

[1] 69 70

Again, we can see that the normal distribution is not a good fit for mnO2, but that it is even worse for a1 (see below).

car::qqPlot(algae_blooms$a1, main='Normal QQ plot for a1', ylab="")

[1]  49 118

Let us take a look at some of the odd values for NH4.

library(ggplot2)
ggplot(algae_blooms,aes(x=factor(0),y=NH4)) +   # plotting NH4 from the algae_blooms dataset ...
    geom_boxplot() +                            # as a boxplot ... 
    geom_rug() +                                # with a rug on which the true values are shown ...
    geom_hline(aes(yintercept=mean(algae_blooms$NH4, na.rm=TRUE)), linetype=2, colour="pink") + # and a horizontal line showing where the mean NH4 value falls ...
    ylab("Ammonium (NH4+)") +                   
    xlab("") +
    scale_x_discrete(breaks=NULL)

We see that there are a string of values falling way above the boxplot. If the underlying distribution was normal, say, these would definitely be considered outliers.

Let us investigate further.

plot(algae_blooms$NH4, xlab="", ylab="Ammonium (NH4+)") # scatter plot of NH4 against observation number
abline(h=mean(algae_blooms$NH4, na.rm=TRUE), lty=1)     # mean value of NH4, solid line
abline(h=mean(algae_blooms$NH4, na.rm=TRUE) + sd(algae_blooms$NH4, na.rm=TRUE), lty=2) # mean + sd value of NH4, dashed line
abline(h=median(algae_blooms$NH4, na.rm=TRUE), lty=3)   # median value of NH4, tight dashed line 

We can also look at the data and see which observations have values of NH4 below 3000 (roughly all values below the long dashed line above)

nrow(algae_blooms[-which(algae_blooms$NH4>3000),])  # seeing how many are being kept
[1] 329

What does the boxplot above looks like without the suspected outliers?

ggplot(algae_blooms[-which(algae_blooms$NH4>3000),],aes(x=factor(0),y=NH4)) + 
    geom_boxplot() + 
    geom_rug() +
    geom_hline(aes(yintercept=mean(algae_blooms[-which(algae_blooms$NH4>3000),8], na.rm=TRUE)), 
               linetype=2, colour="pink") +
    ylab("Ammonium (NH4+)") +
    xlab("") +
    scale_x_discrete(breaks=NULL)

It is a bit better, to be sure (the box structure has expanded). There still seems to be some very high values, however. Perhaps that’s to be expected? How would we find out?

Now, let us take a look at some of the algae levels.

ggplot(algae_blooms,aes(x=season,y=a3)) +   # plot a3 by season ...
    geom_boxplot() +                        # in a series of boxplots ...
    xlab("Season") +                        # with x axis as Seasons and y axis as a3
    ylab("Algae Level a3")

What does that tell you? Is it hard to get a good handle on the situation because the season are out of sequential order?

We can re-arrange the factors, but it requires a bit of fancy footwork using the forcats’library fct_relevel() function, and dplyr’s mutate().

library(forcats) # for fct_relevel
library(dplyr)   # for mutate
algae_blooms = mutate(algae_blooms, 
                     size=fct_relevel(size,c("small","medium","large")), # factors should appear in the desired order
                     speed=fct_relevel(speed,c("low","medium","high")),  # ditto
                     season=fct_relevel(season,c("spring","summer","autumn","winter")) # same here
                     )
ggplot(algae_blooms,aes(x=season,y=a3)) + 
    geom_boxplot() +
    xlab("Season") +
    ylab("Algae Level a3")

We only have 1 year’s worth of data, so it might be too early to tell, but it certainly seems as though the a3 levels decrease from spring to winter.

Violin plots are cousins to the boxplots. Can we get a bit more insight on the a3 trend?

ggplot(algae_blooms,aes(x=season,y=a3)) +  # plot a3 by season ...
    geom_violin() +                        # in a series of violin plots ...  
    geom_jitter() +                        # with some jitter to avoid all the points being on top of one another ...
    xlab("Season") + 
    ylab("Algae Level a3")

This plot certainly seems to suggest that a3 levels are cyclic, with a peak in the spring and low levels in the fall.

Let us return to NH4 for a second to see if we can spot a link with the season (as we did for a3). We only keep the observations for which the NH4 value is greater than 3000, and we bin them with respect to the quartiles.

First, filter the algae_blooms dataset to remove the 2 observations with missing values (is this always a good idea?)

f.NH4.data <- filter(algae_blooms,!is.na(NH4)) 
nrow(f.NH4.data)
[1] 338

Next we remove the 11 observations for which NH4 > 3000 (again, based on the mean + sd “argument”)

f.NH4.data <- filter(algae_blooms,!is.na(NH4)) |> 
     filter(NH4<3000) 
nrow(f.NH4.data)
[1] 327

Finally, we add a new variable to tell us in which quartile the NH4 value falls.

f.NH4.data <- filter(algae_blooms,!is.na(NH4))  |> 
     filter(NH4<3000)  |> 
     mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
nrow(f.NH4.data)
[1] 327

We can now use the new q.NH4 variable to make multi-variate comparisons, say between a1, NH4, and season.

ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) + # plot the a1 levels by season ...
    geom_point() +
    facet_wrap(~q.NH4) +                             # for each of the quartiles (each plot contains ~25% of the observations)
    ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")

That seems decidedly odd… why are we seeing missing values here? Have we not just removed the NAs? Let us delve in a bit deeper.

f.NH4.data[which(is.na(f.NH4.data$q.NH4)),]
table(f.NH4.data$q.NH4,useNA="ifany")
season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla a1 a2 a3 a4 a5 a6 a7 q.NH4
53 spring small medium 5.6 11.8 NA 2.22 5 1 1 NA 82.7 0 0 0 0 0 0 NA
223 autumn small high 5.9 11.9 NA 1.88 5 1 2 NA 75.2 0 0 0 0 0 0 NA
Var1 Freq
(5,36.8] 80
(36.8,103] 82
(103,226] 81
(226,2.17e+03] 82
NA 2

It seems as though the quartiles do not include their lower bound. We can remedy the situation by including an additional parameter in the mutate() call.

f.NH4.data <- filter(algae_blooms,!is.na(NH4)) |> 
    filter(NH4<3000) |> 
    mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1)), include.lowest=TRUE))

ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) +
    geom_point() +
    facet_wrap(~q.NH4) +
    ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")

The NAs have disappeared from the graph. In any case, it seems as though the a1 levels are inversely correlated with the NH4 levels but that the season does not have much of an effect (although we would need more work to conclude this with any degree of certainty).

We can create a similar graph for a3 instead of a1.

ggplot(f.NH4.data,aes(x=a3,y=season,color=season)) +
    geom_point() +
    facet_wrap(~q.NH4) +
    ggtitle("Algae Level a3 by Season and Ammonium Quartiles NH4")

8.7.4 Data Cleaning

We found some potential anomalies in the data when we explored the dataset (although we are electing to keep them in the dataset for the time being as we lack the domain expertise to make a reasonable decision on that front); now let us take some time to clean the data to some extent.

Anomalies come in various flavours; we have already explored some potential outlying behaviour, in this section we handle missing values.

library(dplyr)

The function complete.cases lists the observations for which every field is present (note that it says nothing about the validity of the case).

table(complete.cases(algae_blooms))
Var1 Freq
FALSE 34
TRUE 306

The vast majority of observations do not have missing cases, but a few still do. Is there anything special about them? Are the values missing completely at random?

str(filter(algae_blooms, !complete.cases(algae_blooms)))
summary(filter(algae_blooms, !complete.cases(algae_blooms)))
'data.frame':   34 obs. of  18 variables:
 $ season: Factor w/ 4 levels "spring","summer",..: 3 1 4 4 1 3 1 2 3 1 ...
 $ size  : Factor w/ 3 levels "small","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ speed : Factor w/ 3 levels "low","medium",..: 3 3 1 3 2 2 3 3 2 2 ...
 $ mxPH  : num  6.8 8 NA 6.6 5.6 5.7 6.6 6.6 6.6 6.5 ...
 $ mnO2  : num  11.1 NA 12.6 10.8 11.8 10.8 9.5 10.8 11.3 10.4 ...
 $ Cl    : num  9 1.45 9 NA NA NA NA NA NA NA ...
 $ NO3   : num  0.63 0.81 0.23 3.25 2.22 ...
 $ NH4   : num  20 10 10 10 5 10 20 10 10 10 ...
 $ oPO4  : num  4 2.5 5 1 1 1 1 2 1 2 ...
 $ PO4   : num  NA 3 6 6.5 1 4 6 11 6 14 ...
 $ Chla  : num  2.7 0.3 1.1 NA NA NA NA NA NA NA ...
 $ a1    : num  30.3 75.8 35.5 24.3 82.7 16.8 46.8 46.9 47.1 66.9 ...
 $ a2    : num  1.9 0 0 0 0 4.6 0 0 0 0 ...
 $ a3    : num  0 0 0 0 0 3.9 0 0 0 0 ...
 $ a4    : num  0 0 0 0 0 11.5 28.8 13.4 0 0 ...
 $ a5    : num  2.1 0 0 0 0 0 0 0 0 0 ...
 $ a6    : num  1.4 0 0 0 0 0 0 0 1.2 0 ...
 $ a7    : num  2.1 0 0 0 0 0 0 0 0 0 ...
season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla a1 a2 a3 a4 a5 a6 a7
spring: 7 small :26 low : 4 Min. :5.600 Min. : 5.700 Min. : 0.222 Min. : 0.2300 Min. : 5.00 Min. : 1.000 Min. : 1.00 Min. : 0.30 Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
summer: 6 medium: 1 medium:13 1st Qu.:6.600 1st Qu.: 9.275 1st Qu.: 4.562 1st Qu.: 0.8025 1st Qu.: 10.00 1st Qu.: 1.000 1st Qu.: 6.00 1st Qu.: 1.78 1st Qu.:16.80 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
autumn: 8 large : 7 high :17 Median :7.225 Median :10.800 Median : 9.027 Median : 1.4440 Median : 11.84 Median : 3.667 Median : 10.83 Median : 4.00 Median :30.30 Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000
winter:13 NA NA Mean :7.344 Mean :10.192 Mean :19.312 Mean : 2.1684 Mean : 62.03 Mean : 25.676 Mean : 34.58 Mean :13.97 Mean :36.02 Mean : 4.503 Mean : 1.418 Mean : 2.335 Mean : 1.544 Mean : 1.138 Mean : 1.676
NA NA NA 3rd Qu.:8.000 3rd Qu.:11.300 3rd Qu.:25.238 3rd Qu.: 2.5725 3rd Qu.: 46.38 3rd Qu.: 20.250 3rd Qu.: 19.23 3rd Qu.:12.22 3rd Qu.:54.38 3rd Qu.: 3.400 3rd Qu.: 1.125 3rd Qu.: 1.975 3rd Qu.: 0.900 3rd Qu.: 0.000 3rd Qu.: 1.575
NA NA NA Max. :9.700 Max. :12.600 Max. :71.000 Max. :11.0200 Max. :500.00 Max. :295.667 Max. :380.00 Max. :68.05 Max. :83.00 Max. :36.500 Max. :14.600 Max. :28.800 Max. :21.100 Max. :14.500 Max. :28.000
NA NA NA NA’s :2 NA’s :2 NA’s :16 NA’s :2 NA’s :2 NA’s :2 NA’s :7 NA’s :23 NA NA NA NA NA NA NA

Right off the bat, missing cases seem to be over-represented in small rivers and under-represented in low-speed rivers. But upon further investigation (that is, comparing with the original dataset), we suspect that the under-representation of low-speed rivers is not problematic as it is in-line with the numbers in the larger dataset (by which we mean that low-speed rivers don’t seem to have a systematic missing value problem).

Let us assume for now (in the interest of efficiency) that the fact that small rivers have a lot of missing cases (mostly Cl and Chla) is also not a problem146

The bulk of the missing values seem to come from either Cl, Chla, or PO4. There is also a consistent 2 missing values across the board, but we cannot use the summary output to determine if they arise from the same two observations.

Which observations have missing NH4 values?

algae_blooms[which(is.na(algae_blooms$NH4)),]
season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla a1 a2 a3 a4 a5 a6 a7
62 summer small medium 6.4 NA NA NA NA NA 14 NA 19.4 0.0 0.0 2 0 3.9 1.7
199 winter large medium 8.0 7.6 NA NA NA NA NA NA 0.0 12.5 3.7 1 0 0.0 4.9

While these observations also have missing values in other fields, they do have some non-missing fields as well. But they are both missing 6 of the predictor variables. How useful could they be in training a predictive model? (that depends on the model, of course….)

We can write a function that will compute how many missing cases there are for each observations.

table(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2)
Var1 Freq
0 306
1 20
2 12
6 2
[1]  62 199

Most observations have no missing cases, which is great news. There are a few with 1 or 2, but observations 62 and 199 are wild cards, with 6 missing predictors (out of 11). Based on the small number of such wild cards, we elect to remove them from the analysis.

IMPORTANT NOTES:

  • If you decide to remove observations for any reason whatsoever, you need to document the process that lead you to eliminate it, and make that available.

  • Why do we remove the ones with 6 missing cases, but not the ones with 2 missing cases? If there’d been observations with 4 missing cases, what would you have done? What factors influence your decision?

Our new dataset still contains observations with missing cases, however.

algae_blooms.sna = algae_blooms[-which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2),]
nrow(algae_blooms.sna)
[1] 338

What can we do with the other observations for which values are missing?

One possibility is to use the set of complete observations to compute a correlation matrix, and to see if any numerical field is strongly correlated with another field. That way, if there is a missing value in the first field, the second could be used to impute it.

IMPORTANT NOTE

  • this approach only works for variables that are linearly correlated to a single other variable. Non-linear correlations and multi-variate associations will not be uncovered.
library(corrplot)
corrplot(cor(algae_blooms.sna[,4:18], 
             use="complete.obs"),
         type="upper",tl.pos="d")

The correlation between PO4 (which has missing cases) and oPO4 (which does not, anymore) is clear. What is the nature of the relation? We use the set of complete cases to find it.

algae_blooms.nona <- algae_blooms.sna[-which(apply(algae_blooms.sna,1, 
                                                   function(x) 
                                                     sum(is.na(x)))>0),]
nrow(algae_blooms.nona)
[1] 306
PO4.oPO4.model = lm(PO4 ~ oPO4, data=algae_blooms.nona)
PO4.oPO4.model

Call:
lm(formula = PO4 ~ oPO4, data = algae_blooms.nona)

Coefficients:
(Intercept)         oPO4  
     51.811        1.203  

The regression function is PO4\(=51.811+1.203\)oPO4 (we are not particularly interested in the fit statistics at this point).

Intercept = PO4.oPO4.model$coefficients[[1]]
Slope = PO4.oPO4.model$coefficients[[2]]

What are the observations for which PO4 is missing?

which(is.na(algae_blooms.sna$PO4)==TRUE)
[1]  28 221 291 326 331 335

We can use the regression function to impute the missing PO4 values.

algae_blooms.sna2 <- algae_blooms.sna
algae_blooms.sna2$PO4 <- ifelse(is.na(algae_blooms.sna2$PO4),
                                max(Intercept + Slope*algae_blooms.sna2$oPO4,0),
                                algae_blooms.sna2$PO4)

We can clearly see that no values of PO4 are missing anymore.

which(is.na(algae_blooms.sna2$PO4)==TRUE)
integer(0)

That takes care of the missing values with strong linear correlation to another field. Where do we stand now?

summary(algae_blooms.sna2)
season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla a1 a2 a3 a4 a5 a6 a7
spring:84 small :120 low : 58 Min. :5.600 Min. : 1.500 Min. : 0.222 Min. : 0.000 Min. : 5.00 Min. : 1.00 Min. : 1.00 Min. : 0.200 Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.0
summer:85 medium:136 medium:138 1st Qu.:7.750 1st Qu.: 8.000 1st Qu.: 10.994 1st Qu.: 1.147 1st Qu.: 37.86 1st Qu.: 13.00 1st Qu.: 40.75 1st Qu.: 2.133 1st Qu.: 1.50 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0
autumn:80 large : 82 high :142 Median :8.055 Median : 9.700 Median : 32.470 Median : 2.356 Median : 107.36 Median : 37.24 Median : 104.86 Median : 5.111 Median : 7.10 Median : 2.800 Median : 1.400 Median : 0.000 Median : 2.200 Median : 0.000 Median : 0.0
winter:89 NA NA Mean :8.002 Mean : 9.161 Mean : 42.517 Mean : 3.121 Mean : 471.73 Mean : 73.09 Mean : 166.18 Mean : 12.796 Mean :16.74 Mean : 7.207 Mean : 3.917 Mean : 1.812 Mean : 5.548 Mean : 6.438 Mean : 2.2
NA NA NA 3rd Qu.:8.400 3rd Qu.:10.800 3rd Qu.: 57.750 3rd Qu.: 4.147 3rd Qu.: 244.90 3rd Qu.: 88.11 3rd Qu.: 206.12 3rd Qu.: 17.200 3rd Qu.:25.32 3rd Qu.:10.025 3rd Qu.: 4.675 3rd Qu.: 2.300 3rd Qu.: 8.000 3rd Qu.: 7.075 3rd Qu.: 2.2
NA NA NA Max. :9.700 Max. :13.400 Max. :391.500 Max. :45.650 Max. :24064.00 Max. :1435.00 Max. :1777.93 Max. :110.456 Max. :89.80 Max. :72.600 Max. :42.800 Max. :44.600 Max. :61.100 Max. :77.600 Max. :31.6
NA NA NA NA’s :2 NA’s :1 NA’s :14 NA NA NA NA NA’s :21 NA NA NA NA NA NA NA

There aren’t as many missing values as before, but we still have some. And we have exhausted the correlation trick. What can we do?

There are many ways to tackle the problem, but we will use \(k\)NN imputation. The principle is simple:

  1. Using some similarity/distance metric (typically based on the Euclidean distance between points), identify the \(k\) nearest (complete) neighbours of each observation with a missing case.

  2. Compute the mean value of the missing case in the \(k-\)group of complete observations, and use that value as the imputed value.

IMPORTANT NOTES:

  • As we have seen when we were discussing, we often suggest scaling the data when dealing with distance metrics. We elected not to scale the data explicitly here. How much of an effect can that have?
  • We are going to be using DMwRs implementation of knnImputation() (below). How would you go about determining if the routine scales the data internally?
algae_blooms.sna2 <- DMwR::knnImputation(algae_blooms.sna2,k=10) # the choice of k=10 is arbitrary here. 
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 

Sure enough, there are no further observations with incomplete cases.

table(apply(algae_blooms.sna2,1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns

  0 
338 

8.7.5 Principal Components

Principal components analysis (PCA) is typically used on the (numeric) predictor variables. There are methods that can be used to combine numeric and categorical variables, but for the purposes of this example, we will simply ignore the categorical fields.

pca.algae = algae_blooms.sna2[,4:11]
head(pca.algae)
mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla
8.00 9.8 60.800 6.238 578.000 105.000 170.000 50.0
8.35 8.0 57.750 1.288 370.000 428.750 558.750 1.3
8.10 11.4 40.020 5.330 346.667 125.667 187.057 15.6
8.07 4.8 77.364 2.302 98.182 61.182 138.700 1.4
8.06 9.0 55.350 10.416 233.700 58.222 97.580 10.5
8.25 13.1 65.750 9.248 430.000 18.250 56.667 28.4

We can scale the data frame using the scale() function in R:

head(scale(pca.algae))
mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla
-0.0019358 0.2748339 0.4423909 0.9488773 0.0611046 0.2795474 0.0145250 2.1362239
0.6101140 -0.5036258 0.3731037 -0.5578976 -0.0584991 3.1159254 1.4939011 -0.6253276
0.1729355 0.9667981 -0.0296707 0.6724831 -0.0719160 0.4606113 0.0794349 0.1855592
0.1204741 -1.8875541 0.8186771 -0.2492370 -0.2147992 -0.1043426 -0.1045862 -0.6196571
0.1029870 -0.0711482 0.3185826 2.2206563 -0.1368740 -0.1302752 -0.2610671 -0.1036382
0.4352426 1.7020100 0.5548406 1.8651183 -0.0239980 -0.4804704 -0.4167602 0.9113879

Notice the different values in the dataset. The principal components are obtained via the princomp() function.

pca.1 = princomp(scale(pca.algae))
summary(pca.1)
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation     1.5334279 1.2022703 1.0900775 0.9329798 0.87500293
Proportion of Variance 0.2947973 0.1812179 0.1489744 0.1091293 0.09598775
Cumulative Proportion  0.2947973 0.4760152 0.6249896 0.7341189 0.83010663
                          Comp.6     Comp.7     Comp.8
Standard deviation     0.8081581 0.65499534 0.52248196
Proportion of Variance 0.0818822 0.05378649 0.03422468
Cumulative Proportion  0.9119888 0.96577532 1.00000000

If we can live with 75% of the variability in the numerical component of the predictors being explained by principal components, than we can reduce the dataset dimension by 4.

reduced.algae = data.frame(algae_blooms.sna2[,1:3],pca.1$scores[,1:4],algae_blooms.sna2[12:17])
head(reduced.algae)
  season  size  speed    Comp.1     Comp.2     Comp.3      Comp.4   a1   a2
1 winter small medium 1.0634865  0.2877982  1.6268674  0.26280655  0.0  0.0
2 spring small medium 2.1808329  0.5908937 -2.1258588  1.07795406  1.4  7.6
3 autumn small medium 0.2097857 -0.4229638  0.6216107  0.52245855  3.3 53.6
4 spring small medium 0.5213826  0.7948266 -1.0031723 -1.46518459  3.1 41.0
5 autumn small medium 0.6477956 -0.8568203  1.1386691 -0.78026048  9.2  2.9
6 winter small   high 0.1709107 -0.7049677  2.4338593  0.02250525 15.1 14.6
    a3  a4   a5   a6
1  0.0 0.0 34.2  8.3
2  4.8 1.9  6.7  0.0
3  1.9 0.0  0.0  0.0
4 18.9 0.0  1.4  0.0
5  7.5 0.0  7.5  4.1
6  1.4 0.0 22.5 12.6

Whether this reduction proves useful or not will ultimately depend on what we would like to accomplish with the data; we will study this dataset again in the Regression and Value Estimation module.

References

[74]
L. Torgo, Data Mining with R, 2nd ed. CRC Press, 2016.