13.2 Simple Classification Methods

Qualitative variables take values (the levels) in an unordered set \(\mathcal{C}=\{C_1,\ldots,C_K\}\). For instance,

  • \(\text{hair colour}\in \{\text{black},\text{red},\text{blond},\text{grey}, \text{other}\}\)

  • \(\text{email message}\in \{\text{ham},\text{spam}\}\)

  • \(\text{life expectancy} \in \{\text{high},\text{low}\}\)

For a training set \(\text{Tr}\) with observations \((\vec{X},Y)\in \mathbb{R}^{p}\times \mathcal{C}\), the classification problem is to build a classifier \(\hat{C}:\mathbb{R}^p\to \mathcal{C}\) to approximate the optimal Bayes classifier \(C:\mathbb{R}^p\to \mathcal{C}\) (as discussed in the previous section).

In many instances, we might be more interested in the probabilities \[\pi_k(\mathbf{x})=P\{\hat{C}(\mathbf{x})=C_k\},\quad k=1,\ldots,K\] than in the classification predictions themselves. Typically, the classifier \(\hat{C}\) is built on a training set \[\text{Tr}=\{(\mathbf{x}_j,y_j)\}_{j=1}^N\] and evaluated on a testing set \[\text{Te}=\{(\mathbf{x}_i,y_i)\}_{i=N+1}^M.\]

Example

Let us revisit the Gapminder dataset, again focusing on observations from 2011, with the difference that life expectancy is now recorded as “high” \((1)\) if it falls above \(72.45\) (the median in 2011), and as “low” \((0)\) otherwise.

library(dplyr)
gapminder.ML = as.data.frame(unclass(data.frame(read.csv("Data/gapminder.csv"))),
                          stringsAsFactors=TRUE)
gapminder.ML <- gapminder.ML[complete.cases(gapminder.ML),]
gapminder.ML <- gapminder.ML[,c("country","year","region","continent",
                          "population","infant_mortality","fertility","gdp",
                          "life_expectancy")]

gapminder.2011 <- gapminder.ML |> filter(year==2011) 

gapminder.2011 <- gapminder.2011 |> 
  mutate(LE=as.factor(ifelse(life_expectancy<median(life_expectancy),
                             "low","high"))) 

The structure and summary are provided below:

summary(gapminder.2011[,c("infant_mortality","fertility", "LE")])
 infant_mortality    fertility        LE    
 Min.   :  1.800   Min.   :1.260   high:83  
 1st Qu.:  7.275   1st Qu.:1.792   low :83  
 Median : 16.900   Median :2.420            
 Mean   : 27.333   Mean   :2.931            
 3rd Qu.: 41.125   3rd Qu.:3.908            
 Max.   :106.800   Max.   :7.580            

Let us assume that we are interested in modeling the response LE (\(Y\)) as a linear function of the predictors \(X_1\) (infant mortality) and \(X_2\) (fertility), using ordinary least squares (OLS) regression (see Module ?? for a detailed discussion of such models).

p1 <- ggpubr::ggboxplot(gapminder.2011, x = "LE", y = "infant_mortality",
                fill = "LE", palette = "jco", 
                xlab="Life Expectancy", ylab="Infant Mortality") +  
                ggpubr::rremove("legend")

p2 <- ggpubr::ggboxplot(gapminder.2011, x = "LE", y = "fertility",
                fill = "LE", palette = "jco", 
                xlab="Life Expectancy", ylab="Fertility") +  
                ggpubr::rremove("legend")


grid::pushViewport(grid::viewport(layout = grid::grid.layout(nrow = 1, ncol = 2)))

# A helper function to define a region on the layout
define_region <- function(row, col){
  grid::viewport(layout.pos.row = row, layout.pos.col = col)
} 

print(p1, vp = define_region(row = 1, col = 1))
print(p2, vp = define_region(row = 1, col = 2))

Figure 13.3: Infant mortality and fertility rates for the life expectancy classes in the 2011 Gapminder data.

We could run a simple linear regression of \(Y\) on \(\vec{X}\) over \(\text{Tr}\) and obtain the model \[\hat{Y}=\hat{\beta}_0+\hat{\beta}_1\cdot \text{infant mortality}+\hat{\beta}_2\cdot \text{fertility},\] from which we would classify an observation’s life expectancy as \[\hat{C}(\vec{X})=\begin{cases}\text{high} & \text{if }\hat{Y}>0.5 \\ \text{low} & \text{else}\end{cases}\]

gapminder.2011 <- gapminder.2011 |> mutate(LE.resp=ifelse(LE=="high",1,0))
model.class <- lm(LE.resp~infant_mortality+fertility, data=gapminder.2011)
beta_0=as.numeric(model.class[[1]][1])
beta_1=as.numeric(model.class[[1]][2])
beta_2=as.numeric(model.class[[1]][3])
model.class[[1]]
     (Intercept) infant_mortality        fertility 
      1.00102979      -0.01188533      -0.06010533 

Thus, \[\hat{Y}=1.001-0.012\cdot \text{infant mortality}-0.060\cdot \text{fertility}.\] Let us plot the decision boundary on the scatterplot of the domain:

slope = -beta_1/beta_2
intercept = 0.5*(1-2*beta_0)/beta_2

ggpubr::ggscatter(gapminder.2011, x="infant_mortality", y="fertility",
          shape="LE", color="LE", palette="jco", size = 2, 
          xlab="Infant Mortality", ylab = "Fertility", 
          title = "Gapminder 2011 Data") + 
  ggplot2::geom_abline(intercept = intercept, slope = slope, 
              color="red", linetype="dashed", size=1.5)

Figure 13.4: Linear regression decision boundary for the Gapminder 2011 example.

In that case, the OLS approach is likely to do a decent job, as the data is roughly linearly separable over the predictors. This will not usually be the case, however.

In the example above, the optimal regression function is \[f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}]=P(Y=1\mid \vec{X}=\mathbf{x})=p_1(\mathbf{x})\] because \(Y\) is a binary variable; this might lead us to believe that \(f(\mathbf{x})\) could also be used to directly classify and determine the class probabilities for the data, in which case there would be no need for a separate classification apparatus.

There is one major drawback with this approach: if linear regression is used to model the data (which is to say, if we assume that \(f(\mathbf{x})\approx \mathbf{x}^{\!\top}\boldsymbol{\beta}\)), we need to ensure that \(\hat{f}_{\text{OLS}}(\mathbf{x})\in [0,1]\) for all \(\mathbf{x}\in \text{Te}\). This, in general, cannot be guaranteed.
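We can see this on the fit obtained above: with the estimated coefficients, the countries with the highest infant mortality rates in 2011 receive negative fitted values.

range(fitted(model.class))   # fitted values of the linear model; the lower end is negative here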

Another problem arises if we examine the residuals more closely. If we model \(Y\in\{0,1\}\) with an OLS regression, we have \[Y_i=\mathbf{x}_i^{\!\top}\boldsymbol{\beta}+\varepsilon_i.\] Thus \[\varepsilon_i=Y_i-\mathbf{x}_i^{\!\top}\boldsymbol{\beta}=\begin{cases}1-\mathbf{x}_i^{\!\top}\boldsymbol{\beta} & \text{if $Y_i=1$} \\ -\mathbf{x}_i^{\!\top}\boldsymbol{\beta} & \text{if $Y_i=0$}\end{cases}\] But OLS assumes that \(\boldsymbol{\varepsilon}\sim \mathcal{N}(\mathbf{0},\sigma^2I)\), which is clearly not the case here, as \(\varepsilon_i\) can only take two values. OLS is thus not an appropriate way to model the response.

Furthermore, \[\text{Var}(Y_i)=p_1(\mathbf{x}_i)(1-p_1(\mathbf{x}_i)),\] since \(Y_i\) is a Bernoulli random variable, and \[\text{Var}(\varepsilon_i)=\text{Var}(Y_i-p_1(\mathbf{x}_i))=\text{Var}(Y_i)= p_1(\mathbf{x}_i)(1-p_1(\mathbf{x}_i)),\] which is not constant as it depends on \(\mathbf{x}_i\).

The OLS assumptions are thus violated at every turn225 – OLS is simply not an appropriate modeling approach for estimating \[p_k(\mathbf{x})=P(Y=C_k\mid \vec{X}=\mathbf{x}).\]

In what follows, we introduce two simple classification methods (see [3] for more details).

13.2.1 Logistic Regression

The problems presented above point to OLS not being an ideal method for classification; nevertheless, linear regression still provided a good separator in the Gapminder example. This suggests that we should not automatically reject the idea of first transforming the data, and then checking whether OLS provides an appropriate model on the transformed scale.

Formulations

In logistic regression, we seek an invertible transformation \(g:\mathbb{R}\to [0,1]\), with \(g(y^*)=y\) and \(g^{-1}(y)=y^*\). The transformed value \(g(y^*)\) must behave like a probability; in the 2-class setting, we use \(g_L(y^*)\) to approximate the probability \[p_1(\mathbf{x})=P(Y=1\mid \vec{X}=\mathbf{x}).\] The idea is to run OLS on a transformed training set \[\text{Tr}^*=\{(\mathbf{x}_i,y_i^*)\}_{i=1}^N,\] where \(y_i^*=g^{-1}(y_i)\), and to transform the results back using \(y_i=g(y_i^*)\).

There are many such transformations: the probit model,226 which we will not discuss, and the logit model are two common approaches.

Logit Model

The logit model uses the transformation \[y=g_L(y^*)=\frac{e^{y^*}}{1+e^{y^*}}.\] It is such that \[g^{-1}_L(0)=-\infty,\quad g^{-1}_L(1)=\infty,\quad g^{-1}_L(0.5)=0, \quad \text{etc.}\] We solve for \(y^*\) in order to get a transformed response \(y^*\in \mathbb{R}\) (instead of one restricted to \([0,1]\)): \[p_1(\mathbf{x})=\frac{e^{y^*}}{1+e^{y^*}}\iff y^*=g^{-1}_L(y)=\ln\left(\frac{p_1(\mathbf{x})}{1-p_1(\mathbf{x})}\right).\] It is the log-odds transformed observations that we attempt to fit with an OLS model: \[\hat{Y}^*=\ln\left(\frac{p_1(\mathbf{x})}{1-p_1(\mathbf{x})}\right)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p=\mathbf{x}^{\!\top}\boldsymbol{\beta}.\] In order to make a prediction for \(p_1(\mathbf{x})\), we estimate \(y^*\) and use the logit transformation to obtain \(y\). For instance, if \(\mathbf{x}^{\!\top}\hat{\boldsymbol{\beta}}=0.68\), then \[\hat{y}^*=\ln\left(\frac{\hat{p}_1(\mathbf{x})}{1-\hat{p}_1(\mathbf{x})}\right)=0.68\] and \[\hat{p}_1(\mathbf{x})=\frac{e^{y^*}}{1+e^{y^*}}=\frac{e^{0.68}}{1+e^{0.68}}=0.663.\] Depending on the decision rule threshold \(\gamma\), we may thus predict that \(\hat{C}(\mathbf{x})=C_1\) if \(p_1(\mathbf{x})>\gamma\) or \(\hat{C}(\mathbf{x})=C_2\), otherwise.
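A quick numerical check of this worked example, using base R's plogis() (which computes \(g_L\)) and qlogis() (which computes \(g_L^{-1}\)):

plogis(0.68)    # g_L(0.68): about 0.663
qlogis(0.663)   # back-transformation: about 0.68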

The technical challenge is in obtaining the coefficients \(\hat{\boldsymbol{\beta}}\); they are found by maximizing the likelihood (see [32]) \[\begin{aligned} L(\boldsymbol{\beta})&=\prod_{y_i=1}p_1(\mathbf{x}_i)\prod_{y_i=0}(1-p_1(\mathbf{x}_i))\\&=\prod_{y_i=1}\frac{\exp(\mathbf{x_i}^{\!\top}\boldsymbol{\beta})}{1+\exp(\mathbf{x_i}^{\!\top}\boldsymbol{\beta})}\prod_{y_i=0}\frac{1}{1+\exp(\mathbf{x_i}^{\!\top}\boldsymbol{\beta})},\end{aligned}\] or, more simply: \[\begin{aligned} \hat{\boldsymbol{\beta}}&=\arg\max_{\boldsymbol{\beta}} \{L(\boldsymbol{\beta})\}=\arg\max_{\boldsymbol{\beta}} \{\ln L(\boldsymbol{\beta})\} \\ &=\arg\max_{\boldsymbol{\beta}} \left\{\sum_{y_i=1}\ln p_1(\mathbf{x}_i)+\sum_{y_i=0}\ln (1-p_1(\mathbf{x}_i))\right\} \\ &= \text{... (terms in $\boldsymbol{\beta}$ and the observations $\mathbf{x}_i$).}\end{aligned}\] The optimizer \(\hat{\boldsymbol{\beta}}\) is then found using numerical methods; in R, the function glm() computes the maximum likelihood estimate directly.

Example

Using the Gapminder data from the start of the section, we obtain the following model:

model.LR <- glm(LE.resp~infant_mortality+fertility, data=gapminder.2011,family=binomial)
model.LR

Call:  glm(formula = LE.resp ~ infant_mortality + fertility, family = binomial, 
    data = gapminder.2011)

Coefficients:
     (Intercept)  infant_mortality         fertility  
         4.58733          -0.22499          -0.06495  

Degrees of Freedom: 165 Total (i.e. Null);  163 Residual
Null Deviance:      230.1 
Residual Deviance: 78.17    AIC: 84.17

Thus

\[\hat{y}^*=\ln\left(\frac{P(Y=\text{high}\mid \vec{X})}{1-P(Y=\text{high}\mid \vec{X})}\right)=4.59-0.22X_1-0.06X_2.\] For a decision rule threshold of \(\gamma=0.5\), the decision boundary is shown below (compare with the linear regression boundary of Figure 13.4).

beta_0 = as.numeric(model.LR[[1]][1])
beta_1 = as.numeric(model.LR[[1]][2])
beta_2 = as.numeric(model.LR[[1]][3])

slope = -beta_1/beta_2
intercept = -beta_0/beta_2   # p_1(x) = 0.5 corresponds to y* = x^T beta = 0

ggpubr::ggscatter(gapminder.2011, x="infant_mortality", y="fertility",
          shape="LE", color="LE", palette="jco", size = 2, 
          xlab="Infant Mortality", ylab = "Fertility", 
          title = "Gapminder 2011 Data") + 
  ggplot2::geom_abline(intercept = intercept, slope = slope, 
              color="red", linetype="dashed", size=1.5)

Figure 13.5: Logistic regression decision boundary for the Gapminder 2011 example.

What is the estimated probability that the life expectancy is high in a country whose infant mortality is 15 and whose fertility is 4? By construction, \[\begin{aligned} P(Y=\text{high}\mid X_1=15,X_2=4)&\approx g_L\big([1,15,4]\,\hat{\boldsymbol{\beta}}\big) =\frac{\exp(4.59-0.22(15)-0.06(4))}{1+\exp(4.59-0.22(15)-0.06(4))} \\ &= \frac{\exp(0.9526322)}{1+\exp(0.9526322)}=0.72,\end{aligned}\] where the exponent is computed with the un-rounded coefficients.
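The same estimate can be obtained directly from the fitted model via predict() with type="response", which returns \(\hat{p}_1(\mathbf{x})\) rather than \(\hat{y}^*\):

predict(model.LR, newdata=data.frame(infant_mortality=15, fertility=4),
        type="response")   # about 0.72, as computed by hand above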

At this point, we might be wondering how all of this squares up with the statistical learning framework of Sections 11 and 12. No testing set has made an appearance, no misclassification or mean squared error rate has been calculated.

We randomly select 116 observations, say, and train a logistic regression model on this training set \(\text{Tr}\) to obtain:

set.seed(0)
ind.train = sample(nrow(gapminder.2011),round(0.7*nrow(gapminder.2011)),replace=FALSE)
gapminder.2011.tr = gapminder.2011[ind.train,]
gapminder.2011.te = gapminder.2011[-ind.train,]

model.LR.tr <- glm(LE.resp~infant_mortality+fertility, family=binomial, data=gapminder.2011.tr)
model.LR.tr

Call:  glm(formula = LE.resp ~ infant_mortality + fertility, family = binomial, 
    data = gapminder.2011.tr)

Coefficients:
     (Intercept)  infant_mortality         fertility  
          6.1194           -0.2050           -0.6653  

Degrees of Freedom: 115 Total (i.e. Null);  113 Residual
Null Deviance:      159.1 
Residual Deviance: 50.83    AIC: 56.83

So that \[\hat{y}^*=6.12-0.21x_1-0.67x_2.\]

Now, compute \[\hat{p}_i=P(Y_i=\text{high}|X_1=x_{1,i},X_2=x_{2,i})=\frac{\exp(\hat{y}_i^*)}{1+\exp(\hat{y}_i^*)}\] on the observations in the testing set \(\text{Te}\) (see below).

beta_0 = as.numeric(model.LR.tr[[1]][1])   # coefficients from the model fitted on Tr
beta_1 = as.numeric(model.LR.tr[[1]][2])
beta_2 = as.numeric(model.LR.tr[[1]][3])

gapminder.2011.te$y.star = beta_0 + beta_1*gapminder.2011.te$infant_mortality + 
  beta_2*gapminder.2011.te$fertility
gapminder.2011.te$p.1 = exp(gapminder.2011.te$y.star)/(1+exp(gapminder.2011.te$y.star))
gapminder.2011.te$MSE = (gapminder.2011.te$p.1-gapminder.2011.te$LE.resp)^2
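# test MSE: the average of the squared differences computed above
mean(gapminder.2011.te$MSE)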

summary(gapminder.2011.te$LE.resp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00    0.36    1.00    1.00 

We obtain \[\text{MSE}_\text{Te}=\frac{1}{50}\sum_{i=1}^{50}(\hat{p}_i-\mathcal{I}[Y_i=\text{high}])^2=0.075.\] Is that a good test error? It is difficult to answer without more context. Perhaps a more intuitive way to view the situation is to make actual predictions and to explore their quality.

ggpubr::ggscatter(gapminder.2011.te, x="p.1", y="LE.resp",
          shape="LE", color="LE", palette="jco", size = 3, 
          xlab="P(Y=1)", ylab = "Life Expectancy Groups", 
          title = "Gapminder 2011 Data - Testing Set Predictions")

For \(\alpha\in [0,1]\), define \[\text{pred}_i(\alpha)=\begin{cases}\text{high} & \text{if $\hat{p}_i>\alpha$} \\ \text{low} & \text{else}\end{cases}\]

In the specific version of \(\text{Te}\) used in this example, 36% of the nations had a high life expectancy.

gapminder.2011.te$pred81 = ifelse(gapminder.2011.te$p.1>0.81,1,0)
table(gapminder.2011.te$LE.resp,gapminder.2011.te$pred81)

If we set \(\alpha=0.81\), then the model predicts that 36% of the test nations will have a high life expectancy, and the confusion matrix on \(\text{Te}\) is shown below:

\(\alpha=0.81\)            prediction
                          0      1
   actual     0          30      2
              1           2     16

But why do we pick \(\alpha=0.81\) instead of \(\alpha=0.5\), say (which may be the only rational choice in the absence of information)? In the latter case, 42% of nations are predicted to have high life expectancy, and the confusion matrix on \(\text{Te}\) is as below.

gapminder.2011.te$pred50 = round(gapminder.2011.te$p.1,0)
table(gapminder.2011.te$LE.resp,gapminder.2011.te$pred50)
\(\alpha=0.5\)             prediction
                          0      1
   actual     0          28      4
              1           2     17

We will revisit this question in Section 13.2.3.

13.2.2 Discriminant Analysis

In logistic regression, we model \(P(Y=C_k\mid \mathbf{x})\) directly via the logistic function \[p_1(\mathbf{x})=\frac{\exp(\mathbf{x}^{\!\top}\boldsymbol{\hat{\beta}})}{1+\exp(\mathbf{x}^{\!\top}\boldsymbol{\hat{\beta}})}.\]

We have discussed some of the properties of the process in the previous section, but it should be noted that logistic regression is sometimes contra-indicated:

  • when the classes are well-separated, the coefficient estimates may be unstable (adding as little as one additional point to \(\text{Tr}\) could change the coefficients substantially);

  • when \(\text{Tr}\) is small and the distribution of the predictors is roughly Gaussian in each of the classes \(Y=C_k\), the coefficient estimates may be unstable too;

  • when there are more than 2 response levels, it is not always obvious how to select an extension of logistic regression.

 

In discriminant analysis (DA), we instead model \[P(\mathbf{x}\mid Y=C_k),\] the distribution of the predictors \(\vec{X}\) conditional on the level of \(Y\), and use Bayes’ Theorem to obtain \[P(Y=C_k\mid \mathbf{x}),\] the probability of observing the response conditional on the predictors.

Let \(\mathcal{C}=\{C_1,\ldots,C_K\}\) be the \(K\) response levels, \(K\geq 2\), and denote by \(\pi_k\) the probability that a random observation lies in \(C_k\), for \(k\in \{1,\ldots, K\}\); \(\pi_k\) is the prior \[\pi_k=P(Y=C_k),\] typically estimated by the observed class proportion \(|C_k|/N\) in the training set. Let \(f_k(\mathbf{x})=P(\mathbf{x}\mid Y=C_k)\) be the conditional density function of the distribution of \(\vec{X}\) in \(C_k\); we would expect \(f_k(\mathbf{x})\) to be large if there is a high probability that an observation in \(C_k\) has a corresponding predictor \(\vec{X}\approx \mathbf{x}\), and small otherwise.

According to Bayes’ Theorem, \[\begin{aligned} p_k(\mathbf{x})&=P(Y=C_k\mid \mathbf{x})\\ &=\frac{P(\mathbf{x}\mid Y=C_k)\cdot P(Y=C_k)}{P(\mathbf{x})} \\ & =\frac{P(\mathbf{x}\mid Y=C_k)\cdot P(Y=C_k)}{\displaystyle{{\sum_{j=1}^KP(\mathbf{x}\mid Y=C_j)\cdot P(Y=C_j)}}}\\ & =\frac{\pi_kf_k(\mathbf{x})}{\displaystyle{{\pi_1f_1(\mathbf{x})+\cdots + \pi_Kf_K(\mathbf{x})}}}.\end{aligned}\]

Given an observation \(\mathbf{x}\in \text{Te}\), the DA classifier is \[\hat{C}_{\text{DA}}(\mathbf{x})=C_{\arg\max_j \{p_{j}(\mathbf{x})\}}.\] In order to say more about discriminant analysis, we need to make additional assumptions on the nature of the underlying distributions.
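A toy illustration of this computation, with made-up priors and density values (none of these numbers come from the Gapminder data):

pi.k <- c(0.3, 0.5, 0.2)               # priors for C_1, C_2, C_3
f.k  <- c(1.2, 0.1, 0.6)               # conditional densities f_k(x) at a given x
(p.k <- pi.k*f.k/sum(pi.k*f.k))        # posterior probabilities p_k(x)
which.max(p.k)                         # the DA classifier picks the class with the largest posterior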

Linear Discriminant Analysis

If there is only one predictor (\(p=1\)), we make the Gaussian assumption, \[f_k(x)=\frac{1}{\sqrt{2\pi}\sigma_k}\exp\left[-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma_k}\right)^2\right],\] where \(\mu_k\) and \(\sigma_k\) are the mean and the standard deviation, respectively, of the predictor for all observations in class \(C_k\).227

If we further assume that \(\sigma_k\equiv \sigma\) for all \(k\), then \[\begin{aligned} p_k(x)&=\frac{\displaystyle{\pi_k\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma}\right)^2\right]}}{\displaystyle{\sum_{j=1}^K\pi_j\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu_j}{\sigma}\right)^2\right]}} \\ &=\frac{\displaystyle{\pi_k\exp\left[-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma}\right)^2\right]}}{\displaystyle{\sum_{j=1}^K\pi_j\exp\left[-\frac{1}{2}\left(\frac{x-\mu_j}{\sigma}\right)^2\right]}} \\ &=\frac{\displaystyle{\pi_k\exp\left[\frac{\mu_k}{\sigma^2}\left(x-\frac{\mu_k}{2}\right)\right]\exp\left(-\frac{x^2}{2\sigma^2}\right)}}{\displaystyle{\sum_{j=1}^K\pi_j\exp\left[\frac{\mu_j}{\sigma^2}\left(x-\frac{\mu_j}{2}\right)\right]\exp\left(-\frac{x^2}{2\sigma^2}\right)}} \\ &=\pi_k\exp\left[\frac{\mu_k}{\sigma^2}\left(x-\frac{\mu_k}{2}\right)\right]\cdot A(x).\end{aligned}\]

We do not have to compute the actual probabilities \(p_k(x)\) directly if we are only interested in classification; in that case, the discriminant score for each class may be more useful: \[\delta_k(x)=\ln p_k(x)=\ln \pi_k+x\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\ln A(x).\]

As \(\ln A(x)\) is the same for all \(k\), we can drop it from the score (it does not contribute to relative differences in class scores); given an observation \(x\in \text{Te}\), the linear discriminant analysis (LDA) classifier with \(p=1\) is \[\hat{C}_{\text{LDA}}(x)=C_{\arg\max_j \{\hat{\delta}_{j}(x)\}}.\]

Under other assumptions on the density function, the discriminant score formulation may change. The “linear” in LDA comes from the linearity of the discriminant scores \(\delta_k\) (after the \(\ln A(x)\) term has been dropped). If \(K=2\) and \(\pi_1=\pi_2=0.5\), the midpoint \(x^*=\frac{1}{2}(\mu_1+\mu_2)\) of the predictor means in \(C_1\) and \(C_2\) plays a crucial role. Indeed, the discriminant scores \(\delta_1(x)\) and \(\delta_2(x)\) meet when \[\begin{aligned} x^*\frac{\mu_1}{\sigma^2}-\frac{\mu_1^2}{2\sigma^2}=x^*\frac{\mu_2}{\sigma^2}-\frac{\mu_2^2}{2\sigma^2} \implies x^*=\frac{\mu_1+\mu_2}{2},\end{aligned}\] as long as \(\mu_1\neq \mu_2\). If \(\mu_1<\mu_2\), say, then the decision rule simplifies to \[\hat{C}(x)=\begin{cases}C_1 & \text{if $x\leq x^*$} \\ C_2 & \text{if $x> x^*$}\end{cases}\]

The principle is illustrated in Figure 13.6.


Figure 13.6: Midpoint of two theoretical normal distributions (dashed line); midpoint of two empirical normal distributions (solid line). Observations to the left of the decision boundary are classified as green, those to the right as purple. [3]

In practice, we estimate \(\pi_k\), \(\mu_k\) and \(\sigma\) from \(\text{Tr}\): \[\begin{aligned} \hat{\pi}_k&=\frac{N_k}{N},\quad \hat{\mu}_k=\frac{1}{N_k}\sum_{y_i\in C_k}x_i \\ \hat{\sigma}^2&=\sum_{k=1}^K\frac{N_k-1}{N-K}\left(\frac{1}{N_k-1}\sum_{y_i\in C_k}(x_i-\hat{\mu}_k)^2\right).\end{aligned}\]
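A short sketch of these estimates on simulated univariate data (the sample sizes, means, and common standard deviation below are arbitrary):

set.seed(0)
K <- 2
x <- c(rnorm(30, mean=0, sd=1), rnorm(20, mean=3, sd=1))  # two simulated classes
y <- rep(c(1,2), times=c(30,20))
N <- length(x)
N.k <- table(y)

(pi.hat <- N.k/N)                      # estimated priors
(mu.hat <- tapply(x, y, mean))         # estimated class means
# pooled variance estimate, as in the formula above
(sigma2.hat <- sum(tapply(x, y, function(z){ sum((z-mean(z))^2) }))/(N-K))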

If there are \(p>1\) predictors, we can still make the Gaussian assumption, but adapted to \(\mathbb{R}^p\): \[f_{k}(\mathbf{x})= \frac{1}{(2\pi)^{p/2}\left|\boldsymbol{\Sigma}_k\right|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{\!\top}\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right],\] where \(\boldsymbol{\mu}_k=(\overline{X_1},\ldots,\overline{X_p})\) is the vector of predictor means and \(\boldsymbol{\Sigma}_k(j,i)=\text{Cov}(X_i,X_j)\) the covariance matrix, both computed over the observations for which \(Y=C_k\).

If we further assume that \(\boldsymbol{\Sigma}_k\equiv \boldsymbol{\Sigma}\) for all \(k\), then we can show that the discriminant score is, again, linear in \(\mathbf{x}\): \[\delta_{k;\text{LDA}}(\mathbf{x})=\mathbf{x}^{\!\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k-\frac{1}{2}\boldsymbol{\mu}_k^{\!\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k+\ln \pi_k=c_{k,0}+\mathbf{c}_k^{\!\top}\mathbf{x}.\]

We can estimate \(\boldsymbol{\mu}_k\) and \(\boldsymbol{\Sigma}\) from the data, from which we can recover the estimates \[P(Y=C_k\mid \mathbf{x})\approx \hat{p}_k(\mathbf{x})=\frac{\exp(\hat{\delta}_{k;\text{LDA}}(\mathbf{x}))}{\sum_{j=1}^K\exp(\hat{\delta}_{j;\text{LDA}}(\mathbf{x}))}.\]

The decision rule is as before: given an observation \(\mathbf{x}\in \text{Te}\), the LDA classifier with \(p>1\) is \[\hat{C}_{\text{LDA}}(\mathbf{x})=C_{\arg\max_j\{\hat{\delta}_{j;\text{LDA}}(\mathbf{x})\}}.\]

Quadratic Discriminant Analysis

The assumption that the conditional probability functions be Gaussians with the same covariance in each training class may be a stretch in some situations.

If \(\boldsymbol{\Sigma}_i\neq \boldsymbol{\Sigma}_j\) for at least one pair of classes \((i,j)\), then a similar process gives rise to quadratic discriminant analysis (QDA), which reduces to discriminant scores \[\begin{aligned} \delta_{k;\text{QDA}}(\mathbf{x})&=-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{\!\top}\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)+\ln \pi_k \\&=-\frac{1}{2}\mathbf{x}^{\!\top}\boldsymbol{\Sigma}_k^{-1}\mathbf{x}+\mathbf{x}^{\!\top}\boldsymbol{\Sigma}_k^{-1}\boldsymbol{\mu}_k-\frac{1}{2}\boldsymbol{\mu}_k^{\!\top}\boldsymbol{\Sigma}_k^{-1}\boldsymbol{\mu}_k+\ln \pi_k. \end{aligned}\] To learn the LDA model, we must estimate \(Kp+\frac{p(p+1)}{2}\) parameters from \(\text{Tr}\);228 to learn QDA, \(K\left(p+\frac{p(p+1)}{2}\right)\).229 QDA is thus more complex (and more flexible) than LDA.
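For instance, with \(K=2\) classes and \(p=2\) predictors (as in the Gapminder example): \[\underbrace{Kp+\frac{p(p+1)}{2}}_{\text{LDA}}=2(2)+3=7\qquad\text{vs.}\qquad \underbrace{K\left(p+\frac{p(p+1)}{2}\right)}_{\text{QDA}}=2(2+3)=10.\]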

LDA is recommended if \(\text{Tr}\) is small, QDA if \(\text{Tr}\) is large; however, LDA will yield high bias if the \(\boldsymbol{\Sigma}_k\equiv \boldsymbol{\Sigma}\) assumption is invalid. Note that LDA gives rise to linear separating hypersurfaces and QDA to quadratic ones.

Gaussian Naïve Bayes Classification

If we assume further that each \(\boldsymbol{\Sigma}_k\) is diagonal (that is, if we assume that the features are independent from each other in each class), we obtain the Gaussian naïve Bayes classifier (GNBC), with discriminant scores given by \[\delta_{k;\text{GNBC}}(\mathbf{x})=-\frac{1}{2}\sum_{j=1}^p\frac{(x_j-\mu_{k,j})^2}{\sigma_{k,j}^2}+\ln \pi_k.\]

The classification process continues as before. Note that the assumption of independence is usually not met, hence the “naïve” part of the name. In spite of this, GNBC can prove very useful when \(p\) is too large for LDA and QDA, both of which break down in that setting.
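As a concrete illustration (a sketch, assuming the e1071 package is installed), naiveBayes() fits a Gaussian naïve Bayes model when the features are numeric:

library(e1071)

model.NB <- naiveBayes(as.factor(LE.resp)~infant_mortality+fertility,
                       data=gapminder.2011.tr)
# class probabilities on the testing set (one column per class)
head(predict(model.NB, newdata=gapminder.2011.te, type="raw"))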

Note that this approach can also be used for mixed feature vectors, by using the appropriate combination of probability mass and density functions for the per-feature distributions \(f_{k,j}(x_j)\). We will revisit naïve Bayes classification in a later section.

Logistic Regression (Reprise)

We can also recast the \(2-\)class LDA model as \[\begin{aligned} \ln \left(\frac{p_0(\mathbf{x})}{1-p_0(\mathbf{x})}\right)&=\ln (p_0(\mathbf{x}))-\ln (p_1(\mathbf{x}))=\delta_0(\mathbf{x})-\delta_1(\mathbf{x})=a_0+\mathbf{a}^{\!\top}\mathbf{x}, \end{aligned}\] which has the same form as logistic regression. It is not the same model, however:

  • in logistic regression, the parameters are estimated using the maximum likelihood on \(P(Y\mid \mathbf{x})\);

  • in LDA, the parameters are estimated using the full likelihood \(P(\mathbf{x}\mid Y)P(Y)=P(\mathbf{x},Y)\).

Example

We finish this section by giving an example of LDA and QDA on the 2011 Gapminder data (we will use the same training set \(\text{Tr}\) with \(N=116\) observations and testing set \(\text{Te}\) with \(M=50\) observations).

Given an observation \(\mathbf{x}\in \text{Te}\), we use a decision rule based on the probabilities \(\hat{p}_0(\mathbf{x}),\hat{p}_1(\mathbf{x})\) and a decision threshold \(\alpha\in (0,1)\). On the training set \(\text{Tr}\), we find:

library(dplyr)
tmp = gapminder.2011.tr |> group_by(LE.resp) |> summarise(N.k=n(),mean.im=mean(infant_mortality),mean.f=mean(fertility))

N.0 = tmp[[2]][1]
N.1 = tmp[[2]][2]
N = N.0 + N.1

mu.0 = t(matrix(cbind(tmp[[3]][1],tmp[[4]][1])))
mu.1 = t(matrix(cbind(tmp[[3]][2],tmp[[4]][2])))
mu   = (N.0*mu.0+N.1*mu.1)/N

tmp <- gapminder.2011.tr |>  
  split(gapminder.2011.tr$LE.resp) |> 
  purrr::map(select, c("infant_mortality","fertility")) |> 
  purrr::map(cov)

Sigma.0 <- tmp[[1]]
Sigma.1 <- tmp[[2]]
Sigma   <- cov(gapminder.2011.tr[,c("infant_mortality","fertility")])

so that \[\begin{aligned} N_0&=51,\quad N_1=65,\quad \hat{\pi}_0=51/116,\quad \hat{\pi}_1=65/116, \\ \boldsymbol{\hat{\mu}}_0&=(45.40,4.08)^{\!\top},\quad \boldsymbol{\hat{\mu}}_1=(9.57,1.92)^{\!\top},\\ \boldsymbol{\Sigma}_0&=\begin{pmatrix}496.51 & 23.38 \\ 23.38 & 2.17\end{pmatrix},\quad \boldsymbol{\Sigma}_1=\begin{pmatrix}42.79 & 2.14 \\ 2.14 & 0.31\end{pmatrix}, \\ \boldsymbol{\hat{\mu}}&=(25.30,2.87)^{\!\top},\quad \boldsymbol{\Sigma}=\begin{pmatrix}557.89 & 30.51 \\ 30.51 & 2.27\end{pmatrix}.\end{aligned}\]

We invert the matrices \(\boldsymbol{\Sigma}_0, \boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}\) using R’s matlib::inv(), and we obtain:

gapminder.2011.te$d.0.LDA = -4.780979901 - 0.0633308892*gapminder.2011.te$infant_mortality + 
  2.645503201*gapminder.2011.te$fertility 
gapminder.2011.te$d.1.LDA = -2.277022950 - 0.1094188053*gapminder.2011.te$infant_mortality + 
  2.313946886*gapminder.2011.te$fertility 
gapminder.2011.te$d.0.QDA = -4.657130536-0.002040555604*gapminder.2011.te$infant_mortality^2 + 
  0.00614539606*gapminder.2011.te$infant_mortality + 
  0.04390614038*gapminder.2011.te$infant_mortality*gapminder.2011.te$fertility + 
  1.811698768*gapminder.2011.te$fertility - 0.4663036203*gapminder.2011.te$fertility^2  
gapminder.2011.te$d.1.QDA = -6.700309855-0.01775013332*gapminder.2011.te$infant_mortality^2 - 
  0.1263473930*gapminder.2011.te$infant_mortality + 
  0.2427525754*gapminder.2011.te$infant_mortality*gapminder.2011.te$fertility + 
  7.005915844*gapminder.2011.te$fertility - 2.429442185*gapminder.2011.te$fertility^2 

(Is hard-coding the coefficients the best way to store the LDA/QDA discriminant functions in R? See the sketch below for an alternative.)
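One alternative is to let the MASS package do the work; a minimal sketch (not a reproduction of the computations above: MASS pools the within-class covariances for LDA and keeps normalizing constants that were dropped in the derivations, so its posteriors may differ slightly from the manual ones):

library(MASS)

model.LDA <- lda(as.factor(LE.resp)~infant_mortality+fertility, data=gapminder.2011.tr)
model.QDA <- qda(as.factor(LE.resp)~infant_mortality+fertility, data=gapminder.2011.tr)

# posterior probabilities P(Y=1|x) on the testing set
p.1.LDA.MASS <- predict(model.LDA, newdata=gapminder.2011.te)$posterior[,"1"]
p.1.QDA.MASS <- predict(model.QDA, newdata=gapminder.2011.te)$posterior[,"1"]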

Thus, \[\begin{aligned} \hat{\delta}_{0;\text{LDA}}&=-4.78-0.06x_1+2.65x_2 \\ \hat{\delta}_{1;\text{LDA}}&=-2.28-0.11x_1+2.31x_2 \\ \hat{\delta}_{0;\text{QDA}}&=-4.66-0.002x_1^2+0.01x_1+0.04x_1x_2+1.81x_2-0.47x_2^2 \\ \hat{\delta}_{1;\text{QDA}}&=-6.70-0.02x_1^2-0.13x_1+0.24x_1x_2+7.01x_2-2.43x_2^2 \end{aligned}\] With the class probability estimates

gapminder.2011.te$p.1.LDA = exp(gapminder.2011.te$d.1.LDA)/(exp(gapminder.2011.te$d.0.LDA)+exp(gapminder.2011.te$d.1.LDA))
gapminder.2011.te$p.1.QDA = exp(gapminder.2011.te$d.1.QDA)/(exp(gapminder.2011.te$d.0.QDA)+exp(gapminder.2011.te$d.1.QDA))

\[\begin{aligned} \hat{p}_{1;\text{LDA}}&=\frac{\exp(\hat{\delta}_{1;\text{LDA}})}{\exp(\hat{\delta}_{0;\text{LDA}})+\exp(\hat{\delta}_{1;\text{LDA}})},\\ \hat{p}_{1;\text{QDA}}&=\frac{\exp(\hat{\delta}_{1;\text{QDA}})}{\exp(\hat{\delta}_{0;\text{QDA}})+\exp(\hat{\delta}_{1;\text{QDA}})}\end{aligned}\] and the decision threshold set at \(\alpha=0.5\), the LDA and QDA life expectancy classifiers are defined on \(\text{Te}\) by \[\hat{C}_{\alpha;\text{LDA}}(\mathbf{x})=\begin{cases} 1 \text{ (high)}& \text{if $\hat{p}_{1;\text{LDA}}(\mathbf{x})\geq 0.5$} \\ 0 \text{ (low)} & \text{else}\end{cases}\] and \[\hat{C}_{\alpha;\text{QDA}}(\mathbf{x})=\begin{cases} 1 \text{ (high)}& \text{if $\hat{p}_{1;\text{QDA}}(\mathbf{x})\geq 0.5$} \\ 0 \text{ (low)} & \text{else}\end{cases}\]

ggpubr::ggscatter(gapminder.2011.te, x="p.1.LDA", y="LE.resp",
          shape="LE", color="LE", palette="jco", size = 3, 
          xlab="P(Y=1)", ylab = "Life Expectancy Groups", 
          title = "Gapminder 2011 Data - Testing Set Predictions - LDA")

ggpubr::ggscatter(gapminder.2011.te, x="p.1.QDA", y="LE.resp",
          shape="LE", color="LE", palette="jco", size = 3, 
          xlab="P(Y=1)", ylab = "Life Expectancy Groups", 
          title = "Gapminder 2011 Data - Testing Set Predictions - QDA")

The \(\alpha=0.5\) confusion matrices for the LDA and QDA classifiers are:

LDA.table = function(x,alpha){
  tmp = ifelse(x$p.1.LDA>alpha,1,0)
  LDA.table = table(factor(x$LE.resp, levels = 0:1),factor(tmp,levels=0:1))
} 

QDA.table = function(x,alpha){
  tmp = ifelse(x$p.1.QDA>alpha,1,0)
  QDA.table = table(factor(x$LE.resp, levels = 0:1),factor(tmp,levels=0:1))
} 

test.LDA=LDA.table(gapminder.2011.te,0.5)
test.QDA=QDA.table(gapminder.2011.te,0.5)
 LDA                       prediction
                          0      1
   actual     0          22     10
              1           0     18

 QDA                       prediction
                          0      1
   actual     0          28      4
              1           2     16

In the LDA case, we further have

  • \(\text{accuracy}=\frac{22+18}{22+10+0+18}=80\%\)

  • \(\text{misclassification rate}=\frac{10+0}{22+10+0+18}=20\%\)

  • \(\text{FPR}=\frac{0}{0+18}=0\%\)

  • \(\text{FNR}=\frac{10}{22+10}=31.25\%\)

  • \(\text{TPR}=\frac{22}{22+10}=68.75\%\)

  • \(\text{TNR}=\frac{18}{0+18}=100\%\)

In the QDA case:

  • \(\text{accuracy}=\frac{28+16}{28+4+2+16}=88\%\)

  • \(\text{misclassification rate}=\frac{4+2}{28+4+2+16}=12\%\)

  • \(\text{FPR}=\frac{2}{2+16}=11.1\%\)

  • \(\text{FNR}=\frac{4}{28+4}=12.5\%\)

  • \(\text{TPR}=\frac{28}{28+4}=87.5\%\)

  • \(\text{TNR}=\frac{16}{2+16}=88.9\%\)

At first glance, it would certainly seem that the QDA model performs better (at a decision threshold of \(\alpha=0.5\)), but the FPR is not ideal. What would be the ideal value of \(\alpha\)? How would we find it?

13.2.3 Receiver Operating Characteristic Curve

The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate (in red) for classifiers obtained by varying the decision threshold \(\alpha\) in \((0,1)\).

The important realization is that a completely random classifier would lie on the line \(\text{TPR}=\text{FPR}\) (in black); the ideal threshold is thus the one whose associated classifier is farthest from that line (in green).

Let \(\mathbf{u}(\alpha)\) be the vector from \(\mathbf{0}\) to the point \((\text{FPR}(\alpha),\text{TPR}(\alpha))\) associated with the classifier with threshold \(\alpha\), and let \(\mathbf{v}(\alpha)\) be the component of \(\mathbf{u}(\alpha)\) perpendicular to the line \(\text{TPR}=\text{FPR}\). The ideal \(\alpha^*\) satisfies: \[\begin{aligned} \alpha^*&=\arg\max_{\alpha}\{\|\mathbf{v}(\alpha)\|\}=\arg\max_{\alpha}\{\|\mathbf{v}(\alpha)\|^2\} \\ &=\arg\max_{\alpha}\left\{\left\|\mathbf{u}(\alpha)-\text{proj}_{(1,1)}\mathbf{u}(\alpha) \right\|^2\right\} \\ &=\arg\max_{\alpha}\left\{\left\|(\text{FPR}(\alpha),\text{TPR}(\alpha))-\text{proj}_{(1,1)}(\text{FPR}(\alpha),\text{TPR}(\alpha)) \right\|^2\right\} \\ &=\arg\max_{\alpha}\left\{\left\|\tfrac{1}{2}\left(\text{FPR}(\alpha)-\text{TPR}(\alpha), \text{TPR}(\alpha)-\text{FPR}(\alpha)\right) \right\|^2\right\} \\ &=\arg\max_{\alpha}\{(\text{FPR}(\alpha)-\text{TPR}(\alpha))^2\},\end{aligned}\] since the constant factor \(\tfrac{1}{2}\) does not affect the \(\arg\max\).
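A small helper implementing this criterion, given vectors of rates evaluated at candidate thresholds (the argument names are illustrative; the which() calls in the code below do essentially the same thing):

best.alpha <- function(alphas, fpr, tpr){
  # candidate threshold whose (FPR, TPR) point is farthest from the TPR = FPR line
  alphas[which.max((fpr-tpr)^2)]
}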

Example

For the LDA and QDA classifiers built with the 2011 Gapminder data (see preceding section), the false positive rates (FPR), false negative rates (FNR), true positive rates (TPR), true negative rates (TNR), and misclassification rates (MCR) when the decision threshold \(\alpha\) varies from 0.01 to 0.99 by steps of length 0.01 are:

fpr=c()
fnr=c()
tpr=c()
tnr=c()
mcr=c()

for(alpha in 1:99){
  tmp=LDA.table(gapminder.2011.te,alpha/100)
  mcr[alpha]=(tmp[1,2]+tmp[2,1])/sum(tmp)
  fpr[alpha]=tmp[2,1]/(tmp[2,1]+tmp[2,2])
  fnr[alpha]=tmp[1,2]/(tmp[1,1]+tmp[1,2])
  tnr[alpha]=tmp[2,2]/(tmp[2,1]+tmp[2,2])
  tpr[alpha]=tmp[1,1]/(tmp[1,1]+tmp[1,2])
}

plot(c(0,fpr),c(0,tpr), type = "b", pch = 21,  
     col = "red", xlim=c(0,1), ylim=c(0,1), 
     xlab="FPR",ylab="TPR", 
     main="Receiver Operating Characteristic Curve - LDA")
abline(0,1)
(index=which((fpr-tpr)^2==max((fpr-tpr)^2)))
abline(fpr[index[1]]+tpr[index[1]],-1, col="green")

Figure 13.7: ROC curve for the 2011 Gapminder LDA classifiers.

[1] 73 74
fpr=c()
fnr=c()
tpr=c()
tnr=c()
mcr=c()

for(alpha in 1:99){
  tmp=QDA.table(gapminder.2011.te,alpha/100)
  mcr[alpha]=(tmp[1,2]+tmp[2,1])/sum(tmp)
  fpr[alpha]=tmp[2,1]/(tmp[2,1]+tmp[2,2])
  fnr[alpha]=tmp[1,2]/(tmp[1,1]+tmp[1,2])
  tnr[alpha]=tmp[2,2]/(tmp[2,1]+tmp[2,2])
  tpr[alpha]=tmp[1,1]/(tmp[1,1]+tmp[1,2])
}

plot(c(0,fpr),c(0,tpr), type = "b", pch = 21,  
     col = "red", xlim=c(0,1), ylim=c(0,1), 
     xlab="FPR",ylab="TPR", 
     main="Receiver Operating Characteristic Curve - QDA")
abline(0,1)
(index=which((fpr-tpr)^2==max((fpr-tpr)^2)))
abline(fpr[index[1]]+tpr[index[1]],-1, col="green")

Figure 13.8: ROC curve for the 2011 Gapminder QDA classifiers.

[1] 28

In both frameworks, a number of models have the same \((\text{FPR},\text{TPR})\) coordinates. With the QDA model, the threshold is \(\alpha^*_{\text{QDA}}=0.28\) (coordinates \((0,0.844)\)); with the LDA model, the threshold is \(\alpha^*_{\text{LDA}}=0.73\) (coordinates \((0.056,0.906)\)). The corresponding confusion matrices are shown below.

 \(\alpha^*_{\text{QDA}}\)     prediction
                          0      1
   actual     0          27      5
              1           0     18

 \(\alpha^*_{\text{LDA}}\)     prediction
                          0      1
   actual     0          29      3
              1           1     17
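As a sanity check (a sketch, assuming the pROC package is installed), the same analysis can be run with pROC; its default “best” threshold criterion (Youden’s \(J\)) is equivalent to the distance criterion above for classifiers lying above the diagonal.

library(pROC)

roc.LDA <- roc(gapminder.2011.te$LE.resp, gapminder.2011.te$p.1.LDA)
roc.QDA <- roc(gapminder.2011.te$LE.resp, gapminder.2011.te$p.1.QDA)

auc(roc.LDA); auc(roc.QDA)                       # areas under the two ROC curves
coords(roc.LDA, "best", best.method="youden")    # threshold, specificity, sensitivity
coords(roc.QDA, "best", best.method="youden")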

Which model is best? It depends on the context of the task, and on the consequences of the choice. What makes the most sense here? Is there a danger of overfitting? Is parameter tuning acceptable, from a data massaging perspective? What effect does the choice of priors have?

References

[3]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
[32]
R. V. Hogg and E. A. Tanis, Probability and Statistical Inference, 7th ed. Pearson/Prentice Hall, 2006.