## 4.5 Additional Topics

We will finish this module by introducing and briefly discussing some additional statistical analysis topics (ANOVA, ANCOVA, MANOVA, multivariate statistics, goodness-of-fit tests). Another common application, **linear regression and its variants**, will receive a thorough treatment in subsequent modules.

### 4.5.1 Analysis of Variance

**Analysis of variance** (ANOVA) is a statistical method that partitions a dataset’s variability into **explainable variability** (model-based) and **unexplained variability** (error) using various statistical models, to determine whether (multiple) treatment groups have significantly different group means.

The **total sample variability** of a feature \(y\) in a dataset is defined as \[\text{SS}_{\textrm{tot}}=\sum_{k=1}^{N}(y_{k}-\overline{y})^{2},\] where \(\overline{y}\) is the overall mean of the data.

Let us return to the teaching method example of Hypothesis Testing with `R`.

The mean of the grades, for all students, is:

`[1] 77.0625`

The plot below shows all the students’ scores, ordered by participant ID; the overall mean is displayed for comparison.

Since the assignment of ID is **arbitrary** (at least, in theory), we do not observe any patterns – if we were to guess someone’s score with no knowledge except for their participant ID, then picking the sample mean is as good a guess as any other reasonable guess.

Statistically speaking, this means that the **null model** \[y_{i,j}=\mu+\varepsilon_{i,j},\] where \(\mu\) is the **overall mean**, \(i\in\{A,B\}\), and \(j=1,\ldots,40\), does not explain any of the variability in the student scores (as usual, \(\varepsilon_{i,j}\) represents the departure, or noise, from the model prediction).

But the students DID NOT all receive the same treatment: \(40\) randomly selected students were assigned to group \(A\), and the other \(40\) to group \(B\), and each group was taught using a different method.

When we add this information onto the plot, we clearly see that the two study groups show different characteristics in terms of their average scores.

```
# y.bar.A and y.bar.B hold the mean grade in groups A and B, respectively
y.bar.A <- mean(teaching$Grade[teaching$Group == "A"])
y.bar.B <- mean(teaching$Grade[teaching$Group == "B"])

ggplot2::ggplot(teaching, ggplot2::aes(x = ID, y = Grade, colour = Group, shape = Group)) +
  ggplot2::geom_point() +
  ggplot2::geom_hline(ggplot2::aes(yintercept = y.bar.B), col = "#00BFC4") +
  ggplot2::geom_hline(ggplot2::aes(yintercept = y.bar.A), col = "#F8766D") +
  ggplot2::theme_bw()
```

With the group assignment information, we can refine our null model into the **treatment-based model** \[y_{i,j}=\mu_{i}+\varepsilon_{i,j},\] where \(\mu_i\), \(i\in\{A,B\}\), represent the group means. Using this model, we can decompose \(\text{SS}_{\textrm{tot}}\) into **between-treatment sum of squares** and **error (within-treatment) sum of squares** as \[\begin{aligned} \text{SS}_{\textrm{tot}}&=\sum_{i,j}(y_{i,j}-\overline{y})^{2}=\sum_{i,j}(y_{i,j}-\overline{y}_{i}+\overline{y}_{i}-\overline{y})^{2}\\ &=\sum_{i}N_{i}(\overline{y}_{i}-\overline{y})^{2}+\sum_{i,j}(y_{i,j}-\overline{y}_{i})^2=\text{SS}_{\textrm{treat}}+\text{SS}_{\textrm{e}}\end{aligned}\]

The \(\text{SS}_{\textrm{treat}}\) component looks at the difference between each of the treatment means and the overall mean, which we consider to be **explainable**^{49}; the \(\text{SS}_{\textrm{e}}\) component, on the other hand, looks at the difference between each observation and its own group mean, and is considered to be **random**.^{50}

Thus, \(\text{SS}_{\textrm{treat}}/\text{SS}_{\textrm{tot}}\times 100\)% of the total variability can be explained using a treatment-based model. This ratio is called the **coefficient of determination**, denoted by \(R^{2}\).
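The decomposition and the resulting \(R^2\) can be sketched in `R`; the simulated two-group data below (group labels, means, and sample sizes) are assumptions for illustration, not the teaching dataset.

```
# Simulated two-group data (values are illustrative only)
set.seed(0)
group <- rep(c("A", "B"), each = 40)
y     <- ifelse(group == "A", 75, 79) + rnorm(80, sd = 2.5)

y.bar    <- mean(y)                    # overall mean
SS.tot   <- sum((y - y.bar)^2)         # total sum of squares
SS.treat <- sum(tapply(y, group,       # between-treatment sum of squares
                       function(g) length(g) * (mean(g) - y.bar)^2))
SS.e     <- sum((y - ave(y, group))^2) # within-treatment sum of squares
                                       # (ave() replaces each y by its group mean)

all.equal(SS.tot, SS.treat + SS.e)     # the decomposition is exact
SS.treat / SS.tot                      # coefficient of determination R^2
```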

Formally, the ANOVA table incorporates a few more items – the table below summarizes all the information that it contains.

| Source | Sum of Squares | df | Mean Square | \(\mathbf{F_{0}}\) | \(\mathbf{p}\)-value |
|---|---|---|---|---|---|
| Treatment (Model) | \(\text{SS}_{\textrm{treat}}\) | \(p-1\) | \(\text{MS}_{\textrm{treat}}=\text{SS}_{\textrm{treat}}/(p-1)\) | \(\text{MS}_{\textrm{treat}}/\text{MS}_{\textrm{e}}\) | \(P(F_0>F^*)\) |
| Error | \(\text{SS}_{\textrm{e}}\) | \(N-p\) | \(\text{MS}_{\textrm{e}}=\text{SS}_{\textrm{e}}/(N-p)\) | | |
| Total | \(\text{SS}_{\textrm{tot}}\) | \(N-1\) | | | |

The specific table for the teaching methodology dataset can be obtained by applying `anova()` to the output of the `lm()` function, as in `anova(lm(Grade ~ Group, data = teaching))`:

```
Analysis of Variance Table
Response: Grade
Df Sum Sq Mean Sq F value Pr(>F)
Group 1 300.31 300.312 49.276 7.227e-10 ***
Residuals 78 475.38 6.095
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Or, in a prettier format:

| Source | Sum of Squares | df | Mean Square | \(\mathbf{F_{0}}\) | \(\mathbf{p}\)-value |
|---|---|---|---|---|---|
| Treatment (Model) | \(300.31\) | \(1\) | \(300.31\) | \(49.28\) | \(7.2\times 10^{-10}\) *** |
| Error | \(475.38\) | \(78\) | \(6.095\) | | |
| Total | \(775.69\) | \(79\) | | | |

The test statistic \(F_{0}\) follows an \(F\)-distribution with \((\text{df}_{\textrm{treat}}, \text{df}_{\textrm{e}})=(1,78)\) degrees of freedom. At a significance level of \(\alpha=0.05\), the critical value \(F^*=F_{0.95, 1, 78}=3.96\) is substantially smaller than the test statistic \(F_{0}=49.28\), implying that the two-treatment model is statistically significant.

This, in turn, means that the model recognises a statistically significant difference between the students’ scores, based on the teaching methods.
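The critical value and \(p\)-value quoted above can be recovered from base `R`'s \(F\)-distribution functions:

```
qf(0.95, df1 = 1, df2 = 78)                        # critical value F* ~ 3.96
pf(49.276, df1 = 1, df2 = 78, lower.tail = FALSE)  # p-value ~ 7.2e-10
```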

The coefficient of determination \(R^2\) provides a way to measure the model’s explanatory power; in `R`, its value for the teaching example is:

`[1] 0.3871566`

Indeed, from the ANOVA table, we compute \[R^{2}=\frac{\text{SS}_{\textrm{treat}}}{\text{SS}_{\textrm{tot}}}=\frac{300.31}{775.69}\approx 0.39,\] which means that \(39\)% of the total variation in the data can be explained by the two-treatment model.

Is this good enough? That depends on the specifics of the situation (in particular, on the researcher’s or the client’s needs).

#### Diagnostic Checks

As with most statistical procedures, ANOVA relies on certain assumptions for its result to be valid. Recall that the model is given by \[y_{i,j}=\mu_{i}+\varepsilon_{i,j}.\] What assumptions are made?

The main assumption is that the error terms follow independently and identically distributed (iid) normal distributions (i.e., \(\varepsilon_{i,j}\stackrel{\text{iid}}{\sim}\mathcal{N}(0,\sigma^{2})\)).

Assuming independence, we are required to verify three additional assumptions:

- normality of the error terms;
- constant variance (within treatment groups), and
- equal variances (across treatment groups).

Normality of the errors can be tested visually with the help of a **normal-QQ plot**, which compares the **standardized residuals quantiles** against the **theoretical quantiles** of the standard normal distribution \(\mathcal{N}(0,1)\) (a straight line indicates normality).

In other words, if the errors are normally distributed with mean \(0\) and variance \(\sigma^2\), we would expect that the \(80\) standardized residuals \(r_{i,j}=\frac{\varepsilon_{i,j}-0}{\sigma}\) should behave as though they had been drawn from \(\mathcal{N}(0,1)\).

The plots above show some departure from normality in the lower tail; however, moderate departure from normality is usually acceptable as long as it is mostly a tail phenomenon.

To test the assumption of constant variance, we can run a visual inspection using:

- residuals vs. fitted values, and/or
- residuals vs. order/time.

The standardized residuals in both groups should be approximately distributed according to \(\mathcal{N}(0,1)\). The plots also show that variability from the mean in each treatment group is reasonably similar.^{51}

More formally, equality of variances is often tested using **Bartlett’s test** (when normality of the residuals is met) or the **modified Levene’s test** (when it is not).

Assuming that we felt the evidence of normal residuals was warranted in the two-treatment model of the teaching dataset, Bartlett’s test (via `bartlett.test(Grade ~ Group, data = teaching)`) yields a \(p-\)value of \(0.57\):

```
Bartlett test of homogeneity of variances
data: Grade by Group
Bartlett's K-squared = 0.32192, df = 1, p-value = 0.5705
```

Otherwise, we get a \(p-\)value of \(0.76\) for Levene’s test.

```
(L.T <- lawstat::levene.test(teaching$Grade, teaching$Group,
                             location = "median",
                             correction.method = "zero.correction"))
```

```
Modified robust Brown-Forsythe Levene-type test based on the absolute
deviations from the median with modified structural zero removal
method and correction factor
data: teaching$Grade
Test Statistic = 0.095106, p-value = 0.7586
```

In either case, the \(p-\)value falls above reasonable significance levels (0.05, say), which means that we cannot reject the null hypothesis of equal variance.

When there are \(p>2\) treatment groups, ANOVA provides a test for \[H_{0}: \mu_{1} = \cdots = \mu_{p}\quad\text{vs.}\quad H_{1}: \mu_{i} \neq \mu_{j} \text{ for at least one } i \neq j.\]

A significant \(F_0\) value indicates that **there is at least one group which differs from the others**, but it does not specify which one(s) that may be.

Specialized methods such as **Scheffe’s method** and **Tukey’s test** can be used to identify the statistically different treatments.
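Tukey’s test, for instance, is available in base `R` through `TukeyHSD()`; the three-group scores below are simulated for illustration (group `C` is built to differ from the other two).

```
# Simulated three-treatment example (all values are illustrative)
set.seed(1)
scores <- data.frame(
  grade = c(rnorm(20, mean = 70), rnorm(20, mean = 70), rnorm(20, mean = 76)),
  group = rep(c("A", "B", "C"), each = 20)
)
fit <- aov(grade ~ group, data = scores)
TukeyHSD(fit)  # pairwise CIs for mean differences; those excluding 0 flag a difference
```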

Finally, while ANOVA can accommodate unequal treatment group sizes, it is recommended to keep those sizes equal across all groups – this makes the test statistic less sensitive to violations of the assumption of equal variances across treatment groups, providing yet another reason to involve the analysts/consultants in the **data collection process**.

### 4.5.2 Analysis of Covariance

In a previous section, we looked at the effectiveness of a new teaching method by assigning each group to a specific treatment and comparing the mean test scores. A crucial assumption for that model is that the subjects in each group have **similar background knowledge** about statistics prior to the three-week lectures.

If this assumption is wrong, however, we may be making incorrect decisions based on the model. Even if each group had similar background knowledge *on average*, there may be large variability from person-to-person, masking the true treatment effect.

#### Paired Comparison

One way to avoid such **subject-to-subject variability** is to administer both treatments to each individual, and then compare treatment effects by looking at the **difference in the outcomes**. For instance, if a grocery chain is interested in measuring the effectiveness of two advertising campaigns, it could be reasonable to assume that there is a large variability in total sales, as well as popular items sold, at each store.

It may then be preferable to run both campaigns in each store and analyze the resulting data rather than to split the stores into two groups (in each of which a different advertising campaign is run) and then to compare the mean outcomes in the two groups.

Formally, let \(X_{i,1}\) denote the total sales with campaign \(A\) and \(X_{i,2}\) the total sales with campaign \(B\). The quantity of interest is the **difference** \(D_{i}=X_{i,1}-X_{i,2}\) for each store \(i=1,\ldots,N\).

Assuming that the differences \(D_{i}\) follow an iid normal distribution with mean \(\delta\) and variance \(\sigma^{2}_{d}\), we test for \[H_{0}: \delta=0\quad\mbox{against}\quad H_{1}: \delta \neq 0\] using the test statistic \[t_0=\sqrt{N}\frac{\overline{D}}{s_{d}},\] which follows a Student’s \(t\) distribution with \(N-1\) degrees of freedom under \(H_{0}\); thus we reject \(H_{0}\) at significance level \(\alpha\) if \(|t_{0}|>t_{N-1}(\alpha/2)\), i.e., if the observed test statistic’s \(p\)-value is less than \(\alpha\).
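A minimal sketch of this paired comparison in `R`, with fabricated store sales (the means and variances are assumptions for illustration); `t.test(..., paired = TRUE)` is equivalent to a one-sample \(t\) test on the differences.

```
set.seed(2)
base.sales <- rnorm(12, mean = 100, sd = 25)    # large store-to-store variability
x1 <- base.sales + rnorm(12, mean = 3, sd = 2)  # total sales under campaign A
x2 <- base.sales + rnorm(12, mean = 0, sd = 2)  # total sales under campaign B

t.test(x1, x2, paired = TRUE)                   # tests H0: delta = 0
d <- x1 - x2
sqrt(length(d)) * mean(d) / sd(d)               # same t0 as reported above
```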

#### ANOVA vs. ANCOVA

ANOVA compares multiple group means and tests whether any of the group means differ from the rest, by breaking down the total variability into a treatment (explainable) variability component and an error (unexplained) variability component, and building a ratio \(F_0\) to determine whether or not to reject \(H_{0}\).

**Analysis of covariance** (ANCOVA) introduces **concomitant variables** (or **covariates**) to the ANOVA model, splitting the total variability into 3 components: \(\text{SS}_{\textrm{treat}}\), \(\text{SS}_{\textrm{con}}\), and \(\text{SS}_{\textrm{e}}\), aiming to reduce error variability.

The choice of covariates is thus crucial to running a successful ANCOVA. In order to be useful, a concomitant variable must be related to the response variable in some way; otherwise, it not only fails to reduce the error variability, but it also increases the model’s complexity:

- in the teaching method example, we could consider administering a pre-study test to measure the **prior knowledge level** of each participant and use this score as a concomitant variable;
- in the advertising campaign example, we could have used the **previous month’s sales** as a covariate;
- in medical studies, we could use the **age** and **weight** of subjects, say.

Importantly, concomitant variables should not be affected by treatments. As an example, suppose that the patients in a medical study were asked:

How strongly do you believe that you were given actual medication rather than a placebo?

If the treatment is indeed effective, then a participant’s response to this question could be **markedly different** in the treatment group than in the placebo group.^{52}

This means that the true treatment effect may be masked by the concomitant variable, due to its unequal effects on the treatment groups. Note that **qualitative covariates** (such as gender, say) are not part of the ANCOVA framework – indeed, such covariates create new ANOVA treatment groups instead.

When moving from an ANOVA to an ANCOVA model, the error variability is further split into a **pure error** and a **covariate** component, while the **treatment** variability remains unchanged.

#### ANCOVA Model and Assumptions

Suppose that we are testing the effect of \(p\) treatments, with \(N_{j}\) subjects in each group. Then the ANCOVA model takes the form \[y_{i,j}=\mu+\tau_{j}+\gamma (x_{i,j}-\overline{x})+\varepsilon_{i,j}\] where

- \(y_{i,j}\) is the response of the \(i^{\text{th}}\) subject in the \(j^{\text{th}}\) treatment group;
- \(\mu\) is the overall mean;
- \(\tau_{j}\) is the \(j^{\text{th}}\) treatment effect, subject to the constraint \[\sum_{j=1}^{p}\tau_{j}=0;\]
- \(\gamma\) is the coefficient of the **covariate effect**;
- \((x_{i,j}-\overline{x})\) is the covariate value of the \(i^{\text{th}}\) subject in the \(j^{\text{th}}\) treatment group, adjusted by the mean, and
- \(\varepsilon_{i,j}\) is the error of the \(i^{\text{th}}\) subject in the \(j^{\text{th}}\) treatment group.

Additionally, four assumptions must be satisfied:

- **independence and normality of residuals** – the residuals follow an iid normal distribution with mean \(0\) and variance \(\sigma^{2}_{\varepsilon}\);
- **homogeneity of residual variances** – the variance of the residuals is uniform across treatment groups;
- **homogeneity of regression slopes** – the regression effect (slope) is uniform across treatment groups, and
- **linearity of regression** – the regression relationship between the response and the covariate is linear.

The first of these assumptions can be tested with the help of a QQ-plot and a scatter plot of residuals vs. fitted values, while the second may use the Bartlett or the Levene test. The final assumption is not as crucial as the other three, however, and various remedial methods can be applied should any of these assumptions fail.

The third assumption, however, is **crucial** to the ANCOVA model; it can be tested with the **equal slope test**, which requires an ANCOVA regression with an additional interaction term \(x \times \tau\). If the interaction is not significant, the third assumption is satisfied.

In the event that the interaction term is statistically significant, a different approach (e.g. moderated regression analysis, mediation analysis) is required since using the original ANCOVA model is not prescribed.
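The equal slope test amounts to fitting the ANCOVA model with an added interaction term and inspecting its significance; here is a minimal sketch on simulated data (variable names and effect sizes are assumptions), where the true slopes are equal across groups by construction:

```
set.seed(3)
n     <- 60
group <- factor(rep(c("A", "B"), each = n / 2))
x     <- rnorm(n)                               # covariate (e.g., a pre-study score)
y     <- 5 + 2 * x + (group == "B") + rnorm(n)  # equal slopes, different intercepts

anova(lm(y ~ x * group))  # a non-significant x:group row supports equal slopes
```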

An in-depth application of an ANCOVA model is highlighted in Case Study: IBS.

### 4.5.3 Basics of Multivariate Statistics

To this point, we have only considered situations where the response is **univariate**. In applications, the situation often calls for **multivariate** responses, where the response variables are thought to have some relationship to one another (e.g., a **correlation structure**).

It remains possible to analyze each response variable independently, but the dependence structure can be exploited to make **joint** (or simultaneous) inferences.

#### Properties of the Multivariate Normal Distribution

The probability density function of a multi-dimensional random vector \(\mathbf{X}\in\mathbb{R}^p\) that follows a **multivariate normal distribution** with **mean vector** \(\mathbf{\mu}\) and **covariance matrix** \(\mathbf{\Sigma}\), denoted by \(\mathbf{X}\sim \mathcal{N}_p(\mathbf{\mu},\mathbf{\Sigma})\), is given by \[f(\mathbf{X})=\frac{1}{(2\pi)^{p/2}\det(\mathbf{\Sigma})^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{X}-\mathbf{\mu})^{\!\top}\mathbf{\Sigma}^{-1}(\mathbf{X}-\mathbf{\mu})\right),\] where \[\mathbf{\Sigma}=\begin{bmatrix} \sigma_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,p}\\ \sigma_{2,1} & \sigma_{2,2} & \cdots & \sigma_{2,p}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{p,1} & \sigma_{p,2} & \cdots & \sigma_{p,p}\\ \end{bmatrix}.\]

For such an \(\mathbf{X}\), the following properties hold:

1. any linear combination of its components is normally distributed;
2. all subsets of its components follow a (modified) multivariate normal distribution;
3. a diagonal covariance matrix implies the independence of its components;
4. conditional distributions of components follow a normal distribution, and
5. the quantity \((\mathbf{X}-\mathbf{\mu})^{\!\top}\mathbf{\Sigma}^{-1}(\mathbf{X}-\mathbf{\mu})\) follows a \(\chi^{2}_{p}\) distribution.

These properties make the multivariate normal distribution attractive, from a theoretical point of view (if not always entirely realistic).

For instance:

- using property 1, we can use **contrasts** to test which components are distinct from the others;
- property 5 is the multivariate analogue of the square of a standard normal random variable \(Z\sim \mathcal{N}(0,1)\) following a \(Z^2\sim \chi^2_1\) distribution;
- note, however, that two univariate normal random variables with zero covariance are not necessarily independent (the joint p.d.f. of two such variables is not necessarily the p.d.f. of a multivariate normal distribution).

#### Hypothesis Testing for Mean Vectors

When the sample comes from a univariate normal distribution, we can test \[H_{0}: \mu=\mu_{0}\quad\mbox{against}\quad H_{1}: \mu \neq \mu_{0}\] by using a \(t-\)statistic. Analogously, if the sample comes from a \(p-\)variate normal distribution, we can test \[H_{0}: \mathbf{\mu}=\mathbf{\mu_{0}}\quad\mbox{against}\quad H_{1}: \mathbf{\mu} \neq \mathbf{\mu_{0}}\] by using **Hotelling’s \(\mathbf{T}^2\) test statistic** \[T^{2}=N\cdot (\mathbf{\overline{X}}-\mathbf{\mu})^{\!\top}\mathbf{S}^{-1}(\mathbf{\overline{X}}-\mathbf{\mu}),\] where \(\mathbf{\overline{X}}\) denotes the **sample mean**, \(\mathbf{S}\) the **sample covariance matrix**, and \(N\) the sample size.

Under \(H_{0}\), \[T^{2}\sim \frac{(N-1)p}{(N-p)}F_{p, N-p}.\] Thus, we do not reject \(H_{0}\) at a significance level of \(\alpha\) if \[N\cdot (\mathbf{\overline{X}}-\mathbf{\mu_{0}})^{\!\top}\mathbf{S}^{-1}(\mathbf{\overline{X}}-\mathbf{\mu_{0}}) \leq \frac{(N-1)p}{(N-p)}F_{p, N-p}(\alpha)\] and reject it otherwise.
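The test can be carried out by hand in a few lines of `R`; the bivariate sample below is simulated, with \(H_0\) true by construction (sample size, means, and \(\alpha\) are assumptions for illustration):

```
set.seed(4)
N <- 50; p <- 2
X <- cbind(rnorm(N, mean = 10), rnorm(N, mean = 5))  # rows = observations
mu0   <- c(10, 5)                                    # hypothesized mean vector
x.bar <- colMeans(X)                                 # sample mean vector
S     <- cov(X)                                      # sample covariance matrix

T2   <- drop(N * t(x.bar - mu0) %*% solve(S) %*% (x.bar - mu0))
crit <- (N - 1) * p / (N - p) * qf(0.95, p, N - p)   # threshold at alpha = 0.05
c(T2 = T2, critical = crit)                          # do not reject H0 if T2 <= crit
```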

#### Confidence Region and Simultaneous Confidence Intervals for Mean Vectors

In the \(p-\)variate normal distribution, any \(\mathbf{\mu}\) that satisfies the condition \[N\cdot (\mathbf{\overline{X}}-\mathbf{\mu})^{\!\top}\mathbf{S}^{-1}(\mathbf{\overline{X}}-\mathbf{\mu}) \leq \frac{(N-1)p}{(N-p)}F_{p, N-p}(\alpha)\] resides inside a \((1-\alpha)100\)% **confidence region** (an ellipsoid in this case).

**Simultaneous Bonferroni confidence intervals** for the components \(\mu_{j}\), with overall error rate at most \(\alpha\), can also be derived, using \[\overline{x}_{j}\pm t_{N-1}(\alpha/(2p))\sqrt{\frac{s_{j,j}}{N}} \text{ for }j=1,\ldots, p.\]

Another approach is to use **Hotelling’s \(\mathbf{T}^2\) simultaneous confidence intervals**, given by \[\overline{x}_{j}\pm \sqrt{\frac{p(N-1)}{N-p}F_{p,N-p}(\alpha)} \sqrt{\frac{s_{j,j}}{N}} \text{ for }j=1,\ldots, p.\]
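For a quick comparison of the two half-width multipliers (the values of \(N\), \(p\), and \(\alpha\) below are arbitrary; the Bonferroni multiplier is taken at the two-sided level \(\alpha/(2p)\)):

```
N <- 50; p <- 2; alpha <- 0.05
bonf <- qt(1 - alpha / (2 * p), df = N - 1)                    # Bonferroni multiplier
hotl <- sqrt(p * (N - 1) / (N - p) * qf(1 - alpha, p, N - p))  # Hotelling multiplier
c(bonferroni = bonf, hotelling = hotl)  # Bonferroni is the narrower of the two
```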

Figure @(fig:testA7) shows these regions for a bivariate normal random sample. Note that the Hotelling’s \({T}^{2}\) simultaneous confidence intervals form a rectangle (in grey) that confines the confidence region, while the Bonferroni confidence intervals (in blue) are slightly narrower.

Given that all the components of the mean vector are correlated (since the covariance matrix is generally non-diagonal), the confidence region should be used if the goal is to study the **plausibility of the mean vector as a whole**, while Bonferroni confidence intervals may be more suitable when **component-wise confidence intervals** are needed.

#### Multivariate Analysis of Variance

ANOVA is often used as a first attempt to determine whether the means from every sub-population are identical.

ANOVA can test means from more than two populations; the **multivariate ANOVA** (MANOVA) is quite simply a multivariate extension of ANOVA which tests whether the mean vectors from all sub-populations are identical.

Assume there are \(I\) sub-populations in the population, from each of which \(N_i\) \(p-\)dimensional responses are drawn, for \(i=1,\ldots,I\).

Each observation can be expressed as: \[\mathbf{X}_{i,j}=\mathbf{\mu}+\mathbf{\tau}_{i}+\mathbf{\varepsilon}_{ij},\] where \(\mathbf{\mu}\) is the **overall mean vector**, \(\mathbf{\tau}_{i}\) is the \(i^{\text{th}}\) **population-specific treatment effect**, and \(\mathbf{\varepsilon}_{ij}\) is the **random error**, which follows a \(N_{p}(\mathbf{0},\mathbf{\Sigma})\) distribution.

It is important to note that the covariance matrix \(\mathbf{\Sigma}\) is assumed to be the same for each sub-population, and that \[\sum_{i=1}^{I}N_{i}\mathbf{\tau}_{i}=\mathbf{0}\] to ensure that the estimates are uniquely identifiable.

To test the hypothesis \[H_{0}: \mathbf{\tau}_{1}=\cdots=\mathbf{\tau}_{I}=\mathbf{0}\quad\mbox{against}\quad H_{1}: \text{some } \mathbf{\tau}_{i}\neq \mathbf{0},\] we decompose the **total sum of squares and cross-products** \(\textrm{SSP}_{\textrm{tot}}\) into \[\textrm{SSP}_{\textrm{tot}}=\textrm{SSP}_{\textrm{treat}}+\textrm{SSP}_{\textrm{e}}.\]

Based on this decomposition, we compute the test statistic known as **Wilks’ lambda** \[\Lambda^{*}=\frac{|\mathbf{W}|}{|\mathbf{B}+\mathbf{W}|},\] where \(\mathbf{B},\mathbf{W}\) are as in the MANOVA table below:

| Source | SSP | df | MSP | \(\mathbf{F_0}\) |
|---|---|---|---|---|
| Treatment | \(\mathbf{B}\) | \(I-1\) | \(\mathbf{B}/(I-1)\) | \(\mathbf{W}^{-1}\mathbf{B}\) |
| Error | \(\mathbf{W}\) | \(\sum_{i=1}^{I}N_{i}-I\) | \(\mathbf{W}/\sum_{i=1}^I(N_i-1)\) | |
| Total | \(\mathbf{B+W}\) | \(\sum_{i=1}^{I}N_{i}-1\) | \((\mathbf{B+W})/(\sum_{i=1}^{I}N_{i}-1)\) | |

We have \[\mathbf{B}=\sum_{i=1}^{I}N_{i}(\mathbf{\bar{X}}_{i}-\mathbf{\bar{X}})(\mathbf{\bar{X}}_{i}-\mathbf{\bar{X}})^{\!\top}\] and \[\mathbf{W}=\sum_{i=1}^{I}\sum_{j=1}^{N_{i}}(\mathbf{X}_{ij}-\mathbf{\bar{X}}_{i})(\mathbf{X}_{ij}-\mathbf{\bar{X}}_{i})^{\!\top};\] we reject \(H_{0}\) if \(\Lambda^{*}\) is below some pre-specified threshold, which depends on \(p\), \(I\), and the \(N_i\), \(i=1,\ldots, I\).
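In `R`, a MANOVA can be fit with `manova()`, with Wilks’ lambda requested in the summary; the two-component responses below are simulated for illustration (only the first component’s mean varies across the three groups):

```
set.seed(5)
g <- factor(rep(1:3, each = 30))
Y <- cbind(rnorm(90, mean = as.numeric(g)),  # component 1: group means 1, 2, 3
           rnorm(90, mean = 0))              # component 2: no group effect
fit <- manova(Y ~ g)
summary(fit, test = "Wilks")                 # reports Lambda* and an approximate F
```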

### 4.5.4 Goodness-of-Fit Tests

A (fictitious) 2017 survey asked a sample of \(N=200\) adults between the ages of 25 and 35 about their highest educational achievement. The results are summarized below.

| Year | \(\mathbf{<}\)HS | HS | CU | CU\(\mathbf{+}\) |
|---|---|---|---|---|
| \(2017\) | \(16\) | \(55\) | \(83\) | \(46\) |

In a 1997 survey, it was also found that:

- \(p_1=13\)% of adults had not completed high school;
- \(p_2=32\)% had obtained a high school degree but not a post-secondary degree;
- \(p_3=37\)% had either an undergraduate college or university diploma but no post-graduate degree, and
- \(p_4=18\)% had at least one post-graduate degree.

Based on the results of this survey, is there sufficient evidence to believe that the educational backgrounds of the population have changed between 1997 and 2017?

Since each respondent’s educational achievement can only be classified into one of these categories, they are **mutually exclusive**. Furthermore, these categories cover all possibilities on the educational front, so they are also **exhaustive**.

We can thus view the distribution of educational achievements as being **multinomial**. For such a distribution, with parameters \(p_{1},\cdots,p_{k}\), the expected frequency in each category is \(m_{j}=Np_{j}\).

Let \(O_{j}\) denote the observed frequency for the \(j^{\text{th}}\) category. If there has been no real change since 1997, we would expect the sum of squared differences between the observed 2017 frequencies and the expected frequencies based on 1997 data to be small.

We can use this information to test the **goodness-of-fit** between the observations and the expected frequencies *via* Pearson’s \(\chi^{2}\) test statistic \[X^{2}=\sum_{j=1}^{k}\frac{(O_{j}-m_{j,0})^{2}}{m_{j,0}},\] which, under the null hypothesis, approximately follows a \(\chi^{2}\) distribution with \(k-1\) degrees of freedom.

In the above example, the hypotheses of interest are \[H_{0}: \mathbf{p}=\mathbf{p}^*=(0.13,0.32,0.37,0.18)\quad\mbox{vs}\quad H_{1}: \mathbf{p}\neq \mathbf{p}^*.\]

The table below summarizes the information under \(H_{0}\).

| Category | \({O}_{j}\) | \({p}_{j,0}\) | \({m}_{j,0}\) | \(({O}_{j}-{m}_{j,0})^2/{m}_{j,0}\) |
|---|---|---|---|---|
| \(1\) | \(16\) | \(0.13\) | \(26\) | \(3.846\) |
| \(2\) | \(55\) | \(0.32\) | \(64\) | \(1.266\) |
| \(3\) | \(83\) | \(0.37\) | \(74\) | \(1.095\) |
| \(4\) | \(46\) | \(0.18\) | \(36\) | \(2.778\) |
| Total | \(200\) | \(1\) | \(200\) | \(8.984\) |

Pearson’s test statistic is \(X^{2}=8.984\), with an associated \(p-\)value of \(0.0295\), which implies that there is enough statistical evidence (at the \(\alpha=0.05\) level) to conclude that the population’s educational achievements have changed over the last 20 years.
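The test can be run directly with `chisq.test()`, using the observed 2017 counts and the 1997 proportions:

```
observed <- c(16, 55, 83, 46)
p.1997   <- c(0.13, 0.32, 0.37, 0.18)
chisq.test(observed, p = p.1997)   # X-squared ~ 8.98, df = 3, p-value ~ 0.0295
```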