4.4 Hypothesis Testing
Consider the following scenario: person A claims they have a fair coin, but for some reason, person B is suspicious of the claim, believing the coin to be biased in favour of tails.
Person B flips the coin \(10\) times, expecting a low number of heads, which they intend to use as evidence against the claim. Let \(X=\) # of Heads.
Suppose \(X=4\). This is less than expected for a binomial random variable \(X\sim \mathcal{B}(10,0.5)\) since \(\text{E}[X]=5\); the results are more in line with a coin for which \(P(\text{Head})=0.4\).
Does this data really constitute evidence against the claim \(P(\text{Head})=0.5\)?
If the coin is fair, then \(X \sim \mathcal{B}(10,0.5)\) and \(X=4\) is still close to \(\text{E}[X]\); in fact, \(P(X=4)=0.205\) (as opposed to \(P(X=5)=0.246\)) so the event \(X=4\) is still quite likely. It would seem that there is no real evidence against the claim that the coin is fair.
The way the sentence “It would seem that there is no evidence against the claim that the coin is fair” is worded is very important.
We did not reject the claim that \(P(\text{Head})=0.5\) (i.e. that the coin is symmetric), but this doesn’t mean that, in fact, \(P(\text{Head})=0.5\). Not rejecting (which is not quite the same as “accepting”) a claim is a weak statement.
To see why, let’s consider person C, who claims that the coin from the example above has \(P(\text{Head})=0.3\). Under \(X\sim \mathcal{B}(10,0.3)\), the event \(X=4\) is still quite likely, with \(P(X=4)=0.22\); we do not have enough evidence to reject either \(P(\text{Head})=0.5\) or \(P(\text{Head})=0.3\).
However, rejecting a claim is a strong statement! Let’s say that person B convinces person A to flip the coin another \(90\) times. In the second round of flips, \(36\) Heads occur, giving a total of \(40\) Heads out of \(100\) coin flips.
What can we say now? Does this constitute any evidence against the claim? If so, how much?
Let \(Y\sim \mathcal{B}(100,0.5)\) (i.e., the coin is fair); \(Y=40\) is smaller than what we would expect if the claim is true, as \(\text{E}[Y]=50\), so \(Y=40\) is again more in agreement with \(P(\text{Head})=0.4\).
But the event \(Y=40\) does not lie in the probability mass centre of the distribution as \(X=4\) did; rather, it falls in the distribution tail (an area of lower probability).
For \(Y\sim \mathcal{B}(100,0.5)\), \(P(Y=40)=0.011\) (compare this with the previous value \(P(X=4)=0.205\)). Thus, if the coin is fair, the event \(Y=40\) is quite unlikely.
Values down in the lower tail (or up in the upper tail) provide some evidence against the claim. The question is, how much evidence? How do we quantify it?
Since values that are “further down the left tail” provide evidence against the claim of a fair coin (in favour of a coin biased against Heads), we will use the actual tail area that goes with the observation: the smaller the tail area, the greater the evidence against the claim (and vice-versa).
For \(4\) Heads out of \(10\) tosses, the evidence is the \(p-\)value \(P(X\leq4)\), i.e. \[P(X\leq 4\mid X\sim\mathcal{B}(10,0.5))=0.377.\] Thus, if \(P(\text{Head})=0.5\), the event \(X\le 4\) is still very likely: we would see evidence that extreme (or more) \(\approx 38\)% of the time (simply by chance).
For \(40\) Heads out of \(100\) tosses, the evidence is the \(p-\)value \(P(Y\leq40)\), i.e. \[P(Y\leq 40\mid Y\sim\mathcal{B}(100,0.5))=0.028.\] Thus, if \(P(\text{Head})=0.5\), the event \(Y\le 40\) is very unlikely: we would only see evidence that extreme (or more) \(\approx 3\)% of the time. A claim’s \(p-\)value is the tail area of the test statistic’s distribution, computed under the assumption that the claim is true: \[\text{{smaller $p-$value}} \Longleftrightarrow \text{{more evidence against claim}}.\]
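These tail areas are easy to verify numerically; here is a quick sketch in Python, using only the standard library (the helper `binom_cdf` is our own):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p10 = binom_cdf(4, 10, 0.5)     # p-value for 4 Heads in 10 tosses, ~0.377
p100 = binom_cdf(40, 100, 0.5)  # p-value for 40 Heads in 100 tosses, ~0.028
```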
Vocabulary of Hypothesis Testing
A specific language and notation has evolved to describe this approach to “testing hypotheses”:
the “claim” is called the null hypothesis and is denoted by \(H_0\).
the “suspicion” is called the alternative hypothesis and is denoted by \(H_1\);
the (random) quantity we use to measure evidence is called a test statistic – we need to know its distribution when \(H_0\) is true, and
the \(p-\)value quantifies “the evidence against \(H_0\)”.
Consider the coin tossing situation described previously. The null hypothesis is \[H_0: P(\text{Head})=0.5\,.\] The alternative hypothesis is \[H_1: P(\text{Head})<0.5\,.\] The coin is tossed \(n\) times; the test statistic is the number of heads \(X\) in \(n\) tosses.
If \(n=10\) and \(X=4\), the \(p-\)value is \[P(X\leq 4\mid X\sim \mathcal{B}(10,0.5))=0.377,\] on the basis of which we would not reject the null hypothesis that the coin was fair.
If \(n=100\) and \(X=40\), the \(p-\)value is \[P(X\leq 40\mid X\sim \mathcal{B}(100,0.5))=0.028,\] on the basis of which we would reject the null hypothesis that the coin was fair, in favour of the alternative that it was not.
How Small Does the \(p-\)Value Need to Be?
We concluded that \(37.7\)% was “not that small”, whereas \(2.8\)% was “small enough”. How small does a \(p-\)value need to be before we consider that we have “compelling evidence” against \(H_0\)?
There is no easy answer to this question. It depends on many factors, including what penalties we might pay for being wrong.
Typically, we look at the probability of making a type I error, \(\alpha=P(\text{reject }H_0\mid H_0\; \text{ is true}):\)
if \(p-\)value \(\le \alpha\), then we reject \(H_0\) in favour of \(H_1\);
if \(p-\)value \(> \alpha\), then there is not enough evidence to reject \(H_0\) (which is not the same as accepting \(H_0\)!).
By convention, we often use \(\alpha=0.01\) or \(\alpha=0.05\).
The use of \(p\)-values has come under fire recently, as many view them as the root cause of the current replication crisis.^{45}
In this twitter thread, K. Carr describes why there is nothing wrong with \(p-\)values per se:
Don’t know what a \(p-\)VALUE is?
Don’t know why \(p-\)VALUES work?
Don’t know why sometimes \(p-\)VALUES don’t work?
THIS IS THE THREAD FOR YOU!
DEFINITION OF A \(p-\)VALUE: Assume your theory is false. The \(p-\)VALUE is the probability of getting an outcome as extreme or even more extreme than what you got in your experiment.
THE LOGIC OF THE \(p-\)VALUE: Assume my theory is false. The probability of getting extreme results should be very small but I got an extreme result in my experiment. Therefore, I conclude that this is strong evidence that my theory is true. That’s the logic of the p-value.
THE \(p-\)VALUE IS REASONABLE IN THEORY BUT TRICKY IN PRACTICE: In my opinion, the p-value is just a mathematical version of the way humans think. If we see something that seems unlikely given our beliefs, we often doubt those beliefs. In practice, the p-value can be tricky to use.
THE \(p-\)VALUE REQUIRES A GOOD DEFINITION OF WHEN YOUR THEORY IS FALSE: There are usually an infinite number of ways to define a world where your theory is false. \(p-\)values often fail when people use overly simplistic mathematical models of the processes that created their data. If the mismatch between their mathematical models of the world and the actual world is too large then the probabilities we compute can become completely disconnected from reality.
THE \(p-\)VALUE MAY REQUIRE AN ACCURATE MODEL OF YOU (THE OBSERVER): The probability of getting the result you got depends on many things. If you sometimes do things like throw out data or repeat measurements then you’re part of the system. Your behavior affects the probability of getting your experimental results. Therefore, to be completely realistic, you need to have an ACCURATE model of your own behavior when you gather and analyze data. This is hard and a big part of why the p-value often fails as a tool.
BY DEFINITION, \(p-\)VALUES MUST SOMETIMES BE WRONG: When using p-values, we’re working off of probabilities. By the logic of the p-value itself, even with perfect use, some of your decisions will be wrong. You have to embrace this if you’re going to use p-values.

Badly defining what it means for your model to be false, inaccurately modeling the chances of getting your data (including your own behavior), and not treating the p-value as a decision rule that can sometimes be wrong: these factors all contribute to misuse of the p-value in practice. Hope this cleared some things up for you.
Thanks for coming to my p-value TED talk!
4.4.1 Hypothesis Testing in General
A hypothesis is a conjecture concerning the value of a population parameter.
Hypothesis testing requires two competing hypotheses:
a null hypothesis, denoted by \(H_0\);
an alternative hypothesis, denoted by \(H_1\) or \(H_A\).
The hypothesis is tested by evaluating experimental evidence:
if the evidence against \(H_0\) is strong enough, we reject \(H_0\) in favour of \(H_1\), and we say that the evidence against \(H_0\) in favour of \(H_1\) is significant;
if the evidence against \(H_0\) is not strong enough, then we fail to reject \(H_0\) and we say that the evidence against \(H_0\) is not significant.
In cases when we fail to reject \(H_0\), we do NOT instead accept \(H_0\); we simply do not have enough evidence to reject \(H_0\).
From a philosophical perspective, the hypotheses should be formulated prior to the experiment or the study. The experiment or study is then conducted to evaluate the evidence against the null hypothesis – in order to avoid data snooping, it is crucial that we do not formulate \(H_1\) after looking at the data.
Scientific hypotheses can often be expressed in terms of whether an effect is found in the data. In this case, we might use the following null hypothesis: \[H_0: \mbox{there is no effect}\] against the alternative hypothesis: \[H_1: \mbox{there is an effect}.\]
Errors in Hypothesis Testing
Two types of errors can be committed when testing \(H_0\) against \(H_1\):
if we reject \(H_0\) when \(H_0\) was in fact true, we have committed a type I error;
if we fail to reject \(H_0\) when \(H_0\) was in fact false, we have committed a type II error.
Reality | Decision: reject \(H_0\) | Decision: fail to reject \(H_0\) |
---|---|---|
\(H_0\) is True | Type I Error | No Error |
\(H_0\) is False | No Error | Type II Error |
Examples:
If we conclude that a drug treatment is useful for treating a particular disease, but this is not the case in reality, then we have committed an error of type I.
If we cannot conclude that a drug treatment is useful for treating a particular disease, but in reality the treatment is effective, then we have committed an error of type II.
What type of error is worst? It depends on numerous factors.
Power of a Test
The probability of committing a type I error is usually denoted by \[\alpha =P(\text{reject }H_0\mid H_0\; \text{ is true});\] that of committing a type II error by \[\beta =P(\text{ fail to reject }H_0\mid H_0\; \text{ is false}),\] and that of correctly rejecting \(H_0\) by \[\text{power} =P(\text{reject }H_0\mid H_0\; \text{ is false})=1-\beta.\] Conventional values of \(\alpha\) and \(\beta\) are usually \(0.05\) and \(0.2\), respectively, although that is not a hard and fast rule.
Types of Null and Alternative Hypotheses
Let \(\mu\) be the population parameter of interest; hypotheses are usually expressed in terms of the values of this parameter (although we could also be testing for other parameters).
The null hypothesis is a simple hypothesis of the form: \[H_0: \mu=\mu_0,\] where \(\mu_0\) is some candidate value (“simple” means that the parameter is assumed to take on a single value).
The alternative hypothesis \(H_1\) is a composite hypothesis, i.e. it contains more than one candidate value.
Depending on the context, hypothesis testing takes on one of the following three forms: \[H_0: \mu=\mu_0, ~~ \mbox{ where $\mu_0$ is a number},\] against a:
two-sided alternative: \(H_1: \mu\neq \mu_0;\)
left-sided alternative: \(H_1: \mu < \mu_0, \text{ or}\)
right-sided alternative: \(H_1: \mu > \mu_0.\)
The formulation of the alternative hypothesis depends on the research hypothesis and is determined prior to experiment or study.
Example: investigators often want to verify if new experimental conditions lead to a change in population parameters.
For instance, an investigator claims that the use of a new type of soil will produce taller plants on average compared to the use of traditional soil. The mean plant height under the use of traditional soil is \(20\) cm.
Formulate the hypotheses to be tested.
Suppose another investigator suspects the opposite, that is, that the mean plant height when using the new soil will be smaller than the mean plant height with the old soil. What hypotheses should be formulated?
A third investigator believes that there will be an effect, but is not sure whether the effect will be to produce shorter or taller plants. What hypotheses should be formulated then?
Answer: let \(\mu\) represent the mean plant height with the new type of soil. In all three cases, the null hypothesis is \(H_0: \mu=20\).
The alternative hypothesis depends on the situation:
\(H_1: \mu>20\).
\(H_1: \mu<20\).
\(H_1: \mu\neq 20\).
For each \(H_1\), the corresponding \(p-\)values would be computed differently when testing \(H_0\) against \(H_1\).
4.4.2 Test Statistics and Critical Regions
To test a statistical hypothesis, we use a test statistic. A test statistic is a function of the random sample and the population parameter of interest.
In general, we reject \(H_0\) if the value of the test statistic is in the critical region or rejection area for the test; the critical region is an interval of real numbers.
The critical region is obtained using the definition of errors in hypothesis testing – we select the critical region so that \[\alpha =P(\text{reject }H_0\mid H_0\; \text{ is true})\] is equal to some pre-determined value, such as \(0.05\) or \(0.01\).
Examples: a new curing process developed for a certain type of cement results in a mean compressive strength of \(5000\) kg/cm\(^2\), with a standard deviation of \(120\) kg/cm\(^2\).
We test the hypothesis \(H_0: \mu = 5000\) against the alternative \(H_1: \mu < 5000\) with a random sample of \(49\) pieces of cement. Assume that the critical region in this specific instance is \(\overline{X} < 4970\), that is, we would reject \(H_0\) if \(\overline{X}<4970\).
Find the probability of committing a type I error when \(H_0\) is true.
Answer: by definition, we have \[\begin{aligned} \alpha&=P(\text{{type I error}})=P(\text{reject } H_0\mid H_0\; \text{ is true})\\&=P(\overline{X}<4970\mid \mu=5000).\end{aligned}\] Thus, according to the CLT, we have \[\begin{aligned} \alpha&\approx P\left(\frac{ \overline{X}-\mu}{\sigma/\sqrt{n}}<\frac{4970-5000}{120/7}\right)\\& \approx P(Z<-1.75)\approx 0.0401\,.\end{aligned}\]
The sampling distribution of \(\overline{X}\) under \(H_0\) is shown in red in the graph above (and those below): it is a normal distribution with mean \(=5000\) and standard deviation \(=120/7\). The sampling distribution of \(\overline{X}\) under \(H_1\) appears in blue: here, a normal distribution with mean \(=4990\) and standard deviation \(=120/7\).
The critical region falls to the left of the vertical black line \(\overline{X}<4970\), and the probability of committing a type I error is the area shaded in pale red, below: \[\begin{aligned} \alpha&=P(\text{reject } H_0\mid H_0\; \text{ is true})\\&=P(\overline{X}<4970\mid \mu=5000).\end{aligned}\]
We would thus reject \(H_0\) if the observed value of \(\overline{X}\) falls to the left of \(\overline{X}=4970\) (in the critical region).
Evaluate the probability of committing a type II error if \(\mu\) is actually \(4990\), say (and not \(5000\), as assumed in \(H_0\)).
Answer: by definition, we have \[\begin{aligned} \beta&=P(\text{{type II error}})=P(\text{fail to reject } H_0\mid H_0\; \text{ is false})\\ &=P(\overline{X}>4970\mid \mu=4990).\end{aligned}\] Thus, according to the CLT, we have \[\begin{aligned} \beta&= P( \overline{X} > 4970\mid \mu=4990)=P\left(\frac{ \overline{X}-\mu}{\sigma/\sqrt{n}}>\frac{4970-4990}{120/7}\right)\\& \approx P(Z>-1.17)=1-P(Z<-1.17)\approx 0.879\,.\end{aligned}\] We fail to reject \(H_0\) when \(\overline{X}\) falls to the right of the vertical black line; the probability of committing a type II error is the area shaded in pale blue: \[\begin{aligned} \beta&=P(\text{fail to reject } H_0\mid H_0\; \text{ is false})\\&=P(\overline{X}>4970\mid \mu=4990).\end{aligned}\]
We would thus fail to reject \(H_0\) if the observed value of \(\overline{X}\) falls to the right of \(\overline{X}=4970\) (outside the critical region).
The power of the test is easily computed as \[\begin{aligned} \text{power} &=P(\text{reject }H_0\mid H_0\; \text{ is false})\\& =P(\overline{X}<4970)=1-\beta\approx 0.121,\end{aligned}\] the area shaded in grey below.
Evaluate the probability of committing a type II error if \(\mu\) is actually \(4950\), say (and not \(5000\), as in \(H_0\)).
Answer: by definition, we have \[\begin{aligned} \beta&=P(\text{{type II error}})\\& =P(\text{fail to reject } H_0\mid H_0\; \text{ is false})\\ &=P(\overline{X}>4970\mid \mu=4950).\end{aligned}\] Thus, according to the CLT, we have \[\begin{aligned} \beta& =P\left(\frac{ \overline{X}-\mu}{\sigma/\sqrt{n}}>\frac{4970-4950}{120/7}\right)\\& \approx P(Z>1.17)\approx 0.121\,.\end{aligned}\] We fail to reject \(H_0\) when \(\overline{X}\) falls to the right of the vertical black line; the probability of committing a type II error is the area shaded in pale blue: \[\begin{aligned} \beta&=P(\text{fail to reject } H_0\mid H_0\; \text{ is false})\\&=P(\overline{X}>4970\mid \mu=4950).\end{aligned}\]
We would thus fail to reject \(H_0\) if the observed value of \(\overline{X}\) falls to the right of \(\overline{X}=4970\) (outside the critical region).
The probability of making a type II error is substantially larger in the first case, which means that the threshold \(\overline{X}=4970\) is not ideal in that situation.
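The error probabilities in this example can be reproduced with a short Python sketch (variable names are ours), using the standard library’s `NormalDist`:

```python
from statistics import NormalDist

mu0, sigma, n = 5000, 120, 49
se = sigma / n**0.5      # standard error of the mean, 120/7
cutoff = 4970            # reject H0 when the observed mean falls below this

Z = NormalDist()
alpha = Z.cdf((cutoff - mu0) / se)           # P(reject H0 | mu = 5000), ~0.040
beta_4990 = 1 - Z.cdf((cutoff - 4990) / se)  # P(fail to reject | mu = 4990), ~0.878
beta_4950 = 1 - Z.cdf((cutoff - 4950) / se)  # P(fail to reject | mu = 4950), ~0.122
power_4990 = 1 - beta_4990                   # ~0.122
```

The small differences from the text’s \(0.879\) and \(0.121\) come from rounding the \(z-\)score to two decimals there.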
4.4.3 Test for a Mean
Suppose \(X_1,\ldots,X_n\) is a random sample from a population with mean \(\mu\) and variance \(\sigma^2\), and let \(\overline{X}=\frac1n\sum_{i=1}^n X_i\) denote the sample mean.
We have seen that:
if the population is normal, then \(\overline{X} \stackrel{\text{exact}} \sim\mathcal{N}(\mu,\sigma^2/n)\,;\)
if the population is not normal, then as long as \(n\) is large enough, \(\overline{X}\stackrel{\text{approx}}\sim\mathcal{N}(\mu,\sigma^2/n)\).
In this section, we start by assuming that the population variance \(\sigma^2\) is known, and that the hypothesis concerns the unknown population mean \(\mu\).
Explanation: Left-Sided Alternative
Consider the unknown population mean \(\mu\). Suppose that we would like to test \[H_0: \mu=\mu_0 ~~~ \mbox{ against } ~~~ H_1: \mu<\mu_0,\] where \(\mu_0\) is some candidate value for \(\mu\).
To evaluate the evidence against \(H_0\), we compare \(\overline{X}\) to \(\mu_0\). Under \(H_0\), \[Z_0=\frac{ \overline{X}-\mu_0}{\sigma/\sqrt{n}}\stackrel{\text{approx}}{\sim} \mathcal{N}(0,1).\] We say that \(z_0=\frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}}\) is the observed value of the \(Z-\)test statistic \(Z_0\).
If \(z_0<0\), we have evidence that \(\mu<\mu_0\). However, we only reject \(H_0\) in favour of \(H_1\) if the evidence is significant, which is to say, if \[z_0\leq -z_\alpha, ~\text{at a level of significance }\alpha.\] The corresponding \(p-\)value for this test is the probability of observing evidence that is as (or more) extreme than our current evidence in favour of \(H_1\), assuming that \(H_0\) is true (that is, simply by chance).^{46}
The decision rule for the left-sided test is thus
if the \(p-\)value \(\leq \alpha\), we reject \(H_0\) in favour of \(H_1\);
if the \(p-\)value \(>\alpha\), we fail to reject \(H_0\).
Formally, the left-sided test pits \[H_0: \mu=\mu_0 ~\mbox{ against }~ H_1: \mu<\mu_0;\] at significance \(\alpha\), if \(z_0=\frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}} \leq -z_{\alpha}\), we reject \(H_0\) in favour of \(H_1\), as below.
An equivalent right-sided test pits \[H_0: \mu=\mu_0 ~\mbox{ against }~ H_1: \mu>\mu_0;\] at significance \(\alpha\), if \(z_0=\frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}} \geq z_{\alpha}\), we reject \(H_0\) in favour of \(H_1\), as below.
The two-sided test pits \[H_0: \mu=\mu_0 ~\mbox{ against }~ H_1: \mu\neq \mu_0;\] at significance \(\alpha\), if \(|z_0|=\left|\frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}}\right| \geq z_{\alpha/2}\), we reject \(H_0\) in favour of \(H_1\).
The procedure to test for \(H_0:\mu=\mu_0\) requires 6 steps.
Step 1: set \(H_0:\mu=\mu_0\).
Step 2: select an alternative hypothesis \(H_1\) (what we are trying to show using the data). Depending on the context, we choose one of these alternatives:
\(H_1:\mu<\mu_0\) (one-sided test);
\(H_1:\mu>\mu_0\) (one-sided test);
\(H_1:\mu\not=\mu_0\) (two-sided test).
Step 3: choose \(\alpha=P(\text{{type I error}})\), typically \(\alpha\in\{0.01, 0.05\}\).
Step 4: for the observed sample \(\{x_1,\ldots,x_n\}\), compute the observed value of the test statistics \(z_0=\frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}}\).
Step 5: determine the critical region according to:
Alternative Hypothesis | Critical Region |
---|---|
\(H_1:\mu>\mu_0\) | \(z_0>z_\alpha\) |
\(H_1:\mu<\mu_0\) | \(z_0<-z_\alpha\) |
\(H_1:\mu\neq \mu_0\) | \(|z_0|>z_{\alpha/2}\) |
where \(z_{\alpha}\) is the critical value satisfying \(P(Z>z_{\alpha})=\alpha\,,\) for \(Z\sim\mathcal{N}(0,1)\). The critical values are displayed below for convenience.
\(\alpha\) | \(z_{\alpha}\) | \(z_{\alpha/2}\) |
---|---|---|
\(0.05\) | \(1.645\) | \(1.960\) |
\(0.01\) | \(2.327\) | \(2.576\) |
Step 6: compute the associated \(p-\)value according to:
Alt. Hypothesis | Critical Region |
---|---|
\(H_1:\mu>\mu_0\) | \(P(Z>z_0)\) |
\(H_1:\mu<\mu_0\) | \(P(Z<z_0)\) |
\(H_1:\mu\neq \mu_0\) | \(2\cdot \min \{P(Z>z_0),P(Z<z_0)\}\) |
Decision Rule: as above,
if the \(p-\)value \(\leq \alpha\), reject \(H_0\) in favour of \(H_1\);
if the \(p-\)value \(>\alpha\), fail to reject \(H_0\).
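The six-step procedure can be wrapped in a small helper; in this Python sketch, the function `z_test` and its `alternative` labels are our own conventions:

```python
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Observed Z statistic and p-value for H0: mu = mu0, sigma known.

    alternative: "less" (H1: mu < mu0), "greater" (H1: mu > mu0),
    or "two-sided" (H1: mu != mu0).
    """
    z0 = (xbar - mu0) / (sigma / n**0.5)
    Z = NormalDist()
    if alternative == "greater":
        p = 1 - Z.cdf(z0)
    elif alternative == "less":
        p = Z.cdf(z0)
    else:
        p = 2 * min(Z.cdf(z0), 1 - Z.cdf(z0))
    return z0, p

# e.g. 9 students averaging 55 when the class mean is 60 (sigma = 10):
z0, p = z_test(55, 60, 10, 9, alternative="less")  # z0 = -1.5, p ~ 0.067
```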
A few examples will clarify the procedure.
Examples:
Components are manufactured to have strength normally distributed with mean \(\mu=40\) units and standard deviation \(\sigma=1.2\) units. The manufacturing process has been modified, and an increase in mean strength is claimed (the standard deviation remains the same).
A random sample of \(n=12\) components produced using the modified process had the following strengths:
42.5, 39.8, 40.3, 43.1, 39.6, 41.0,
39.9, 42.1, 40.7, 41.6, 42.1, 40.8.
Does the data provide strong evidence that the mean strength now exceeds 40 units? Use \(\alpha=0.05\).
Answer: we follow the outlined procedure to test for \(H_0: \mu=40\) against \(H_1: \mu>40\).
The observed value of the sample mean is \(\overline{x}=41.125\). Hence, \[\begin{aligned}\text{$p-$value}&=P(\overline{X}\ge \overline{x})=P(\overline{X}\ge 41.125)\\&=P\left(\frac{\overline{X}-\mu_0}{\sigma/\sqrt{n}}\ge\frac{41.125-\mu_0}{\sigma/\sqrt{n}}\right)\\&= P(Z\ge 3.25)\approx 0.0006.\end{aligned}\] As the \(p-\)value is smaller than \(\alpha\), we reject \(H_0\) in favour of \(H_1\).
Another way to see this is that if the model ‘\(\mu=40\)’ is true, then it is very unlikely that we would observe the event \(\{\overline{X}\ge 41.125\}\) entirely by chance, and so the manufacturing process likely has an effect in the claimed direction.
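As a check, the sample mean, test statistic, and \(p-\)value can be computed directly from the listed data (Python sketch, standard library only):

```python
from statistics import NormalDist, mean

strengths = [42.5, 39.8, 40.3, 43.1, 39.6, 41.0,
             39.9, 42.1, 40.7, 41.6, 42.1, 40.8]
mu0, sigma = 40, 1.2
n = len(strengths)

xbar = mean(strengths)                # 41.125
z0 = (xbar - mu0) / (sigma / n**0.5)  # ~3.25
p_value = 1 - NormalDist().cdf(z0)    # right-tailed, ~0.0006
```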
A set of scales works properly if the measurements differ from the true weight by a normally distributed random error term with standard deviation \(\sigma=0.007\) grams. Researchers suspect that the scale is systematically adding to the weights.
To test this hypothesis, \(n=10\) measurements are made on a \(1.0\)g “gold-standard” weight, giving a set of measurements which average out to \(1.0038\)g. Does this provide evidence that the scale adds to the measurement weights? Use \(\alpha=0.05\) and \(0.01\).
Answer: let \(\mu\) be the weight that the scale would record in the absence of random error terms. We test for \(H_0: \mu=1.0\) against \(H_1: \mu>1.0\).
The observed test statistic is \(z_0=\frac{1.0038-1.0}{0.007/\sqrt{10}}\approx 1.7167\). Since \[z_{0.05}=1.645<z_0=1.7167\leq z_{0.01}=2.327,\] we reject \(H_0\) for \(\alpha=0.05\), but we fail to reject \(H_0\) for \(\alpha=0.01\).
Case closed. Right?
In the previous example, assume that we are interested in whether the scale works properly, which means that the investigators think there might be some systematic misreading, but they are not sure in which direction the misreading would occur. Does the sample data provide evidence that the scale is systematically biased? Use \(\alpha=0.05\) and \(0.01\).
Answer: let \(\mu\) be as in the previous example. We test for \(H_0: \mu=1.0\) against \(H_1: \mu\neq 1.0\).
The test statistic is still \(z_0=1.7167\); since \(|z_0|\leq z_{\alpha/2}\) for both \(\alpha=0.05\) and \(\alpha=0.01\), we fail to reject \(H_0\) at either \(\alpha=0.05\) or \(\alpha=0.01\).
Thus, our “reading” of the test statistic depends on what type of alternative hypothesis we have selected (and so, on the overall context).
The marks for an “average” class are normally distributed with mean \(60\) and variance \(100\). Nine students are selected from the class; their average mark is \(55\). Is this subgroup “below average”?
Answer: let \(\mu\) be the true mean of the subgroup. We are testing for \(H_0: \mu=60\) against \(H_1: \mu<60\).
The observed sample test statistic is \[z_0=\frac{55-60}{10/\sqrt{9}}=-1.5.\] The corresponding \(p-\)value is \[P(\overline{X}\le 55)=P(Z\le -1.5)=0.07.\] Thus there is not enough evidence to reject the claim that the subgroup is ‘average’, regardless of whether we use \(\alpha=0.05\) or \(\alpha=0.01\).
We consider the same set-up as in the previous example, but this time the sample size is \(n=100\), not \(9\). Is there some evidence to suggest that this subgroup of students is ‘below average’?
Answer: let \(\mu\) be as before. We are still testing for \(H_0: \mu=60\) against \(H_1: \mu<60\), but this time the observed sample test statistic is \[z_0=\frac{55-60}{10/\sqrt{100}}=-5.\] The corresponding \(p-\)value is \[P(\overline{X}\le 55)=P(Z\le -5)\approx 0.00.\] Thus we reject the claim that the subgroup is ‘average’, regardless of whether we use \(\alpha=0.05\) or \(\alpha=0.01\).
The lesson from the last example is that the sample size plays a role; in general, an estimate obtained from a larger (representative) sample is more likely to be generalizable to the population as a whole.
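The effect of the sample size can be seen directly by recomputing the left-tailed \(p-\)value for both values of \(n\) (a small Python sketch):

```python
from statistics import NormalDist

Z = NormalDist()
p_values = {}
for n in (9, 100):
    z0 = (55 - 60) / (10 / n**0.5)   # observed Z statistic
    p_values[n] = Z.cdf(z0)          # left-tailed p-value

# p_values[9] ~ 0.067 (fail to reject), p_values[100] ~ 3e-7 (reject)
```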
Tests and Confidence Intervals
It is becoming more and more common for analysts to bypass the computation of the \(p-\)value altogether, in favour of a confidence interval based approach.^{47}
For a given \(\alpha\), we reject \(H_0:\mu=\mu_0\) in favour of \(H_1:\mu\not=\mu_0\) if, and only if, \(\mu_0\) is not in the \(100(1-\alpha)\%\) C.I. for \(\mu\).
Example: A manufacturer claims that a particular type of engine uses \(20\) gallons of fuel to operate for one hour. It is known from previous studies that this amount is normally distributed with variance \(\sigma^2=25\) and mean \(\mu\).
A sample of size \(n=9\) has been taken and the following value has been observed for the mean amount of fuel per hour: \(\overline{x}=23\). Should we accept the manufacturer’s claim? Use \(\alpha=0.05\).
Answer: we test for \(H_0: \mu=20\) against \(H_1: \mu\not=20\). The observed sample test statistic is \[z_0=\frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{23-20}{5/\sqrt{9}}=1.8.\] For a \(2-\)sided test with \(\alpha=0.05\), the critical value is \(z_{0.025}=1.96\). Since \(|z_0|\leq z_{0.025}\), \(z_0\) is not in the critical region, and we do not reject \(H_0\).
The advantage of the confidence interval approach is that it allows analysts to test for various claims simultaneously. Since we know the variance of the underlying population, an approximate \(100(1-\alpha)\)% C.I. for \(\mu\) is given by \[\overline{x}\pm z_{\alpha/2}\sigma/\sqrt{n}=23\pm 1.96\cdot 5/\sqrt{9}=(19.73, 26.27).\] Based on the data, we would thus not reject the claim that \(\mu=20\), \(\mu=19.74\), \(\mu=26.20\), etc.
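A sketch of the confidence-interval approach in Python (the helper `reject` is ours):

```python
from statistics import NormalDist

xbar, sigma, n, alpha = 23, 5, 9, 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, ~1.96
half = z * sigma / n**0.5                 # half-width of the interval
lo, hi = xbar - half, xbar + half         # ~(19.73, 26.27)

def reject(mu0):
    """Reject H0: mu = mu0 at level alpha iff mu0 lies outside the C.I."""
    return not (lo <= mu0 <= hi)
```

Here `reject(20)` returns `False`, matching the test above; any candidate value outside the interval would be rejected.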
Test for a Mean with Unknown Variance
If the data is normal and \(\sigma\) is unknown, we can estimate it via the sample variance \[S^2=\frac1{n-1}\sum_{i=1}^n \left( X_i-\overline{X} \right)^2.\] As we have seen for confidence intervals, the test statistic \[T=\frac{\overline{X}-\mu}{S/\sqrt n}\sim t(n-1)\] follows a Student’s \(t-\)distribution with \(n-1\) df.
We can follow the same steps as for the test with known variance, with the modified critical regions and \(p-\)values:
Alternative Hypothesis | Critical Region |
---|---|
\(H_1:\mu>\mu_0\) | \(t_0>t_\alpha(n-1)\) |
\(H_1:\mu<\mu_0\) | \(t_0<-t_\alpha(n-1)\) |
\(H_1:\mu\neq \mu_0\) | \(|t_0|>t_{\alpha/2}(n-1)\) |
where \[t_0=\frac{\overline{x}-\mu_0}{S/\sqrt{n}}\] and \(t_{\alpha}(n-1)\) is the \(t-\)value satisfying \[P(T>t_{\alpha}(n-1))=\alpha\,\] for \(T\sim t(n-1)\), and
Alt. Hypothesis | \(p-\)Value |
---|---|
\(H_1:\mu>\mu_0\) | \(P(T>t_0)\) |
\(H_1:\mu<\mu_0\) | \(P(T<t_0)\) |
\(H_1:\mu\neq \mu_0\) | \(2\cdot \min \{P(T>t_0),P(T<t_0)\}\) |
Let’s consider an example.
Example: consider the following observations, taken from a normal population with unknown mean \(\mu\) and variance:
18.0, 17.4, 15.5, 16.8, 19.0, 17.8,
17.4, 15.8, 17.9, 16.3, 16.9, 18.6,
17.7, 16.4, 18.2, 18.7.
Conduct a right-side hypothesis test for \(H_0:\mu=16.6\) against \(H_1:\mu>16.6\), using \(\alpha=0.05\).
Answer: the sample size, sample mean, and sample standard deviation are \(n=16\), \(\overline{x}=17.4\), and \(S\approx 1.039\), respectively.

Since the variance \(\sigma^2\) is unknown, the observed sample test statistic of interest is \[t_0=\frac{\overline{x}-\mu_0}{S/\sqrt{n}}=\frac{17.4-16.6}{1.039/4}\approx 3.08,\] and the corresponding \(p-\)value is \[\text{$p-$value }=P(\overline{X}\ge 17.4)= P(T>3.08),\] where \(T\sim t(n-1)=t(15)\).

From the \(t-\)tables (or by using the R function `qt()`), we see that \[P\left( T(15)\geq 2.947 \right)\approx 0.005, \qquad P\left( T(15)\geq 3.286 \right)\approx 0.0025.\] The \(p-\)value thus lies in the interval \((0.0025,0.005)\); in particular, the \(p-\)value is \(\leq 0.05\), which is strong evidence against \(H_0\colon\mu=16.6\).
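The observed statistic and \(p-\)value can be verified directly from the data; since the Python standard library has no Student \(t\) c.d.f., the sketch below integrates the \(t\) density numerically (the helper `t_tail` is our own):

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_tail(t0, df, steps=200_000, upper=60.0):
    """P(T > t0) for T ~ t(df), via midpoint integration of the t density."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    h = (upper - t0) / steps
    total = sum((1 + (t0 + (i + 0.5) * h) ** 2 / df) ** (-(df + 1) / 2)
                for i in range(steps))
    return c * total * h

data = [18.0, 17.4, 15.5, 16.8, 19.0, 17.8, 17.4, 15.8,
        17.9, 16.3, 16.9, 18.6, 17.7, 16.4, 18.2, 18.7]
n, mu0 = len(data), 16.6

t0 = (mean(data) - mu0) / (stdev(data) / sqrt(n))  # observed t statistic
p_value = t_tail(t0, n - 1)                        # right-tailed p-value
```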
4.4.4 Test for a Proportion
The principle for proportions is much the same, as the next example shows.
Example: a group of \(100\) adult American Catholics were asked the following question: “Do you favour allowing women into the priesthood?” \(60\) of the respondents independently answered ‘Yes’; is the evidence strong enough to conclude that more than half of American Catholics favour allowing women to be priests?
Answer: let \(X\) be the number of people who answered ‘Yes’. We assume that \(X\sim \mathcal{B} (100,p)\), where \(p\) is the true proportion of American Catholics who favour allowing women to be priests.
We thus test for \(H_0:p=0.5\) against \(H_1: p>0.5\). Under \(H_0\), \(X\sim \mathcal{B}(100,0.5)\).
The \(p-\)value that corresponds to the observed sample is \[\begin{aligned}P(X\ge 60)&= 1-P(X<60)=1-P(X\le 59)\\ &\approx 1-P\left(\frac{X{+0.5}-np}{\sqrt{np(1-p)}}\le \frac{59{+0.5}-50}{\sqrt{25}}\right)\\ &\approx 1-P(Z\le 1.9)=0.0287, \end{aligned}\] where the \(+0.5\) comes from the correction to the normal approximation of the binomial distribution (see Section 3.3.6 for details).
Thus, we would reject \(H_0\) at \(\alpha=0.05\), but not at \(\alpha=0.01\).
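Both the exact binomial tail and its continuity-corrected normal approximation are easy to compute (Python sketch, standard library only):

```python
from math import comb
from statistics import NormalDist

n, p0, x = 100, 0.5, 60

# exact binomial tail P(X >= 60) under H0: p = 0.5
exact = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x, n + 1))

# normal approximation with continuity correction:
# P(X >= 60) = 1 - P(X <= 59) ~ 1 - Phi((59.5 - 50)/5)
mu, sd = n * p0, (n * p0 * (1 - p0)) ** 0.5
approx = 1 - NormalDist().cdf((x - 0.5 - mu) / sd)
# exact ~ 0.0284, approx ~ 0.0287
```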
4.4.5 Two-Sample Tests
Up to this point, we have only tested hypotheses about populations by evaluating the evidence provided by a single sample of observations.
Two-sample tests allow analysts to compare two (potentially distinct) populations.
Paired Test
Let \(X_{1,1},\ldots,X_{1,n}\) be a random sample from a normal population with unknown mean \(\mu_1\), and let \(X_{2,1},\ldots,X_{2,n}\) be a random sample from a normal population with unknown mean \(\mu_2\), both populations having the same unknown variance \(\sigma^2\). The two samples need not be independent of one another; indeed, they may arise from the same population, or represent two different measurements on the same units. We would like to test for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1\not=\mu_2\).
In order to do so, we compute the differences \(D_i=X_{1,i}-X_{2,i}\) and consider the \(t-\)test (as we do not know the variance). The test statistic is \[T_0=\frac{\overline{D}}{S_D/\sqrt{n}}\sim t(n-1),\] where \[\overline{D}=\frac{1}{n}\sum_{i=1}^{n}D_i \mbox{ and } S_D^2=\frac{1}{n-1}\sum_{i=1}^{n}(D_{i}-\overline{D})^2.\]
Example: the knowledge of basic statistical concepts of \(n=10\) engineers was measured on a scale from \(0\) to \(100\), before and after a short course in statistical quality control. The results are as follows:
Engineer | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) |
---|---|---|---|---|---|
Before \(X_{1,i}\) | \(43\) | \(82\) | \(77\) | \(39\) | \(51\) |
After \(X_{2,i}\) | \(51\) | \(84\) | \(74\) | \(48\) | \(53\) |
Engineer | \(6\) | \(7\) | \(8\) | \(9\) | \(10\) |
---|---|---|---|---|---|
Before \(X_{1,i}\) | \(66\) | \(55\) | \(61\) | \(79\) | \(43\) |
After \(X_{2,i}\) | \(61\) | \(59\) | \(75\) | \(82\) | \(48\) |
Let \(\mu_1\) and \(\mu_2\) be the mean score before and after the course, respectively. Assuming the underlying scores are normally distributed, test for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1<\mu_2\).
Answer: The differences \(D_i=X_{1,i}-X_{2,i}\) are:
Engineer | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) |
---|---|---|---|---|---|
Before \(X_{1,i}\) | \(43\) | \(82\) | \(77\) | \(39\) | \(51\) |
After \(X_{2,i}\) | \(51\) | \(84\) | \(74\) | \(48\) | \(53\) |
Difference \(D_i\) | \(-8\) | \(-2\) | \(3\) | \(-9\) | \(-2\) |
Engineer | \(6\) | \(7\) | \(8\) | \(9\) | \(10\) |
---|---|---|---|---|---|
Before \(X_{1,i}\) | \(66\) | \(55\) | \(61\) | \(79\) | \(43\) |
After \(X_{2,i}\) | \(61\) | \(59\) | \(75\) | \(82\) | \(48\) |
Difference \(D_i\) | \(5\) | \(-4\) | \(-14\) | \(-3\) | \(-5\) |
The observed sample mean is \(\overline{d}=-3.9\), and the observed sample variance is \(s_D^2=31.21\).
The test statistic is: \[T_0=\frac{\overline{D}-0}{S_D/\sqrt{n}}\sim t(n-1),\] with observed value: \[t_0=\frac{-3.9}{\sqrt{31.21/10}}\approx {-2.21}.\] We compute \[\begin{aligned} P(\overline{D}\le -3.9)&= P(T(9)\le -2.21)=P(T(9)>2.21).\end{aligned}\]
But \(t_{0.05}(9)=1.833<|t_0|=2.21<t_{0.01}(9)=2.821\), so the \(p-\)value lies in \((0.01,0.05)\): we reject \(H_0\) at \(\alpha=0.05\), but not at \(\alpha=0.01\).
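The same conclusion can be obtained with R's `t.test()` function; a sketch using the scores from the table (`paired=TRUE` computes the differences internally):

```r
before <- c(43, 82, 77, 39, 51, 66, 55, 61, 79, 43)
after  <- c(51, 84, 74, 48, 53, 61, 59, 75, 82, 48)

# paired t-test of H0: mu_1 = mu_2 against H1: mu_1 < mu_2
res <- t.test(before, after, paired = TRUE, alternative = "less")
res$statistic   # observed t0, about -2.21
res$p.value     # between 0.01 and 0.05
```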
Unpaired Test
Let \(X_{1},\ldots,X_{n}\) be a random sample from a normal population with unknown mean \(\mu_1\) and variance \(\sigma_1^2\); let \(Y_{1},\ldots,Y_{m}\) be a random sample from a normal population with unknown mean \(\mu_2\) and variance \(\sigma_2^2\), with the two populations independent of one another.
We want to test for \[H_0:\mu_1=\mu_2 ~~\mbox{against}~~ H_1:\mu_1\not=\mu_2.\] Let \(\overline{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i}~,~~ \overline{Y}=\frac{1}{m}\sum_{i=1}^{m}Y_{i}\,\,\). As always, the observed values are denoted by lower case letters: \(\overline{x}\), \(\overline{y}\).
\(\sigma_1^2\) and \(\sigma_2^2\) Known
We can follow the same steps as for the earlier test, with some modifications:
Alternative Hypothesis | Critical Region |
---|---|
\(H_1:\mu_1>\mu_2\) | \(z_0>z_\alpha\) |
\(H_1:\mu_1<\mu_2\) | \(z_0<-z_\alpha\) |
\(H_1:\mu_1\neq \mu_2\) | \(|z_0|>z_{\alpha/2}\) |
where \[z_0=\frac{\overline{x}-\overline{y}}{\sqrt{\sigma_1^2/n+\sigma_2^2/m}}\,,\] and \(z_{\alpha}\) satisfies \(P(Z>z_{\alpha})=\alpha\,,\) for \(Z\sim \mathcal{N}(0,1)\).
Alt Hypothesis | \(p-\)Value |
---|---|
\(H_1:\mu_1>\mu_2\) | \(P(Z>z_0)\) |
\(H_1:\mu_1<\mu_2\) | \(P(Z<z_0)\) |
\(H_1:\mu_1\neq \mu_2\) | \(2\cdot \min \{P(Z>z_0),P(Z<z_0)\}\) |
Let us consider an example.
Example: a sample of \(n=100\) Albertans yields a sample mean income of \(\overline{X}=\$33{,}000\). A sample of \(m=80\) Ontarians yields \(\overline{Y}=\$32{,}000\). From previous studies, it is known that the population income standard deviations are, respectively, \(\sigma_1=\$5000\) in Alberta and \(\sigma_2=\$2000\) in Ontario. Do Albertans earn more than Ontarians, on average?
Answer: we test for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1>\mu_2\). The observed difference is \(\overline{X}-\overline{Y}=1000\); the observed test statistic is \[z_0=\frac{\overline{X}-\overline{Y}}{\sqrt{\sigma_1^2/n+\sigma_2^2/m}}=\frac{1000}{\sqrt{5000^2/100+2000^2/80}}=1.82;\] the corresponding \(p-\)value is \[P\left(\overline{X}-\overline{Y}>1000\right)= P(Z>1.82)=0.035,\] and so we reject \(H_0\) when \(\alpha=0.05\), but not when \(\alpha=0.01\).
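A quick R check of this computation (the sample means and population standard deviations are taken from the example):

```r
n <- 100; m <- 80
xbar <- 33000; ybar <- 32000
sigma1 <- 5000; sigma2 <- 2000

# observed z-statistic and one-sided p-value for H1: mu_1 > mu_2
z0 <- (xbar - ybar) / sqrt(sigma1^2 / n + sigma2^2 / m)  # about 1.83
p.value <- 1 - pnorm(z0)                                 # about 0.034
```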
\(\sigma_1^2\) and \(\sigma_2^2\) Unknown, with Small Samples
In this case, the modifications are:
Alternative Hypothesis | Critical Region |
---|---|
\(H_1:\mu_1>\mu_2\) | \(t_0>t_\alpha(n+m-2)\) |
\(H_1:\mu_1<\mu_2\) | \(t_0<-t_\alpha(n+m-2)\) |
\(H_1:\mu_1\neq \mu_2\) | \(|t_0|>t_{\alpha/2}(n+m-2)\) |
where \[t_0=\frac{\overline{X}-\overline{Y}}{\sqrt{S_p^2/n+S_p^2/m}}~~\text{and}~~S_p^2=\frac{(n-1)S_1^2+(m-1)S_2^2}{n+m-2},\] \(t_{\alpha}(n+m-2)\) satisfies \(P(T>t_{\alpha}(n+m-2))=\alpha\,,\) and \(T\sim t(n+m-2)\).
Alt Hypothesis | \(p-\)Value |
---|---|
\(H_1:\mu_1>\mu_2\) | \(P(T>t_0)\) |
\(H_1:\mu_1<\mu_2\) | \(P(T<t_0)\) |
\(H_1:\mu_1\neq \mu_2\) | \(2\cdot \min \{P(T>t_0),P(T<t_0)\}\) |
Yet again, an example.
Example: a researcher wants to test whether, on average, a new fertilizer yields taller plants. Plants were divided into two groups: a control group treated with an old fertilizer and a study group treated with the new fertilizer. The following data are obtained:
Sample Size | Sample Mean | Sample Variance |
---|---|---|
\(n=8\) | \(\overline{X}=43.14\) | \(S_1^2=71.65\) |
\(m=8\) | \(\overline{Y}=47.79\) | \(S_2^2=52.66\) |
Test for \(H_0:\mu_1=\mu_2\) vs. \(H_1:\mu_1<\mu_2\).
Answer: the observed difference is \(\overline{X}-\overline{Y}=-4.65\) and the pooled sample variance is \[\begin{aligned} S_p^2&=\frac{(n-1)S_1^2+(m-1)S_2^2}{n+m-2}\\&=\frac{7(71.65)+7(52.66)}{8+8-2}=62.155=7.88^2.\end{aligned}\] The observed test statistic is \[t_0=\frac{\overline{X}-\overline{Y}}{\sqrt{S_p^2/n+S_p^2/m}}=\frac{-4.65}{7.88\sqrt{1/8+1/8}}=-1.18;\] the corresponding \(p-\)value is \[\begin{aligned} P\left(\overline{X}-\overline{Y}<-4.65\right)&= P(T(14)<-1.18)\\&=P(T(14)>1.18) \in (0.1,0.25)\end{aligned}\] (according to the table), and we do not reject \(H_0\) at either \(\alpha=0.05\) or \(\alpha=0.01\).
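A quick R check of the pooled computation, using the summary statistics from the table:

```r
n <- 8; m <- 8
xbar <- 43.14; ybar <- 47.79
s1.sq <- 71.65; s2.sq <- 52.66

# pooled variance, observed t-statistic, and one-sided p-value
sp.sq <- ((n - 1) * s1.sq + (m - 1) * s2.sq) / (n + m - 2)  # about 62.16
t0 <- (xbar - ybar) / sqrt(sp.sq / n + sp.sq / m)           # about -1.18
p.value <- pt(t0, n + m - 2)                                # in (0.1, 0.25)
```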
\(\sigma_1^2\) and \(\sigma_2^2\) Unknown, with Large Samples
In this case, the modifications are:
Alternative Hypothesis | Critical Region |
---|---|
\(H_1:\mu_1>\mu_2\) | \(z_0>z_\alpha\) |
\(H_1:\mu_1<\mu_2\) | \(z_0<-z_\alpha\) |
\(H_1:\mu_1\neq \mu_2\) | \(|z_0|>z_{\alpha/2}\) |
where \[z_0=\frac{\overline{X}-\overline{Y}}{\sqrt{S_1^2/n+S_2^2/m}},\] and \(z_{\alpha}\) satisfies \(P(Z>z_{\alpha})=\alpha\,,\) for \(Z\sim \mathcal{N}(0,1)\).
Alt Hypothesis | \(p-\)Value |
---|---|
\(H_1:\mu_1>\mu_2\) | \(P(Z>z_0)\) |
\(H_1:\mu_1<\mu_2\) | \(P(Z<z_0)\) |
\(H_1:\mu_1\neq \mu_2\) | \(2\cdot \min \{P(Z>z_0),P(Z<z_0)\}\) |
A last example is shown below.
Example: consider the same set-up as in the previous example, but with larger sample sizes: \(n=m=100\). Now test for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1<\mu_2\).
Answer: the observed difference is (still) \(-4.65\). The observed test statistic is \[z_0=\frac{\overline{X}-\overline{Y}}{\sqrt{S_1^2/n+S_2^2/m}}=\frac{-4.65}{\sqrt{71.65/100+52.66/100}}=-4.17;\] the corresponding \(p-\)value is \[P\left(\overline{X}-\overline{Y}<-4.65\right)= P(Z<-4.17)\approx 0.0000,\] and we reject \(H_0\) at both \(\alpha=0.05\) and \(\alpha=0.01\).
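A quick check in R, reusing the summary statistics from the example:

```r
n <- 100; m <- 100
diff <- -4.65
s1.sq <- 71.65; s2.sq <- 52.66

# observed z-statistic and one-sided p-value for H1: mu_1 < mu_2
z0 <- diff / sqrt(s1.sq / n + s2.sq / m)  # about -4.17
p.value <- pnorm(z0)                      # essentially 0
```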
4.4.6 Difference of Two Proportions
As always, we can transfer these tests to proportions, using the normal approximation to the binomial distribution.
For instance, to test for \(H_0:p_1=p_2\) against \(H_1:p_1\not=p_2\) in samples of size \(n_1\), \(n_2\), respectively, we use the observed sample difference of proportions \[z_0=\frac{\hat p_1-\hat p_2-0}{\sqrt{\hat p(1-\hat p)}\sqrt{1/n_1+1/n_2}},\] where \(\hat p\) is the pooled proportion \[\hat p=\frac{n_1}{n_1+n_2}\hat p_1+\frac{n_2}{n_1+n_2}\hat p_2,\] and the \(p-\)value is \(2\cdot \min \{P(Z>z_0),P(Z<z_0)\}\).
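As an illustration, here is a sketch of the computation with hypothetical counts (60 successes out of \(n_1=100\) and 45 out of \(n_2=90\); these numbers are not from the text):

```r
x1 <- 60; n1 <- 100   # hypothetical counts, for illustration only
x2 <- 45; n2 <- 90
p1.hat <- x1 / n1; p2.hat <- x2 / n2
p.hat  <- (x1 + x2) / (n1 + n2)   # pooled proportion

# observed z-statistic and two-sided p-value
z0 <- (p1.hat - p2.hat) / (sqrt(p.hat * (1 - p.hat)) * sqrt(1 / n1 + 1 / n2))
p.value <- 2 * min(pnorm(z0), 1 - pnorm(z0))
```

With these counts, \(z_0\approx 1.38\) and the \(p-\)value is about \(0.17\), so we would not reject \(H_0:p_1=p_2\) at \(\alpha=0.05\).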
4.4.7 Hypothesis Testing with R
There are built-in functions in R that allow for hypothesis testing. For instance:

`t.test(x, mu=mu.0)` tests for \(H_0:\mu=\mu_0\) against \(H_1:\mu\not=\mu_0\) when \(\sigma\) is unknown (\(2-\)sided \(t-\)test);

`t.test(x, mu=mu.0, alternative="greater")` tests for \(H_0:\mu=\mu_0\) against \(H_1:\mu>\mu_0\) when \(\sigma\) is unknown (right-sided \(t-\)test);

`t.test(x, mu=mu.0, alternative="less")` tests for \(H_0:\mu=\mu_0\) against \(H_1:\mu<\mu_0\) when \(\sigma\) is unknown (left-sided \(t-\)test);

`t.test(x, y, var.equal=TRUE)` tests for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1\neq \mu_2\) in the case of two independent samples, when the variances are unknown but equal;

`t.test(x, y, var.equal=TRUE, alternative="greater")` tests for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1>\mu_2\) in the case of two independent samples, when the variances are unknown but equal;

`t.test(x, y, var.equal=TRUE, alternative="less")` tests for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1<\mu_2\) in the case of two independent samples, when the variances are unknown but equal.
For all these tests, we reject the null hypothesis \(H_0\) at significance level \(\alpha\) if the \(p-\)value of the test is below \(\alpha\) (which means that the probability of wrongly rejecting \(H_0\) when \(H_0\) is in fact true is below \(\alpha\), usually taken to be \(0.05\) or \(0.01\)).
If the \(p-\)value of the test is greater than the significance level \(\alpha\), then we fail to reject the null hypothesis \(H_0\) at significance level \(\alpha\).^{48}
Note that the \(p-\)value for the test will appear in the output, but it can also be computed directly using the appropriate formula. The corresponding \(95\)% confidence intervals also appear in the output.
Artificial Examples
Let’s say that we have a small dataset with \(n=7\) observations:
Let \(\mu_X\) be the true mean of whatever distribution the sample came from. Is it conceivable that \(\mu_X=5\)?
Solution: we can test for \(H_0: \mu_X=5\) against \(H_1:\mu_X\neq 5\) simply by calling `t.test(x, mu=5)`:
```
	One Sample t-test

data:  x
t = -1.4412, df = 6, p-value = 0.1996
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 3.843764 5.299093
sample estimates:
mean of x 
 4.571429 
```
All the important information is in the output: the observed value \(t_0=-1.4412\) of the test statistic, which follows a Student \(T-\)distribution with \(n-1=6\) degrees of freedom under \(H_0\); the \(p-\)value \(=0.1996\); and the \(95\)% confidence interval \((3.843764, 5.299093)\) for \(\mu_X\), whose point estimate is \(\overline{x}=4.571429\).
Since the \(p-\)value is greater than \(\alpha=0.05\), we fail to reject the null hypothesis that \(\mu_X=5\); there is not enough evidence in the data to categorically state that \(\mu_X\neq 5\).
(Is it problematic that the sample size \(n=7\) is small?)
Let’s say that now we have a small dataset with \(n=9\) observations:
Let \(\mu_Y\) be the true mean of whatever distribution the sample came from. Is it conceivable that \(\mu_Y=5\)?
Solution: we can test for \(H_0: \mu_Y=5\) against \(H_1:\mu_Y\neq 5\) simply by calling `t.test(y, mu=5)`:
```
	One Sample t-test

data:  y
t = -6.7823, df = 8, p-value = 0.0001403
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 1.575551 3.313338
sample estimates:
mean of x 
 2.444444 
```
The \(p-\)value is \(0.0001403\), which is substantially smaller than \(\alpha=0.05\), and we reject the null hypothesis that the true mean is \(5\). The test provides no information about what the true mean could be, but the \(95\)% confidence interval \((1.575551, 3.313338)\) does: we would expect \(\mu_Y\approx 2.5\).
Is it conceivable that \(\mu_Y=2.5\)?
Solution: let’s run `t.test(y, mu=2.5)`:
```
	One Sample t-test

data:  y
t = -0.14744, df = 8, p-value = 0.8864
alternative hypothesis: true mean is not equal to 2.5
95 percent confidence interval:
 1.575551 3.313338
sample estimates:
mean of x 
 2.444444 
```
A large \(p-\)value, well above \(\alpha=0.05\)… we would like to say that this clinches it: we accept the null hypothesis! But no, we cannot do that. All we can say, sadly, is that we do not have enough evidence to reject the null hypothesis; that is, we cannot reject the hypothesis that \(\mu_Y=2.5\).
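The duality between confidence intervals and two-sided tests can be checked directly in R; a minimal sketch with a simulated sample (not the dataset above; the variable names are illustrative):

```r
# a 95% confidence interval contains exactly the values of mu.0 that a
# two-sided t-test fails to reject at alpha = 0.05
set.seed(1)                         # for reproducibility
y <- rnorm(9, mean = 2.5, sd = 1)   # simulated sample of size n = 9

ci <- t.test(y)$conf.int            # 95% confidence interval for the mean
p.inside  <- t.test(y, mu = mean(ci))$p.value    # mu.0 at the centre of the CI
p.outside <- t.test(y, mu = ci[2] + 1)$p.value   # mu.0 well outside the CI
```

The centre of the interval is the sample mean, for which the \(p-\)value is \(1\); any value outside the interval is rejected at \(\alpha=0.05\).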
Teaching Dataset
Suppose that a researcher wants to determine if, as she believes, a new teaching method enables students to understand elementary statistical concepts better than the traditional lectures given in a university setting (based on [38]).
She recruits \(N=80\) second-year students to test her claim. The students are randomly assigned to one of two groups:
students in group \(A\) are given the traditional lectures,
whereas students in group \(B\) are taught using the new teaching method.
After three weeks, a short quiz is administered to the students in order to assess their understanding of statistical concepts.
The results are found in the teaching.csv
dataset.
ID | Group | Grade |
---|---|---|
1 | B | 75.5 |
2 | B | 77.5 |
3 | A | 73.5 |
4 | A | 75.0 |
5 | B | 77.0 |
6 | A | 79.0 |
7 | B | 79.0 |
8 | B | 75.5 |
9 | B | 80.0 |
10 | B | 79.5 |
11 | B | 81.5 |
12 | B | 81.0 |
13 | B | 77.5 |
14 | B | 74.0 |
15 | A | 73.0 |
16 | A | 76.5 |
17 | B | 80.0 |
18 | A | 73.5 |
19 | B | 79.5 |
20 | A | 74.5 |
21 | B | 81.5 |
22 | A | 70.0 |
23 | B | 78.5 |
24 | A | 74.5 |
25 | B | 81.5 |
26 | B | 79.0 |
27 | B | 81.5 |
28 | A | 76.0 |
29 | B | 78.0 |
30 | B | 79.0 |
31 | B | 77.0 |
32 | B | 74.0 |
33 | B | 79.5 |
34 | A | 77.5 |
35 | A | 68.5 |
36 | A | 76.0 |
37 | A | 71.5 |
38 | A | 78.5 |
39 | B | 82.0 |
40 | B | 76.5 |
41 | A | 77.5 |
42 | A | 78.5 |
43 | A | 78.5 |
44 | B | 77.5 |
45 | B | 76.5 |
46 | B | 83.0 |
47 | A | 76.0 |
48 | A | 74.0 |
49 | B | 81.5 |
50 | A | 75.5 |
51 | A | 75.0 |
52 | B | 80.5 |
53 | A | 75.5 |
54 | B | 78.5 |
55 | A | 74.5 |
56 | A | 72.5 |
57 | A | 72.5 |
58 | B | 80.5 |
59 | A | 76.0 |
60 | A | 76.0 |
61 | A | 78.0 |
62 | A | 75.0 |
63 | B | 84.0 |
64 | A | 78.0 |
65 | B | 81.0 |
66 | A | 77.0 |
67 | B | 78.0 |
68 | A | 77.0 |
69 | A | 75.0 |
70 | A | 70.5 |
71 | B | 78.0 |
72 | B | 78.5 |
73 | B | 76.0 |
74 | B | 80.0 |
75 | A | 74.0 |
76 | A | 79.5 |
77 | A | 74.5 |
78 | A | 71.0 |
79 | B | 81.0 |
80 | A | 76.0 |
Is there enough evidence to suggest that the new teaching method is more effective (as measured by test performance)?
Solution: we can summarize the results (sample size, sample mean, sample variance) as follows:
```r
library(dplyr)

counts.by.group = aggregate(x = teaching$Grade,
                            by = list(teaching$Group),
                            FUN = length)
means.by.group = aggregate(x = teaching$Grade,
                           by = list(teaching$Group),
                           FUN = mean)
variances.by.group = aggregate(x = teaching$Grade,
                               by = list(teaching$Group),
                               FUN = var)

teaching.summary <- counts.by.group |>
  full_join(means.by.group, by = "Group.1") |>
  full_join(variances.by.group, by = "Group.1")

colnames(teaching.summary) <- c("Group",
                                "Sample Size",
                                "Sample Mean",
                                "Sample Variance")
```
Group | Sample Size | Sample Mean | Sample Variance |
---|---|---|---|
A | 40 | 75.125 | 6.650641 |
B | 40 | 79.000 | 5.538462 |
If the researcher assumes that both groups have similar background knowledge prior to being taught (which she attempts to enforce by randomising the group assignment), then the effectiveness of the two teaching methods may be compared using two hypotheses: the null hypothesis \(H_0\) and the alternative \(H_{1}\).
Let \(\mu_i\) represent the true performance of method \(i\).
Since the researcher wants to claim that the new method is more effective than the traditional ones, it is most appropriate for her to use one-sided hypothesis testing with \[H_{0}: \mu_{A} \geq \mu_{B} \quad\mbox{against}\quad H_{1}: \mu_{A} < \mu_{B}.\]
The testing procedure is simple:
calculate an appropriate test statistic under \(H_0\);
reject \(H_0\) in favour of \(H_1\) if the test statistic falls in the critical region (also called the rejection region) of an associated distribution, and
fail to reject \(H_0\) otherwise.
In this case, she wants to use a two-sample \(t-\)test. Assuming that the variability in the two groups is roughly the same, the test statistic is given by:
\[ t_{0}=\frac{\overline{y}_{B}-\overline{y}_{A}}{S_{p}\sqrt{\frac{1}{N_{A}}+\frac{1}{N_{B}}}}, \]
where the pooled variance \(S^{2}_{p}\) is
\[ S^{2}_{p}=\frac{(N_{A}-1)S^{2}_{A}+(N_{B}-1)S^{2}_{B}}{N_{A}+N_{B}-2}. \]
With her data, she obtains the \(t-\)statistic as follows. First, she identifies the number of observations in each group:
```
[1] 40
[1] 40
[1] 80
```
Then, she computes the sample mean score in each group:
```
[1] 75.125
[1] 79
```
She computes the sample variance of the scores in each group:
```
[1] 6.650641
[1] 5.538462
```
She finally computes the sample pooled variance of scores:
```
[1] 6.094551
```
From which she obtains the \(t-\)statistic:
```
[1] 7.019656
```
The test statistic value is \(t_{0} = 7.02\).
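Her computation can be condensed into a few lines of R, using the group summaries from the table above:

```r
n.A <- 40; n.B <- 40
mean.A <- 75.125;   mean.B <- 79.000
var.A  <- 6.650641; var.B  <- 5.538462

# pooled variance and two-sample t-statistic
sp.sq <- ((n.A - 1) * var.A + (n.B - 1) * var.B) / (n.A + n.B - 2)
t0 <- (mean.B - mean.A) / (sqrt(sp.sq) * sqrt(1 / n.A + 1 / n.B))  # about 7.02
```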
In order to reject or fail to reject the null hypothesis, she needs to compare it against the critical value of the Student \(T\) distribution with \(N-2=78\) degrees of freedom at significance level \(\alpha=0.05\), say.
She sets the significance level to \(\alpha=0.05\). Be careful with the `qt()` function: the call `qt(alpha, N-2)` “looks” right, but it gives a critical value on the wrong side of the distribution’s mean:

```
[1] -1.664625
```

The call `qt(1-alpha, N-2)`, however, gives the correct critical value:

```
[1] 1.664625
```
The appropriate critical value is \[t^*= t_{1-\alpha, N-2}=t_{0.95, 78}=1.665.\]
Since \(t_{0} > t^*\) at \(\alpha=0.05\), she happily rejects the null hypothesis \(H_0:\mu_A\geq \mu_B\), which is to say that she has enough evidence to support the claim that the new teaching method is more effective than the traditional methods, at \(\alpha=0.05\).