## 4.2 Descriptive Statistics

As its name implies, **descriptive statistics** aim to describe the data; examples include:

- sample size (overall and/or by subgroup);
- demographic breakdowns of participants;
- measures of central tendency (e.g., mean, median, mode);
- measures of variability (e.g., sample variance, minimum, maximum, interquartile range);
- higher distribution moments (skew, kurtosis, etc.);
- non-parametric measures (various quantiles);
- derived measures (correlation coefficients), etc.

They can be presented as a single number, in a summary table, or even in graphical representations (e.g., histogram, pie chart, etc.).

### 4.2.1 Data Descriptions

Studies and experiments give rise to **statistical units**. These units are typically described with **variables** (and measurements), which are either **qualitative** (categorical) or **quantitative** (numerical).

Categorical variables take values (**levels**) from a finite set of pre-determined **categories** (or classes); numerical variables from a (potentially infinite) set of **quantities**.

**Examples:**

- Age is a numerical variable, measured in years, although it is often reported to the nearest integer year, or as an age range, in which case it is an **ordinal** variable (a mixture of qualitative and quantitative).
- Typical numerical variables include distance in m, volume in cm\(^3\), etc.
- Disease diagnosis is a categorical variable with (at least) 2 categories (positive/negative).
- Compliance with a standard is a categorical variable: there could be 2 levels (compliant/non-compliant) or more (compliance, minor non-compliance issues, major non-compliance issues).
- Count variables are numerical variables.

#### Numerical Summaries

In a first pass, a variable can be described along (at least) 2 dimensions: its **centrality** and its **spread** (the **skew** and the **kurtosis** are sometimes also used):

- **centrality** measures include the **median**, the **mean**, and, less frequently, the **mode**;
- spread (or **dispersion**) measures include the **standard deviation** (sd), the **quartiles**, the **inter-quartile range** (IQR), and, less frequently, the **range**.

The median, range, and quartiles are all easily calculated from an **ordered** list of the data.

#### Sample Median

The **median** \(\text{med}(x_1,\ldots,x_n)\) of a sample of size \(n\) is a numerical value which splits the ordered data into \(2\) equal subsets: half the observations fall **below** the median, and half **above** it:

- if \(n\) is **odd**, then the **position** of the median (or its **rank**) is \((n+1)/2\): the median observation is the \(\frac{n+1}{2}^{\text{th}}\) ordered observation;
- if \(n\) is **even**, then the median is the average of the \(\frac{n}{2}^{\text{th}}\) and the \((\frac{n}{2}+1)^{\text{th}}\) ordered observations.

The procedure is simple: order the data, and follow the even/odd rules **to the letter**.

**Examples:**

\(\text{med}(4,6,1,3,7)=\text{med}(1,3,4,6,7)=x_{(5+1)/2}=x_3=4\). There are \(2\) observations below \(4\) \(\{1,3\}\), and \(2\) observations above \(4\) \(\{6,7\}\).

\(\text{med}(1,3,4,6,7,23)=\frac{x_{6/2}+x_{6/2+1}}{2}=\frac{x_3+x_4}{2}=\frac{4+6}{2}=5\). There are \(3\) observations below \(5\) \(\{1,3,4\}\), and \(3\) observations above \(5\) \(\{6,7,23\}\).

\(\text{med}(1,3,3,6,7)=x_{(5+1)/2}=x_3=3\). There seems to be only \(1\) observation below \(3\) \(\{1\}\), but \(2\) observations above \(3\) \(\{6,7\}\).

Note that there is ambiguity in the definition of the median: **above** and **below** should be interpreted as **after** and **before**, respectively, inclusive of the median value. In the last example above, for instance, there are \(2\) observations (\(x_1=1,x_2=3\)) before the median observation (\(x_3=3\)), and \(2\) after the median (\(x_4=6,x_5=7\)).
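The odd/even rule can be written out in a few lines of `R`. This is only a minimal sketch: the function name `sample_median()` is ours (hypothetical), and in practice the built-in `median()` does the same job.

```r
# Sample median following the odd/even rule stated above.
# `sample_median` is a hypothetical helper name, not a base R function.
sample_median <- function(x) {
  y <- sort(x)                        # order the data first
  n <- length(y)
  if (n %% 2 == 1) {
    y[(n + 1) / 2]                    # odd n: the middle ordered observation
  } else {
    (y[n / 2] + y[n / 2 + 1]) / 2     # even n: average the two middle observations
  }
}

m1 <- sample_median(c(4, 6, 1, 3, 7))       # 4
m2 <- sample_median(c(1, 3, 4, 6, 7, 23))   # 5
m3 <- sample_median(c(1, 3, 3, 6, 7))       # 3
```

The three calls reproduce the three examples above; ties (as in the third sample) need no special treatment because the rule works on ranks, not on values.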

#### Sample Mean

The **mean** of a sample is simply the arithmetic average of its observations. For observations \(x_1, \ldots, x_n\), the sample mean is \[\begin{aligned} \text{AM}(x_1,\ldots,x_n)=\overline{x}= \frac{x_1+ \cdots+ x_n}{n} = \frac{1}{n}\left(\sum^{n}_{i=1} x_i\right)\end{aligned}\]
Other means exist, such as the **harmonic** mean and the **geometric** mean: \[\begin{aligned} \text{HM}(x_1,\ldots,x_n)&=\frac{n}{\frac{1}{x_1}+\cdots+\frac{1}{x_n}} \\ \text{GM}(x_1,\ldots,x_n)&=\sqrt[n]{x_1\cdots x_n}. \end{aligned}\]

All of these measures attempt to find an “average” of the observations.

**Examples:**

\(\text{AM}(4,6,1,3,7)=\frac{4+6+1+3+7}{5}=\frac{21}{5}=4.2 \approx 4= \text{med}(4,6,1,3,7)\).

\(\text{AM}(1,3,4,6,7,23)=\frac{1+3+4+6+7+23}{6}=\frac{44}{6}\approx 7.3\), which is not nearly as close to \(\text{med}(1,3,4,6,7,23)=5\).

\(\text{HM}(4,6,1,3,7)=\frac{5}{\frac{1}{4}+\frac{1}{6}+\frac{1}{1}+\frac{1}{3}+\frac{1}{7}}=\frac{5}{53/28}=\frac{140}{53}\approx 2.64\).

\(\text{GM}(4,6,1,3,7)=\sqrt[5]{4\cdot 6\cdot 1\cdot 3\cdot 7}= \sqrt[5]{504}\approx 3.47\).

It can be shown that if \(x=(x_1,\ldots,x_n)\) and \(x_i>0\) for all \(i\), then \[\min(x)\leq\text{HM}(x)\leq \text{GM}(x)\leq \text{AM}(x)\leq \max(x).\] There is no need to decide on a single centrality measure when reporting on the data; in practice, we may use as many of them as we want to.
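As a quick numerical check of the chain of inequalities, the three means can be computed directly in `R` for the sample used in the examples. The variable names below are ours; the logarithmic form of the geometric mean is an equivalent rewriting that we assume is preferable for long vectors, where the raw product may overflow.

```r
x <- c(4, 6, 1, 3, 7)                  # all observations must be positive

am <- mean(x)                          # arithmetic mean
hm <- length(x) / sum(1 / x)           # harmonic mean
gm <- prod(x)^(1 / length(x))          # geometric mean (n-th root of the product)
gm_log <- exp(mean(log(x)))            # same value, computed on the log scale

round(c(min = min(x), HM = hm, GM = gm, AM = am, max = max(x)), 2)
```

The rounded values (1, 2.64, 3.47, 4.2, 7) appear in increasing order, as expected.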

But there are situations where the mean (or the median) could prove to be the better choice. The use of the mean, for instance, is **theoretically supported** by the Central Limit Theorem.

When the data distribution is roughly **symmetric**, then the median and the mean will be near one another. If the data distribution is **skewed** then the mean is pulled toward the long tail and as a result gives a distorted view of the centre (see Figure 4.1).

Consequently, medians are generally used for house prices, incomes, etc., as the median is **robust** against outliers and incorrect readings (whereas the mean is not).
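A small illustration of this robustness, using made-up data: we take the sample \(\{1,3,4,6,7\}\) from the earlier examples and assume that the last reading was mistakenly recorded as \(70\) (a hypothetical typo).

```r
readings <- c(1, 3, 4, 6, 7)
miskeyed <- c(1, 3, 4, 6, 70)    # hypothetical typo: 70 recorded instead of 7

clean_stats <- c(mean = mean(readings), median = median(readings))   # 4.2 and 4
typo_stats  <- c(mean = mean(miskeyed), median = median(miskeyed))   # 16.8 and 4
```

A single incorrect reading moves the mean from \(4.2\) to \(16.8\), while the median stays at \(4\).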

#### Standard Deviation

While the mean, the median, and the mode provide an idea as to where some of the distribution’s “mass” is located, the **standard deviation** provides some notion of its spread. The higher the standard deviation, the further away from the mean the variable values are likely to fall (see Figure 4.2). We will have more to say on this topic.

#### Quantiles

Another way to provide information about the spread of the data is *via* **centiles**, **deciles**, and/or **quartiles**.

The **lower quartile** \(Q_1(x_1,\ldots,x_n)\) of a sample of size \(n\), or \(Q_1\), is a numerical value which splits the ordered data into \(2\) unequal subsets: \(25\)% of the observations fall below \(Q_1\) **and** \(75\)% of the observations fall above \(Q_1\).

Similarly, the **upper quartile** \(Q_3\) splits the ordered data into \(75\)% of the observations below \(Q_3\), **and** \(25\)% of the observations above \(Q_3\).

The median can be interpreted as the **middle quartile** \(Q_2\) of the sample, the minimum as \(Q_0\), and the maximum as \(Q_4\): the vector \((Q_0,Q_1,Q_2,Q_3,Q_4)\) is the **5-pt summary** of the data.

**Centiles** \(p_i\), \(i=0,\ldots, 100\) and **deciles** \(d_j\), \(j=0,\ldots, 10\) run through different splitting percentages \[p_{25}=Q_1, p_{75}=Q_3, d_5=Q_2,\ \text{etc.}\]

They are found as with the median: **sort** the sample observations \(\{x_1, x_2, \ldots, x_n\}\) in **increasing order** as \[y_1\leq y_2\leq\ldots\leq y_n.\] The smallest \(y_1\) has **rank** \(1\) and the largest \(y_n\) has **rank** \(n\).

Any value that falls between the observations of ranks:

- \(\lfloor\frac{n}{4}\rfloor\) and \(\lfloor\frac{n}{4}\rfloor+1\) is a **lower quartile** \(Q_1\);
- \(\lfloor\frac{3n}{4}\rfloor\) and \(\lfloor\frac{3n}{4}\rfloor+1\) is an **upper quartile** \(Q_3\);
- \(\lfloor\frac{in}{100}\rfloor\) and \(\lfloor\frac{in}{100}\rfloor+1\) is a **centile** \(p_i\), for \(i=1,\ldots,99\);
- \(\lfloor\frac{jn}{10}\rfloor\) and \(\lfloor\frac{jn}{10}\rfloor+1\) is a **decile** \(d_j\), for \(j=1,\ldots,9\).

In practice, we compute the **\(m-\)quantile of order \(k\)** for the data, where \(k=1,\ldots,m-1\) by averaging the observations of rank \[\left\lfloor\frac{kn}{m}\right\rfloor\quad\text{and}\quad \left\lfloor\frac{kn}{m}\right\rfloor+1\] (other protocols exist, such as using weighted averages).

**Examples:**

\(Q_1(1,3,4,6,7)=\frac{1}{2}\left(y_{\lfloor 5/4 \rfloor}+y_{\lfloor 5/4 \rfloor+1}\right)=\frac{1}{2}\left(y_{1}+y_{2}\right)=\frac{1}{2}(1+3)=2.\)

\(d_7(1,3,4,6,7,23)=\frac{1}{2}\left(y_{\lfloor 7(6)/10 \rfloor}+y_{\lfloor 7(6)/10 \rfloor+1}\right)=\frac{1}{2}\left(y_{4}+y_{5}\right)=\frac{1}{2}(6+7)=13/2.\)

\(Q_1(1,3,4,6,7,23)=\frac{1}{2}\left(y_{\lfloor 6/4 \rfloor}+y_{\lfloor 6/4 \rfloor+1}\right)=\frac{1}{2}\left(y_{1}+y_{2}\right)=\frac{1}{2}(1+3)=2.\)

\(Q_3(1,3,4,6,7,23)=\frac{1}{2}\left(y_{\lfloor 3(6)/4 \rfloor}+y_{\lfloor 3(6)/4 \rfloor+1}\right)=\frac{1}{2}\left(y_{4}+y_{5}\right)=\frac{1}{2}(6+7)=6.5.\)
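The averaging protocol used in these examples can be sketched as a short `R` function. The name `quantile_floor()` is ours (hypothetical), and the sketch assumes that the rank \(\lfloor kn/m\rfloor\) is at least \(1\) and less than \(n\), so that both ranks exist.

```r
# m-quantile of order k, averaging the observations of ranks
# floor(k*n/m) and floor(k*n/m) + 1 (the protocol described above).
# `quantile_floor` is a hypothetical helper name.
quantile_floor <- function(x, k, m) {
  y <- sort(x)
  n <- length(y)
  r <- floor(k * n / m)
  stopifnot(r >= 1, r < n)            # assumption: both ranks must exist
  (y[r] + y[r + 1]) / 2
}

q1_5 <- quantile_floor(c(1, 3, 4, 6, 7), k = 1, m = 4)        # 2
d7   <- quantile_floor(c(1, 3, 4, 6, 7, 23), k = 7, m = 10)   # 6.5
q1_6 <- quantile_floor(c(1, 3, 4, 6, 7, 23), k = 1, m = 4)    # 2
q3_6 <- quantile_floor(c(1, 3, 4, 6, 7, 23), k = 3, m = 4)    # 6.5
```

Note that the built-in `quantile()` function interpolates between observations by default, so its values will generally differ from those obtained with this protocol.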

Consider the following midterm grades:

```
grades<-c(
80,73,83,60,49,96,87,87,60,53,66,83,32,80,
66,90,72,55,76,46,48,69,45,48,77,52,59,97,
76,89,73,73,48,59,55,76,87,55,80,90,83,66,
80,97,80,55,94,73,49,32,76,57,42,94,80,90,
90,62,85,87,97,50,73,77,66,35,66,76,90,73,
80,70,73,94,59,52,81,90,55,73,76,90,46,66,
76,69,76,80,42,66,83,80,46,55,80,76,94,69,
57,55,66,46,87,83,49,82,93,47,59,68,65,66,
69,76,38,99,61,46,73,90,66,100,83,48,97,69,
62,80,66,55,28,83,59,48,61,87,72,46,94,48,
59,69,97,83,80,66,76,25,55,69,76,38,21,87,
52,90,62,73,73,89,25,94,27,66,66,76,90,83,
52,52,83,66,48,62,80,35,59,72,97,69,62,90,
48,83,55,58,66,100,82,78,62,73,55,84,83,66,
49,76,73,54,55,87,50,73,54,52,62,36,87,80,80
)
```

The quartiles and mean, as reported by `summary(grades)`, are:

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  21.00   55.00   70.00   68.74   82.50  100.00
```

#### Dispersion Measures

Some of the dispersion measures are fairly simple to compute: the **sample range** is
\[\text{range}(x_1,\ldots, x_n)=\max\{x_i\}-\min\{x_i\};\] the **inter-quartile range** is \(\text{IQR}=Q_3-Q_1\).

The **sample standard deviation** \(s\) and **sample variance** \(s^2\) are estimates of the underlying distribution’s \(\sigma\) and \(\sigma^2\). For observations \(x_1, \ldots, x_n\), \[s^2 = \frac{1}{n-1} \sum^{n}_{i=1} (x_i-\overline{x})^{2}=\frac{1}{n-1} \left(\sum^{n}_{i=1} x_{i}^{2} - \frac{1}{n}\left(\sum^{n}_{i=1} x_i\right)^{2}\right);\] it differs from the (population) standard deviation and the (population) variance in the denominator: \(n-1\) is used instead of \(n\).^{38}

**Examples:**

- The sample variance of \(\{1,3,4,6,7\}\) is \[\frac{1}{5-1}\left(\sum_{i=1}^5x_i^2-\frac{1}{5}\left(\sum_{i=1}^5x_i\right)^2\right)=\frac{1}{4}\left(111-\frac{1}{5}(21)^2\right)=5.7.\]
- The interquartile range of \(\{1,3,4,6,7,23\}\) is \[\text{IQR}(1,3,4,6,7,23)=Q_3(1,3,4,6,7,23)-Q_1(1,3,4,6,7,23)=6.5-2=4.5.\]
- We can provide more data descriptions of the `grades` dataset (see above) using `psych`’s `describe()` function.

```
   vars   n  mean    sd median trimmed   mad min max range  skew kurtosis  se
X1    1 211 68.74 17.37     70   69.43 19.27  21 100    79 -0.37    -0.46 1.2
```
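The two expressions for the sample variance can be compared numerically. The sketch below uses the sample \(\{1,3,4,6,7\}\) from the first example; the variable names are ours, and `var()` is the built-in function, which also uses the \(n-1\) denominator.

```r
x <- c(1, 3, 4, 6, 7)
n <- length(x)

s2_def     <- sum((x - mean(x))^2) / (n - 1)          # definition
s2_short   <- (sum(x^2) - sum(x)^2 / n) / (n - 1)     # shortcut formula
s2_builtin <- var(x)                                  # built-in, n - 1 denominator

s   <- sqrt(s2_def)                                   # sample standard deviation
rng <- max(x) - min(x)                                # sample range

c(s2_def, s2_short, s2_builtin)                       # all equal to 5.7
```

The shortcut formula is convenient by hand, but we would expect the definition (or `var()`) to behave better numerically when the observations are large relative to their spread.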

### 4.2.2 Outliers

An **outlier** is an observation that lies outside the overall pattern in a distribution.^{39}

Let \(x\) be an observation in the sample;^{40} it is a

- **suspected outlier** if \[x<Q_1-1.5\,\mbox{IQR} ~~ \mbox{ or } ~~ x>Q_3+1.5\,\mbox{IQR},\]
- **definite outlier** if \[x<Q_1-3\,\mbox{IQR} ~~ \mbox{ or } ~~ x>Q_3+3\,\mbox{IQR}.\]

**Example:** in the set \(\{1,3,4,6,7,23\}\), \(Q_1=2\), \(Q_3=6.5\), and \(\text{IQR}=4.5\). Thus \[\begin{align*}
Q_1-1.5\text{IQR}&=2-1.5(4.5)= -4.75 \\
Q_3+1.5\text{IQR}&=6.5+1.5(4.5)= 13.25 \\
Q_1-3\text{IQR}&=2-3(4.5)= -11.5 \\
Q_3+3\text{IQR}&=6.5+3(4.5)= 20.0 \\
\end{align*}\]

Since \(23>Q_3+3\text{IQR}\) (and thus \(23>Q_3+1.5\text{IQR}\)), 23 is a definite outlier (and therefore also a suspected outlier) of \(\{1,3,4,6,7,23\}\).
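The two rules are easy to apply programmatically. The following sketch takes the quartiles as inputs (here, the values \(Q_1=2\) and \(Q_3=6.5\) found in the example) rather than recomputing them; the function name `classify_outliers()` and the labels it returns are ours (hypothetical).

```r
# Label each observation as "definite", "suspected", or "none",
# using the 1.5 IQR and 3 IQR fences defined above.
# `classify_outliers` is a hypothetical helper name.
classify_outliers <- function(x, q1, q3) {
  iqr <- q3 - q1
  suspected <- x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr
  definite  <- x < q1 - 3 * iqr   | x > q3 + 3 * iqr
  ifelse(definite, "definite", ifelse(suspected, "suspected", "none"))
}

x <- c(1, 3, 4, 6, 7, 23)
labels <- classify_outliers(x, q1 = 2, q3 = 6.5)
setNames(labels, x)     # only 23 is flagged, as a definite outlier
```

Since every definite outlier is also a suspected outlier, the function reports only the stronger label.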

### 4.2.3 Visual Summaries

The **boxplot** (also known as the box-and-whisker plot) is a quick and
easy way to present a graphical summary of a univariate distribution:

- draw a box along the observation axis, with endpoints at the lower and upper quartiles \(Q_1\) (knees) and \(Q_3\) (shoulders), and with a “belt” at the median \(Q_2\);
- draw a line extending from \(Q_1\) to the smallest value closer than \(1.5\text{IQR}\) to the left of \(Q_1\);
- draw a line extending from \(Q_3\) to the largest value closer than \(1.5\text{IQR}\) to the right of \(Q_3\);
- any suspected outlier is plotted separately (see the boxplot figure).
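The ingredients of the boxplot can be computed by hand before drawing anything. The sketch below does so for \(\{1,3,4,6,7,23\}\), taking \(Q_1=2\), \(Q_2=5\), and \(Q_3=6.5\) from the earlier examples; the variable names are ours, and we assume that a value lying exactly on a fence is kept inside the whisker (it is not an outlier according to the strict inequalities of the previous section).

```r
x  <- c(1, 3, 4, 6, 7, 23)
q1 <- 2; q2 <- 5; q3 <- 6.5            # quartiles from the earlier examples
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr          # -4.75
upper_fence <- q3 + 1.5 * iqr          # 13.25

lower_whisker <- min(x[x >= lower_fence])                 # whisker runs from Q1 down to 1
upper_whisker <- max(x[x <= upper_fence])                 # whisker runs from Q3 up to 7
plotted_separately <- x[x < lower_fence | x > upper_fence]   # 23
```

The built-in `boxplot()` function follows the same general recipe, but it computes the ends of the box with its own convention, so the box it draws may differ slightly from the one obtained with the quartiles above.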

#### Skewness

For **symmetric** distributions, the median and mean are equal, and the quartiles \(Q_1\) and \(Q_3\) are equidistant from \(Q_2\). Otherwise:

- if \(Q_3-Q_2>Q_2-Q_1\), then the data distribution is **skewed to the right** (positively skewed);
- if \(Q_3-Q_2<Q_2-Q_1\), then the data distribution is **skewed to the left** (negatively skewed).

Graphically, if the distance between the shoulders and the belt is larger than the distance between the belt and the knees, then the data is skewed to the right; if it’s the opposite, the data is skewed to the left.
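The quartile comparison translates directly into code. In the sketch below, the function name `skew_from_quartiles()` and the label `"no skew indicated"` for the equidistant case are ours (hypothetical); the first call uses the quartiles \(Q_1=5\), \(Q_2=9\), \(Q_3=16\) of the car accident data discussed later in this section, and the second uses made-up quartiles.

```r
# Direction of skew suggested by the distances between quartiles.
# `skew_from_quartiles` is a hypothetical helper name.
skew_from_quartiles <- function(q1, q2, q3) {
  upper <- q3 - q2                     # shoulders-to-belt distance
  lower <- q2 - q1                     # belt-to-knees distance
  if (upper > lower) {
    "right"
  } else if (upper < lower) {
    "left"
  } else {
    "no skew indicated"
  }
}

s1 <- skew_from_quartiles(5, 9, 16)    # 7 > 4: "right"
s2 <- skew_from_quartiles(4, 11, 14)   # 3 < 7: "left" (made-up quartiles)
```

This rule only looks at the middle half of the data, so it can disagree with the impression given by the tails of a histogram.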

In the boxplots below, the data is skewed to the right.

#### Histograms

Visual information about the distribution of the sample can also be
provided *via* **histograms**.

A histogram for the sample \(\{x_1,\ldots,x_n\}\) is built according to the following specifications:

- the **range** of the histogram is \(r=\max\{x_i\} - \min\{x_i\}\);
- the **number of bins** should approach \(k=\sqrt{n}\), where \(n\) is the sample size;
- the **bin width** should approach \(r/k\), and
- the **frequency of observations** in each bin should be represented by the **bin height**.
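These specifications can be followed step by step in `R`. The sketch below uses the \(40\) daily car accident counts from the example in the next subsection; rounding \(\sqrt{n}\) to the nearest integer is our own choice (the specifications only ask that the number of bins "approach" \(\sqrt{n}\)), and `hist()` is called with `plot = FALSE` so that only the bin counts are computed.

```r
accidents <- c( 6,  3,  2, 24, 12,  3,  7, 14, 21,  9,
               14, 22, 15,  2, 17, 10,  7,  7, 31,  7,
               18,  6,  8,  2,  3,  2, 17,  7,  7, 21,
               13, 23,  1, 11,  3,  9,  4,  9,  9, 25)

n <- length(accidents)                     # 40
r <- max(accidents) - min(accidents)       # range: 31 - 1 = 30
k <- round(sqrt(n))                        # sqrt(40) is about 6.32, so 6 bins
width <- r / k                             # bin width: 5

breaks <- seq(min(accidents), max(accidents), by = width)   # 1, 6, 11, ..., 31
h <- hist(accidents, breaks = breaks, plot = FALSE)
h$counts                                   # bin frequencies: 12 13 5 5 4 1
```

With the default settings of `hist()`, each bin includes its right endpoint, and the first bin also includes its left endpoint. Replacing `plot = FALSE` by `plot = TRUE` draws the histogram.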

#### Shapes of Datasets

Boxplots and histograms provide an easy visual impression of the **shape of the data set**, which can eventually suggest a mathematical model for the situation of interest: another way to define skewness is to say that data is **skewed to the right** if the corresponding boxplot or histogram is stretched to the right, and *vice-versa*.

**Examples:**

- consider the daily number of car accidents in Sydney, Australia, over a 40-day period:

`6 3 2 24 12 3 7 14 21 9`

`14 22 15 2 17 10 7 7 31 7`

`18 6 8 2 3 2 17 7 7 21`

`13 23 1 11 3 9 4 9 9 25`

The sorted values are:

`1 2 2 2 2 3 3 3 3 4`

`6 6 7 7 7 7 7 7 8 9`

`9 9 9 10 11 12 13 14 14 15`

`17 17 18 21 21 22 23 24 25 31`

We can then easily see that

\[\begin{align*} \text{min}&=y_1=1,\quad Q_1=\frac{1}{2}(y_{10}+y_{11})=5,\quad \text{med}=\frac{1}{2}(y_{20}+y_{21})=9, \\ Q_3&=\frac{1}{2}(y_{30}+y_{31})=16,\quad \text{max}=y_{40}=31. \end{align*}\]

A corresponding histogram and boxplot are shown in Figure 4.5.

- A histogram and a boxplot can also be obtained for the `grades` dataset:

Here is a fancier version of the histogram, constructed with the `ggplot2` package (see Section 9.5 for details on the use of this `R` package).

```
fun.mode <- function(x){as.numeric(names(sort(-table(x)))[1])}  # function to find the mode
ggplot2::ggplot(data=data.frame(grades), ggplot2::aes(grades)) +
  ggplot2::geom_histogram(ggplot2::aes(y =..density..),   # approximated pdf
                          breaks=seq(20, 100, by = 10),   # 8 bins from 20 to 100
                          col="black",                    # colour of outline
                          fill="blue",                    # fill colour of bars
                          alpha=.2) +                     # transparency
  ggplot2::geom_density(col=2) +                          # colour of pdf curve
  ggplot2::geom_rug(ggplot2::aes(grades)) +               # adding a rug on x-axis
  ggplot2::geom_vline(ggplot2::aes(xintercept = mean(grades)),
                      col='red',size=2) +                 # vertical line: mean
  ggplot2::geom_vline(ggplot2::aes(xintercept = median(grades)),
                      col='darkblue',size=2) +            # vertical line: median
  ggplot2::geom_vline(ggplot2::aes(xintercept = fun.mode(grades)),
                      col='black',size=2)                 # vertical line: mode
```

What is the shape of this dataset? Does it look like the class is in trouble?

### 4.2.4 Coefficient of Correlation

For bivariate (or multivariate) datasets, we can still study each variable separately, as in the previous sections, but we might also be interested in determining how the variables relate to one another.

For instance, consider the following data, consisting of \(n=20\) paired measurements \((x_i,y_i)\) of hydrocarbon levels \(x\) and pure oxygen levels \(y\) in fuels:

```
x = c(
0.99,1.02,1.15,1.29,1.46,1.36,0.87,1.23,1.55,1.40,
1.19,1.15,0.98,1.01,1.11,1.20,1.26,1.32,1.43,0.95
)
y = c(
90.01,89.05,91.43,93.74,96.73,94.45,87.59,91.77,99.42,93.65,
93.54,92.52,90.56,89.54,89.85,90.39,93.25,93.41,94.98,87.33
)
cbind(x,y)
```

```
x y
[1,] 0.99 90.01
[2,] 1.02 89.05
[3,] 1.15 91.43
[4,] 1.29 93.74
[5,] 1.46 96.73
[6,] 1.36 94.45
[7,] 0.87 87.59
[8,] 1.23 91.77
[9,] 1.55 99.42
[10,] 1.40 93.65
[11,] 1.19 93.54
[12,] 1.15 92.52
[13,] 0.98 90.56
[14,] 1.01 89.54
[15,] 1.11 89.85
[16,] 1.20 90.39
[17,] 1.26 93.25
[18,] 1.32 93.41
[19,] 1.43 94.98
[20,] 0.95 87.33
```

Assume that we are interested in measuring the **strength of association** between \(x\) and \(y\).

We can use a graphical display to provide an initial description of the relationship: it appears that the observations lie around a **hidden line**.

For paired data \((x_i,y_i)\), \(i=1,\ldots,n\), the **sample correlation coefficient** of \(x\) and \(y\) is \[\rho_{XY} = \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum (x_i-\overline{x})^2\sum (y_i-\overline{y})^2}} =\frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}.\]

The coefficient \(\rho_{XY}\) is defined only if \(S_{xx}\neq 0\) and \(S_{yy}\neq 0\), i.e., if neither \(x_i\) nor \(y_i\) is constant.

The variables \(x\) and \(y\) are **uncorrelated** if \(\rho_{XY}=0\) (or is very small, in practice), and **correlated** if \(\rho_{XY} \neq 0\) (or if \(|\rho_{XY}|\) is “large”, in practice).

**Example:** for the data above, we have \[S_{xy}\approx 10.18,\ S_{xx}\approx 0.68,\ S_{yy}\approx 173.38,\] so that \[\rho_{XY} \approx \frac{10.18}{\sqrt{0.68\cdot 173.38}}\approx 0.94.\]

This can also be computed directly in `R`:

```
(Sxx = sum((x-mean(x))^2))
(Syy = sum((y-mean(y))^2))
(Sxy = sum((x-mean(x))*(y-mean(y))))
(rho = Sxy/sqrt(Sxx*Syy))
```

```
[1] 0.68088
[1] 173.3769
[1] 10.17744
[1] 0.9367154
```

or by using the `cor()` function (e.g., `cor(x,y)`):

`[1] 0.9367154`

#### Properties

- \(\rho_{XY}\) is unaffected by changes of scale or origin: adding constants to \(x\) does not change \(x-\overline x\) (similarly for \(y-\overline{y}\)), and multiplying \(x\) and \(y\) by positive constants changes both the numerator and denominator equally;
- \(\rho_{XY}\) is symmetric in \(x\) and \(y\) (i.e., \(\rho_{XY}=\rho_{YX}\)) and \(-1 \leq \rho_{XY} \leq 1\); if \(\rho_{XY}=\pm 1\), then the observations \((x_i, y_i)\) all lie on a straight line with a positive (or negative) slope;
- the sign of \(\rho_{XY}\) reflects the trend of the points;
- a high correlation coefficient value \(|\rho_{XY}|\) does not necessarily imply a **causal relationship** between the two variables;
- note that \(x\) and \(y\) can have a very strong **non-linear** relationship without \(\rho_{XY}\) reflecting it (\(-0.12\) on the left, \(0.93\) on the right in Figure 4.6).
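Some of these properties can be verified numerically on the hydrocarbon and oxygen data. In the sketch below, the constants used to shift and rescale the variables are arbitrary choices of ours, and the quadratic example \((u,v)\) is made-up data chosen so that a perfect non-linear relationship yields a correlation of zero.

```r
x <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
       1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
y <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
       93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)

rho         <- cor(x, y)                      # about 0.9367
rho_rescale <- cor(10 * x + 3, 0.5 * y - 7)   # change of origin and (positive) scale
rho_swapped <- cor(y, x)                      # symmetry
rho_flipped <- cor(-x, y)                     # a negative scale factor flips the sign

u <- -3:3                                     # made-up data: v is an exact
v <- u^2                                      # quadratic function of u ...
rho_quad <- cor(u, v)                         # ... yet the correlation is 0
```

The first three coefficients coincide (up to rounding error), while `rho_flipped` is equal to `-rho`.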