4.2 Descriptive Statistics

As its name implies, descriptive statistics aim to describe the data; examples include:

  • sample size (overall and/or subgroups);

  • demographic breakdowns of participants;

  • measures of central tendency (e.g., mean, median, mode, etc.);

  • measures of variability (e.g., sample variance, minimum, maximum, interquartile range, etc.);

  • higher distribution moments (skew, kurtosis, etc.);

  • non-parametric measures (various quantiles);

  • derived measures (correlation coefficients), etc.

They can be presented as a single number, in a summary table, or even in graphical representations (e.g., histogram, pie chart, etc.).

4.2.1 Data Descriptions

Studies and experiments give rise to statistical units. These units are typically described with variables (and measurements), which are either qualitative (categorical) or quantitative (numerical).

Categorical variables take values (levels) from a finite set of pre-determined categories (or classes); numerical variables from a (potentially infinite) set of quantities.


Examples:

  1. Age is a numerical variable, measured in years, although it is often reported to the nearest integer year, or as a range of years, in which case it is an ordinal variable (a mixture of the qualitative and the quantitative).

  2. Typical numerical variables include distance in m, volume in cm\(^3\), etc.

  3. Disease diagnosis is a categorical variable with (at least) 2 categories (positive/negative).

  4. Compliance with a standard is a categorical variable: there could be 2 levels (compliant/non-compliant) or more (compliance, minor non-compliance issues, major non-compliance issues).

  5. Count variables are numerical variables.

Numerical Summaries

In a first pass, a variable can be described along (at least) 2 dimensions: its centrality and its spread (the skew and the kurtosis are sometimes also used):

  • centrality measures include the median, the mean, and, less frequently, the mode;

  • spread (or dispersion) measures include the standard deviation (sd), the quartiles, the inter-quartile range (IQR), and, less frequently, the range.

The median, range, and quartiles are all easily calculated from an ordered list of the data.

Sample Median

The median \(\text{med}(x_1,\ldots,x_n)\) of a sample of size \(n\) is a numerical value which splits the ordered data into \(2\) equal subsets: half the observations fall below the median, and half above it:

  • if \(n\) is odd, then the position of the median (or its rank) is \((n+1)/2\) – the median observation is the \(\frac{n+1}{2}^{\text{th}}\) ordered observation;

  • if \(n\) is even, then the median is the average of the \(\frac{n}{2}^{\text{th}}\) and the \((\frac{n}{2}+1)^{\text{th}}\) ordered observations.

The procedure is simple: order the data, and follow the even/odd rules to the letter.
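For instance, here is a minimal R sketch of this rule (base R's median() implements the same convention):

my.median <- function(x) {
  y <- sort(x)                    # order the data
  n <- length(y)
  if (n %% 2 == 1) y[(n + 1)/2]   # odd n: the middle observation
  else mean(y[n/2 + 0:1])         # even n: average the two middle ones
}
my.median(c(4,6,1,3,7))     # 4
my.median(c(1,3,4,6,7,23))  # 5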


Examples:

  1. \(\text{med}(4,6,1,3,7)=\text{med}(1,3,4,6,7)=x_{(5+1)/2}=x_3=4\). There are \(2\) observations below \(4\) \(\{1,3\}\), and \(2\) observations above \(4\) \(\{6,7\}\).

  2. \(\text{med}(1,3,4,6,7,23)=\frac{x_{6/2}+x_{6/2+1}}{2}=\frac{x_3+x_4}{2}=\frac{4+6}{2}=5\). There are \(3\) observations below \(5\) \(\{1,3,4\}\), and \(3\) observations above \(5\) \(\{6,7,23\}\).

  3. \(\text{med}(1,3,3,6,7)=x_{(5+1)/2}=x_3=3\). There seems to be only \(1\) observation below \(3\) \(\{1\}\), but \(2\) observations above \(3\) \(\{6,7\}\).


Note that there is ambiguity in the definition of the median: above and below should be interpreted as after and before, respectively, inclusive of the median value. In the last example above, for instance, there are \(2\) observations (\(x_1=1,x_2=3\)) before the median observation (\(x_3=3)\), and \(2\) after the median (\(x_4=6,x_5=7\)).

Sample Mean

The mean of a sample is simply the arithmetic average of its observations. For observations \(x_1, \ldots, x_n\), the sample mean is \[\begin{aligned} \text{AM}(x_1,\ldots,x_n)=\overline{x}= \frac{x_1+ \cdots+ x_n}{n} = \frac{1}{n}\left(\sum^{n}_{i=1} x_i\right)\end{aligned}\] Other means exist, such as the harmonic mean and the geometric mean: \[\begin{aligned} \text{HM}(x_1,\ldots,x_n)&=\frac{n}{\frac{1}{x_1}+\cdots+\frac{1}{x_n}} \\ \text{GM}(x_1,\ldots,x_n)&=\sqrt[n]{x_1\cdots x_n}. \end{aligned}\]

All of these measures attempt to find an “average” of the observations.


Examples:

  1. \(\text{AM}(4,6,1,3,7)=\frac{4+6+1+3+7}{5}=\frac{21}{5}=4.2 \approx 4= \text{med}(4,6,1,3,7)\).

  2. \(\text{AM}(1,3,4,6,7,23)=\frac{1+3+4+6+7+23}{6}=\frac{44}{6}\approx 7.3\), which is not nearly as close to \(\text{med}(1,3,4,6,7,23)=5\): the large observation \(23\) pulls the mean upward.

  3. \(\text{HM}(4,6,1,3,7)=\frac{5}{\frac{1}{4}+\frac{1}{6}+\frac{1}{1}+\frac{1}{3}+\frac{1}{7}}=\frac{5}{53/28}=\frac{140}{53}\approx 2.64\).

  4. \(\text{GM}(4,6,1,3,7)=\sqrt[5]{4\cdot 6\cdot 1\cdot 3\cdot 7}=\sqrt[5]{504}\approx 3.47\).


It can be shown that if \(x=(x_1,\ldots,x_n)\) and \(x_i>0\) for all \(i\), then \[\min(x)\leq\text{HM}(x)\leq \text{GM}(x)\leq \text{AM}(x)\leq \max(x).\] There is no need to decide on a single centrality measure when reporting on the data; in practice, we may use as many of them as we want to.
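These quantities are easy to verify in R; for the sample \(\{4,6,1,3,7\}\):

x <- c(4,6,1,3,7)
AM <- mean(x)                  # arithmetic mean, 4.2
HM <- length(x)/sum(1/x)       # harmonic mean, ~2.64
GM <- prod(x)^(1/length(x))    # geometric mean, ~3.47
c(min(x), HM, GM, AM, max(x))  # listed in increasing order, as claimed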

But there are situations where the mean (or the median) proves to be the better choice. The use of the mean, for instance, is theoretically supported by the Central Limit Theorem.

When the data distribution is roughly symmetric, then the median and the mean will be near one another. If the data distribution is skewed then the mean is pulled toward the long tail and as a result gives a distorted view of the centre (see Figure 4.1).

Consequently, medians are generally used for house prices, incomes, etc., as the median is robust against outliers and incorrect readings (whereas the mean is not).


Figure 4.1: Mean, median, mode in various skewness scenarios. [source unknown]

Standard Deviation

While the mean, the median, and the mode provide an idea as to where some of the distribution’s “mass” is located, the standard deviation provides some notion of its spread. The higher the standard deviation, the further away from the mean the variable values are likely to fall (see Figure 4.2). We will have more to say on this topic.


Figure 4.2: Normal distributions, with various means and standard deviations. [Wikipedia]

Quantiles

Another way to provide information about the spread of the data is via centiles, deciles, and/or quartiles.

The lower quartile \(Q_1(x_1,\ldots,x_n)\) of a sample of size \(n\), or \(Q_1\), is a numerical value which splits the ordered data into \(2\) unequal subsets: \(25\)% of the observations fall below \(Q_1\) and \(75\)% of the observations fall above \(Q_1\).

Similarly, the upper quartile \(Q_3\) splits the ordered data into \(75\)% of the observations below \(Q_3\), and \(25\)% of the observations above \(Q_3\).

The median can be interpreted as the middle quartile \(Q_2\) of the sample, the minimum as \(Q_0\), and the maximum as \(Q_4\): the vector \((Q_0,Q_1,Q_2,Q_3,Q_4)\) is the 5-point summary of the data.

Centiles \(p_i\), \(i=0,\ldots, 100\) and deciles \(d_j\), \(j=0,\ldots, 10\) run through different splitting percentages \[p_{25}=Q_1, p_{75}=Q_3, d_5=Q_2,\ \text{etc.}\]

They are found as with the median: sort the sample observations \(\{x_1, x_2, \ldots, x_n\}\) in increasing order as \[y_1\leq y_2\leq\ldots\leq y_n.\] The smallest observation \(y_1\) has rank \(1\) and the largest \(y_n\) has rank \(n\).

Any value that falls between the observations of ranks:

  • \(\lfloor\frac{n}{4}\rfloor\) and \(\lfloor\frac{n}{4}\rfloor+1\) is a lower quartile \(Q_1\);

  • \(\lfloor\frac{3n}{4}\rfloor\) and \(\lfloor\frac{3n}{4}\rfloor+1\) is an upper quartile \(Q_3\);

  • \(\lfloor\frac{in}{100}\rfloor\) and \(\lfloor\frac{in}{100}\rfloor+1\) is a centile \(p_i\), for \(i=1,\ldots,99\);

  • \(\lfloor\frac{jn}{10}\rfloor\) and \(\lfloor\frac{jn}{10}\rfloor+1\) is a decile \(d_j\), for \(j=1,\ldots,9\).

In practice, we compute the \(m\)-quantile of order \(k\) for the data, where \(k=1,\ldots,m-1\), by averaging the observations of ranks \[\left\lfloor\frac{kn}{m}\right\rfloor\quad\text{and}\quad \left\lfloor\frac{kn}{m}\right\rfloor+1\] (other protocols exist, such as using weighted averages).
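For instance, a small R sketch of this averaging protocol (assuming \(1\leq \lfloor kn/m\rfloor<n\); base R's quantile() offers nine interpolation types, so its output may differ slightly):

m.quantile <- function(x, k, m) {
  y <- sort(x)
  n <- length(y)
  r <- floor(k*n/m)      # rank of the lower of the two observations
  mean(y[c(r, r + 1)])   # average it with the next ordered observation
}
m.quantile(c(1,3,4,6,7,23), k = 1, m = 4)  # Q_1 = 2
m.quantile(c(1,3,4,6,7,23), k = 3, m = 4)  # Q_3 = 6.5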


Examples:

  1. \(Q_1(1,3,4,6,7)=\frac{1}{2}\left(y_{\lfloor 5/4 \rfloor}+y_{\lfloor 5/4 \rfloor+1}\right)=\frac{1}{2}\left(y_{1}+y_{2}\right)=\frac{1}{2}(1+3)=2.\)

  2. \(d_7(1,3,4,6,7,23)=\frac{1}{2}\left(y_{\lfloor 7(6)/10 \rfloor}+y_{\lfloor 7(6)/10 \rfloor+1}\right)=\frac{1}{2}\left(y_{4}+y_{5}\right)=\frac{1}{2}(6+7)=13/2.\)

  3. \(Q_1(1,3,4,6,7,23)=\frac{1}{2}\left(y_{\lfloor 6/4 \rfloor}+y_{\lfloor 6/4 \rfloor+1}\right)=\frac{1}{2}\left(y_{1}+y_{2}\right)=\frac{1}{2}(1+3)=2.\)

  4. \(Q_3(1,3,4,6,7,23)=\frac{1}{2}\left(y_{\lfloor 3(6)/4 \rfloor}+y_{\lfloor 3(6)/4 \rfloor+1}\right)=\frac{1}{2}\left(y_{4}+y_{5}\right)=\frac{1}{2}(6+7)=6.5.\)

  5. Consider the following midterm grades:

grades<-c(
  80,73,83,60,49,96,87,87,60,53,66,83,32,80,
  66,90,72,55,76,46,48,69,45,48,77,52,59,97,
  76,89,73,73,48,59,55,76,87,55,80,90,83,66,
  80,97,80,55,94,73,49,32,76,57,42,94,80,90,
  90,62,85,87,97,50,73,77,66,35,66,76,90,73,
  80,70,73,94,59,52,81,90,55,73,76,90,46,66,
  76,69,76,80,42,66,83,80,46,55,80,76,94,69,
  57,55,66,46,87,83,49,82,93,47,59,68,65,66,
  69,76,38,99,61,46,73,90,66,100,83,48,97,69,
  62,80,66,55,28,83,59,48,61,87,72,46,94,48,
  59,69,97,83,80,66,76,25,55,69,76,38,21,87,
  52,90,62,73,73,89,25,94,27,66,66,76,90,83,
  52,52,83,66,48,62,80,35,59,72,97,69,62,90,
  48,83,55,58,66,100,82,78,62,73,55,84,83,66,
  49,76,73,54,55,87,50,73,54,52,62,36,87,80,80
  )

The quartiles and mean can be obtained with R's summary() function:

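summary(grades)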
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.00   55.00   70.00   68.74   82.50  100.00 

Dispersion Measures

Some of the dispersion measures are fairly simple to compute: the sample range is \[\text{range}(x_1,\ldots, x_n)=\max\{x_i\}-\min\{x_i\};\] the inter-quartile range is \(\text{IQR}=Q_3-Q_1\).

The sample standard deviation \(s\) and sample variance \(s^2\) are estimates of the underlying distribution’s \(\sigma\) and \(\sigma^2\). For observations \(x_1, \ldots, x_n\), \[s^2 = \frac{1}{n-1} \sum^{n}_{i=1} (x_i-\overline{x})^{2}=\frac{1}{n-1} \left(\sum^{n}_{i=1} x_{i}^{2} - \frac{1}{n}\left(\sum^{n}_{i=1} x_i\right)^{2}\right);\] it differs from the (population) standard deviation and the (population) variance in the denominator: \(n-1\) is used instead of \(n\).
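All of these are one-liners in R; note that var() and sd() use the \(n-1\) denominator, while IQR() relies on interpolated quartiles that may differ slightly from the protocol described earlier:

x <- c(1,3,4,6,7)
var(x)           # sample variance, 5.7
sd(x)            # sample standard deviation, ~2.39
diff(range(x))   # sample range: max - min = 6
IQR(x)           # interquartile range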


Examples:

  1. The sample variance of \(\{1,3,4,6,7\}\) is \[\frac{1}{5-1}\left(\sum_{i=1}^5x_i^2-\frac{1}{5}\left(\sum_{i=1}^5x_i\right)^2\right)=\frac{1}{4}\left(111-\frac{1}{5}(21)^2\right)=5.7.\]
  2. The interquartile range of \(\{1,3,4,6,7,23\}\) is \[\text{IQR}(1,3,4,6,7,23)=Q_3(1,3,4,6,7,23)-Q_1(1,3,4,6,7,23)=6.5-2=4.5.\]

  3. We can provide more data descriptions of the grades dataset (see above) using psych’s describe() function:

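psych::describe(grades)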
   vars   n  mean    sd median trimmed   mad min max range  skew kurtosis  se
X1    1 211 68.74 17.37     70   69.43 19.27  21 100    79 -0.37    -0.46 1.2

4.2.2 Outliers

An outlier is an observation that lies outside the overall pattern in a distribution.

Let \(x\) be an observation in the sample; it is a

  • suspected outlier if \[x<Q_1-1.5\,\mbox{IQR} ~~ \mbox{ or } ~~ x>Q_3+1.5\,\mbox{IQR},\]

  • definite outlier if \[x<Q_1-3\,\mbox{IQR} ~~ \mbox{ or } ~~ x>Q_3+3\,\mbox{IQR}.\]


Example: in the set \(\{1,3,4,6,7,23\}\), \(Q_1=2\), \(Q_3=6.5\), and \(\text{IQR}=4.5\). Thus \[\begin{align*} Q_1-1.5\text{IQR}&=2-1.5(4.5)= -4.75 \\ Q_3+1.5\text{IQR}&=6.5+1.5(4.5)= 13.25 \\ Q_1-3\text{IQR}&=2-3(4.5)= -11.5 \\ Q_3+3\text{IQR}&=6.5+3(4.5)= 20.0 \\ \end{align*}\]

Since \(23>Q_3+3\text{IQR}\) (and \(23>Q_3+1.5\text{IQR}\)), 23 is both a definite (and a suspected) outlier of \(\{1,3,4,6,7,23\}\).
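These fences are easy to check in R; a short sketch for this example, re-using the quartiles obtained with the averaging protocol above (base R's quantile() and IQR() interpolate differently, so their fences may not match exactly):

x <- c(1,3,4,6,7,23)
Q1 <- 2; Q3 <- 6.5; iqr <- Q3 - Q1
x[x < Q1 - 1.5*iqr | x > Q3 + 1.5*iqr]  # suspected outliers: 23
x[x < Q1 - 3*iqr | x > Q3 + 3*iqr]      # definite outliers: 23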

4.2.3 Visual Summaries

The boxplot (also known as the box-and-whisker plot) is a quick and easy way to present a graphical summary of a univariate distribution:

  1. draw a box along the observation axis, with endpoints at the lower and upper quartiles \(Q_1\) (knees) and \(Q_3\) (shoulders), and with a “belt” at the median \(Q_2\);

  2. draw a line (whisker) extending from \(Q_1\) to the smallest observation lying within \(1.5\,\text{IQR}\) to the left of \(Q_1\);

  3. draw a line (whisker) extending from \(Q_3\) to the largest observation lying within \(1.5\,\text{IQR}\) to the right of \(Q_3\);

  4. any suspected outlier is plotted separately (as in Figure 4.3):


Figure 4.3: Boxplot with one (suspected) outlier.
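A comparable plot can be produced with base R's boxplot(); note that it computes the box ends from Tukey's hinges, which may differ slightly from the quartile protocol described earlier:

boxplot(c(1,3,4,6,7,23), horizontal = TRUE)  # 23 is plotted as a separate point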

Skewness

For symmetric distributions, the median and mean are equal, and the quartiles \(Q_1\) and \(Q_3\) are equidistant from \(Q_2\):

  • if \(Q_3-Q_2>Q_2-Q_1\) then the data distribution is skewed to the right (positively skewed);

  • if \(Q_3-Q_2<Q_2-Q_1\) then the data distribution is skewed to the left (negatively skewed).

Graphically, if the distance between the shoulders and the belt is larger than the distance between the belt and the knees, then the data is skewed to the right; if it’s the opposite, the data is skewed to the left.
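As a quick numerical check of this criterion on the grades data (using R's default quartile estimates):

Q <- quantile(grades, c(0.25, 0.50, 0.75))
unname((Q[3] - Q[2]) - (Q[2] - Q[1]))  # negative: the grades are mildly skewed to the left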

In the boxplots below, the data is skewed to the right.


Figure 4.4: Positively skewed datasets.

Histograms

Visual information about the distribution of the sample can also be provided via histograms.

A histogram for the sample \(\{x_1,\ldots,x_n\}\) is built according to the following specifications:

  • the range of the histogram is \(r=\max\{x_i\} - \min\{x_i\}\);

  • the number of bins should approach \(k=\sqrt{n}\), where \(n\) is the sample size;

  • the bin width should approach \(r/k\), and

  • the frequency of observations in each bin should be represented by the bin height.
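For instance, these rules of thumb can be applied to the grades data as follows (base R's hist() treats the requested number of bins as a suggestion, and may adjust it for readability):

n <- length(grades)
k <- ceiling(sqrt(n))     # about sqrt(211), i.e. 15 bins
hist(grades, breaks = k)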

Shapes of Datasets

Boxplots and histograms provide an easy visual impression of the shape of the dataset, which can eventually suggest a mathematical model for the situation of interest. Indeed, another way to define skewness is to say that the data is skewed to the right if the corresponding boxplot or histogram is stretched to the right, and vice-versa.


Examples:

  1. Consider the daily number of car accidents in Sydney, Australia, over a 40-day period:

6 3 2 24 12 3 7 14 21 9
14 22 15 2 17 10 7 7 31 7
18 6 8 2 3 2 17 7 7 21
13 23 1 11 3 9 4 9 9 25

The sorted values are:

1 2 2 2 2 3 3 3 3 4
6 6 7 7 7 7 7 7 8 9
9 9 9 10 11 12 13 14 14 15
17 17 18 21 21 22 23 24 25 31

We can then easily see that

\[\begin{align*} \text{min}&=y_1=1,\quad Q_1=\frac{1}{2}(y_{10}+y_{11})=5,\quad \text{med}=\frac{1}{2}(y_{20}+y_{21})=9, \\ Q_3&=\frac{1}{2}(y_{30}+y_{31})=16,\quad \text{max}=y_{40}=31. \end{align*}\]
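These values are easy to recover in R (accidents is a hypothetical name for the data above):

accidents <- c( 6, 3, 2,24,12, 3, 7,14,21, 9,
               14,22,15, 2,17,10, 7, 7,31, 7,
               18, 6, 8, 2, 3, 2,17, 7, 7,21,
               13,23, 1,11, 3, 9, 4, 9, 9,25)
y <- sort(accidents)
c(min = y[1], Q1 = mean(y[10:11]), med = mean(y[20:21]),
  Q3 = mean(y[30:31]), max = y[40])  # 1, 5, 9, 16, 31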

A corresponding histogram and boxplot are shown in Figure 4.5.


Figure 4.5: Histogram and boxplot of the Sydney accident dataset.

  2. A histogram and a boxplot can also be obtained for the grades dataset:

hist(grades, breaks = seq(20, 100, 10))

boxplot(grades)

Here is a fancier version of the histogram, constructed with the ggplot2 package (see Section 9.5 for details on the use of this R package).

fun.mode<-function(x){as.numeric(names(sort(-table(x)))[1])} # function to find the mode

ggplot2::ggplot(data=data.frame(grades), ggplot2::aes(grades)) + 
    ggplot2::geom_histogram(ggplot2::aes(y =..density..),    # approximated pdf
                 breaks=seq(20, 100, by = 10),               # 8 bins from 20 to 100 
                 col="black",                                # colour of outline
                 fill="blue",                                # fill colour of bars
                 alpha=.2) +                                 # transparency
    ggplot2::geom_density(col=2) +                           # colour of pdf curve
    ggplot2::geom_rug(ggplot2::aes(grades)) +                # adding a rug on x-axis
    ggplot2::geom_vline(ggplot2::aes(xintercept = mean(grades)),
                        col='red',size=2) +                  # vertical line: mean
    ggplot2::geom_vline(ggplot2::aes(xintercept = median(grades)),
                        col='darkblue',size=2) +             # vertical line: median
    ggplot2::geom_vline(ggplot2::aes(xintercept = fun.mode(grades)),
                        col='black',size=2)                  # vertical line:  mode

What is the shape of this dataset? Does it look like the class is in trouble?

4.2.4 Coefficient of Correlation

For bivariate (or multivariate) datasets, we can still study each variable separately, as in the previous sections, but we might also be interested in determining how the variables relate to one another.

For instance, consider the following data, consisting of \(n=20\) paired measurements \((x_i,y_i)\) of hydrocarbon levels \(x\) and pure oxygen levels \(y\) in fuels:

x = c(
  0.99,1.02,1.15,1.29,1.46,1.36,0.87,1.23,1.55,1.40,
  1.19,1.15,0.98,1.01,1.11,1.20,1.26,1.32,1.43,0.95
)
y = c(
  90.01,89.05,91.43,93.74,96.73,94.45,87.59,91.77,99.42,93.65,
  93.54,92.52,90.56,89.54,89.85,90.39,93.25,93.41,94.98,87.33
)
cbind(x,y)
         x     y
 [1,] 0.99 90.01
 [2,] 1.02 89.05
 [3,] 1.15 91.43
 [4,] 1.29 93.74
 [5,] 1.46 96.73
 [6,] 1.36 94.45
 [7,] 0.87 87.59
 [8,] 1.23 91.77
 [9,] 1.55 99.42
[10,] 1.40 93.65
[11,] 1.19 93.54
[12,] 1.15 92.52
[13,] 0.98 90.56
[14,] 1.01 89.54
[15,] 1.11 89.85
[16,] 1.20 90.39
[17,] 1.26 93.25
[18,] 1.32 93.41
[19,] 1.43 94.98
[20,] 0.95 87.33

Assume that we are interested in measuring the strength of association between \(x\) and \(y\).

We can use a graphical display to provide an initial description of the relationship: it appears that the observations lie around a hidden line.

plot(x,y)

For paired data \((x_i,y_i)\), \(i=1,\ldots,n\), the sample correlation coefficient of \(x\) and \(y\) is \[\rho_{XY} = \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum (x_i-\overline{x})^2\sum (y_i-\overline{y})^2}} =\frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}.\]

The coefficient \(\rho_{XY}\) is defined only if \(S_{xx}\neq 0\) and \(S_{yy}\neq 0\), i.e. if neither \(x_i\) nor \(y_i\) are constant.

The variables \(x\) and \(y\) are uncorrelated if \(\rho_{XY}=0\) (or is very small, in practice), and correlated if \(\rho_{XY} \neq 0\) (or if \(|\rho_{XY}|\) is “large”, in practice).


Example: for the fuel data above, we have \[S_{xy}\approx 10.18,\ S_{xx}\approx 0.68,\ S_{yy}\approx 173.38,\] so that \[\rho_{XY} \approx \frac{10.18}{\sqrt{0.68\cdot 173.38}}\approx 0.94.\]

This can also be computed directly in R:

(Sxx = sum((x-mean(x))^2))
(Syy = sum((y-mean(y))^2))
(Sxy = sum((x-mean(x))*(y-mean(y))))
(rho = Sxy/sqrt(Sxx*Syy))
[1] 0.68088
[1] 173.3769
[1] 10.17744
[1] 0.9367154

or by using the cor() function:

cor(x,y)
[1] 0.9367154

Properties

  • \(\rho_{XY}\) is unaffected by changes of origin, or by rescaling with positive constants (a negative multiplier only flips its sign). Adding constants to \(x\) does not change \(x-\overline x\) (similarly for \(y-\overline{y}\)), and multiplying \(x\) and \(y\) by positive constants scales the numerator and the denominator by the same factor (see the snippet following this list);

  • \(\rho_{XY}\) is symmetric in \(x\) and \(y\) (i.e. \(\rho_{XY}=\rho_{YX}\)) and \(-1 \leq \rho_{XY} \leq 1\); if \(\rho_{XY}=\pm 1\), then the observations \((x_i, y_i)\) all lie on a straight line with a positive (or negative) slope;

  • the sign of \(\rho_{XY}\) reflects the trend of the points;

  • a high correlation coefficient value \(|\rho_{XY}|\) does not necessarily imply a causal relationship between the two variables;

  • note that \(x\) and \(y\) can have a very strong non-linear relationship without \(\rho_{XY}\) reflecting it (\(-0.12\) on the left, \(0.93\) on the right in Figure 4.6).
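For instance, the first two properties are easy to verify on the fuel data:

cor(x, y)              # 0.9367154
cor(2*x + 5, 3*y - 1)  # unchanged under positive rescaling and shifts
cor(y, x)              # symmetric in x and y
cor(-x, y)             # the sign flips under a negative multiplier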


Figure 4.6: Examples of strong relationships that are not reflected by the coefficient of correlation.