8.4 Missing Values

Obviously, the best way to treat missing data is not to have any (T. Orchard, M. Woodbury, [149]).

Why does it matter that some values may be missing?

To start with, missing values can introduce bias into the analysis, which is rarely (if ever) a good thing. More pragmatically, they interfere with most analytical methods, which cannot easily accommodate missing observations without breaking down.

Consequently, when faced with missing observations, analysts have two options: they can discard the incomplete observation (which is not typically recommended, unless the data is missing completely at random), or they can create a replacement value for the missing observation. Imputation has drawbacks, since we can never be certain that the replacement value is the true value, but it is often the best available option (information in this section is taken partly from [150][153]).

Blank fields come in four flavours:

  • nonresponse – an observation was expected but none was entered;

  • data entry issues – an observation was recorded but was not entered in the dataset;

  • invalid entries – an observation was recorded but was considered invalid and has been removed, and

  • expected blanks – a field has been left blank, but not unexpectedly so.

Too many missing values of the first three types can be indicative of issues with the data collection process, while too many missing values of the fourth type can be indicative of poor questionnaire design (see [154] for a brief discussion on these topics).

Either way, missing values cannot simply be ignored; either

  • the corresponding record is removed from the dataset (not recommended without justification, as doing so may cause a loss of auxiliary information and may bias the analysis results), or

  • the missing value is imputed (that is to say, a reasonable replacement value is found).
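As a minimal sketch of why blanks cannot be ignored, the base R snippet below shows a summary statistic breaking down on missing values, and what record removal costs in terms of sample size. The data frame `survey` and its values are hypothetical.

```r
# Toy data frame (hypothetical values); blank fields are coded as NA.
survey <- data.frame(
  age    = c(34, NA, 51, 29, 42),
  income = c(52000, 61000, NA, 38000, 47000)
)

# Many summaries propagate NA unless told to skip missing values.
mean(survey$age)                 # NA
mean(survey$age, na.rm = TRUE)   # mean of the 4 observed ages

# Count the missing values in each variable.
n.missing <- colSums(is.na(survey))
n.missing

# Record removal: drop every unit with at least one missing value.
survey.complete <- na.omit(survey)
nrow(survey.complete)            # 3 of the 5 records survive
```

Note that removing the two incomplete records also discards the observed income of the second unit and the observed age of the third: this is the loss of auxiliary information mentioned above.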

8.4.1 Missing Value Mechanisms

The relevance of an imputation method is dependent on the underlying missing value mechanism; values may be:

  • missing completely at random (MCAR) – the item absence is independent of its value or of the unit’s auxiliary variables (e.g., an electrical surge randomly deletes an observation in the dataset);

  • missing at random (MAR) – the item absence is not completely random, and could, in theory, be accounted for by the unit’s complete auxiliary information, if available (e.g., if women are less likely to tell you their age than men for societal reasons, but not because of the age values themselves), and

  • not missing at random (NMAR) – the reason for nonresponse is related to the item value itself (e.g., if illicit drug users are less likely to admit to drug use than teetotallers).

The analyst’s main challenge in that regard is that the missing mechanism cannot typically be determined with any degree of certainty.
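The three mechanisms can be illustrated with a small simulation. The sketch below is not taken from the references: the variables, sample size, and missingness probabilities are all arbitrary assumptions, chosen to mirror the examples in the list above.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 1000

# Hypothetical survey: sex is an auxiliary variable, age is the item of interest.
sex <- sample(c("F", "M"), n, replace = TRUE)
age <- round(runif(n, min = 18, max = 80))

# MCAR: every age has the same 20% chance of going missing.
miss.mcar <- runif(n) < 0.20

# MAR: the chance of a missing age depends only on the observed variable sex.
miss.mar <- runif(n) < ifelse(sex == "F", 0.30, 0.10)

# NMAR: the chance of a missing age depends on the age value itself.
miss.nmar <- runif(n) < ifelse(age > 60, 0.40, 0.10)

# Mean of the ages that remain observed under each mechanism.
obs.means <- c(
  full = mean(age),
  MCAR = mean(age[!miss.mcar]),
  MAR  = mean(age[!miss.mar]),
  NMAR = mean(age[!miss.nmar])
)
round(obs.means, 1)
```

In this setup, one would expect the observed mean to stay close to the full mean under MCAR (and under MAR, since age was simulated independently of sex), and to be pulled downward under NMAR, where older respondents are more likely to be missing. In real data, of course, only the observed values are available, which is why the mechanism is so hard to establish.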

8.4.2 Imputation Methods

There are numerous statistical imputation methods. They each have their strengths and weaknesses; consequently, consultants and analysts should take care to select a method which is appropriate for the situation at hand.

  • In list-wise deletion, all units with at least one missing value are removed from the dataset. This straightforward strategy assumes MCAR; it can introduce bias if MCAR does not hold, and it leads to a reduction in the sample size and an increase in standard errors.

  • In mean or most frequent imputation, the missing values are substituted by the average or most frequent value in the unit’s subpopulation group (stratum). This commonly-used approach also assumes MCAR, but it can create distortions in the underlying distributions (such as a spike at the mean) and create spurious relationships among variables.

  • In regression or correlation imputation, the missing values are substituted using a regression on the other variables. This model assumes MAR and trains the regression on units with complete information, in order to take full advantage of the auxiliary information when it is available. However, it artificially reduces data variability and produces over-estimates of correlations.

  • In stochastic regression imputation, the regression estimates are augmented with random error terms. Just as in regression imputation, the model assumes MAR; an added benefit is that it tends to produce estimates that “look” more realistic than regression imputation, but it comes with an increased risk of type I error (false positives) due to small standard errors.

  • Last observation carried forward (LOCF) and its cousin next observation carried backward (NOCB) are useful for longitudinal data; a missing value can simply be substituted by the previous or next value. LOCF and NOCB can be used when the values do not vary greatly from one observation to the next, and when values are MCAR. Their main drawback is that they may be too “generous” for studies that are trying to determine the effect of a treatment over time, say.

  • Finally, in \(k\)-nearest-neighbour imputation, a missing entry in a MAR scenario is substituted by the average (or median, or mode) value from the subgroup of the \(k\) most similar complete respondents. This requires a notion of similarity between units (which is not always easy to define reasonably). The choice of \(k\) is somewhat arbitrary and can affect the imputation, potentially distorting the data structure when it is too large.
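Some of the simpler methods in the list can be sketched in a few lines of base R. The helper functions `locf()`, `nocb()`, and `knn.impute()` below are hypothetical illustrations written for this purpose (they are not library functions), and the toy data is invented; the nearest-neighbour helper uses the absolute difference on a single auxiliary variable as its notion of similarity, which is only one of many possible choices.

```r
# Mean imputation within strata (toy data).
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  x     = c(10, NA, 14, 20, 22, NA)
)
group.means <- ave(toy$x, toy$group, FUN = function(v) mean(v, na.rm = TRUE))
toy$x.mean <- ifelse(is.na(toy$x), group.means, toy$x)

# LOCF: carry the last observed value forward (leading NAs are left as is).
locf <- function(v) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1]
  }
  v
}

# NOCB: the same idea, applied to the reversed series.
nocb <- function(v) rev(locf(rev(v)))

series <- c(5, NA, NA, 7, NA, 8)
locf(series)   # 5 5 5 7 7 8
nocb(series)   # 5 7 7 7 8 8

# k-nearest-neighbour imputation on one auxiliary variable x: each missing y
# is replaced by the mean y of the k complete units whose x is closest.
# Assumes k is no larger than the number of complete units.
knn.impute <- function(y, x, k = 3) {
  complete <- which(!is.na(y))
  for (i in which(is.na(y))) {
    nearest <- complete[order(abs(x[complete] - x[i]))[seq_len(k)]]
    y[i] <- mean(y[nearest])
  }
  y
}

x <- c(1, 2, 3, 4, 10)
y <- c(10, 20, NA, 40, 100)
knn.impute(y, x, k = 2)   # the missing entry becomes mean(20, 40) = 30
```

Increasing `k` to 4 in the last call would pull the distant unit at `x = 10` into the average, which illustrates how a value of \(k\) that is too large can distort the imputation.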

What does imputation look like in practice? Consider the following scenario (which is, somewhat embarrassingly, based on a true story).


Example: after marking the final exams of the 211 students who did not drop her course in Advanced Retroencabulation at State University, Dr. Helga Vanderwhede creates a data frame (grades) of final exam grades and mid-term grades.

MT=c(
 80,73,83,60,49,96,87,87,60,53,66,83,32,80,66,90,72,55,76,46,
 48,69,45,48,77,52,59,97,76,89,73,73,48,59,55,76,87,55,80,90,
 83,66,80,97,80,55,94,73,49,32,76,57,42,94,80,90,90,62,85,87,
 97,50,73,77,66,35,66,76,90,73,80,70,73,94,59,52,81,90,55,73,
 76,90,46,66,76,69,76,80,42,66,83,80,46,55,80,76,94,69,57,55,
 66,46,87,83,49,82,93,47,59,68,65,66,69,76,38,99,61,46,73,90,
 66,100,83,48,97,69,62,80,66,55,28,83,59,48,61,87,72,46,94,48,
 59,69,97,83,80,66,76,25,55,69,76,38,21,87,52,90,62,73,73,89,
 25,94,27,66,66,76,90,83,52,52,83,66,48,62,80,35,59,72,97,69,
 62,90,48,83,55,58,66,100,82,78,62,73,55,84,83,66,49,76,73,54,  
 55,87,50,73,54,52,62,36,87,80,80
)

FE=c(
41,54,93,49,92,85,37,92,61,42,74,84,61,21,75,49,36,62,92,85,
50,90,52,63,64,85,66,51,41,75,4,46,38,71,42,18,76,42,94,53,
77,65,95,3,74,0,97,62,74,61,80,47,39,92,59,37,59,71,20,67,
69,88,53,52,81,41,81,48,67,65,92,75,68,55,67,51,83,71,58,37,
65,66,51,43,83,34,55,59,20,62,22,70,64,59,73,74,73,53,44,36,
62,45,80,85,41,80,84,44,73,72,60,65,78,60,34,91,40,41,54,91,
49,92,85,37,92,61,42,74,84,61,21,75,49,36,62,92,85,50,92,52,
63,64,85,66,51,41,75,4,46,38,71,42,18,76,42,92,53,77,65,92,
3,74,0,52,62,74,61,80,47,39,92,59,37,59,71,20,67,69,88,53,
52,81,41,81,48,67,65,94,75,68,55,67,51,83,71,58,37,65,66,51,
43,83,34,55,59,20,62,22,70,64,59
)

grades=data.frame(MT,FE)
summary(grades)

       MT               FE       
 Min.   : 21.00   Min.   : 0.00  
 1st Qu.: 55.00   1st Qu.:46.50  
 Median : 70.00   Median :62.00  
 Mean   : 68.74   Mean   :60.09  
 3rd Qu.: 82.50   3rd Qu.:75.00  
 Max.   :100.00   Max.   :97.00  

She plots the distribution of each set of grades, and then the final exam grades (\(y\)) against the mid-term exam grades (\(x\)), as seen below.

hist(MT, xlim=c(0,100), xlab=c("Midterm Grades"))

hist(FE, xlim=c(0,100), xlab=c("Final Exam Grades"))

plot(grades, xlim=c(0,100), ylim=c(0,100), 
     xlab=c("Midterm Grade"), ylab=c("Final Exam grade"), 
     main=c("Course Results"))

Looking at the data, she sees that final exam grades are weakly correlated with mid-term exam grades: students who performed well on the mid-term tended to perform well on the final, and students who performed poorly on the mid-term tended to perform poorly on the final (as is usually the case), but the link is not that strong.

cor(grades$MT,grades$FE)
[1] 0.5481776

She also sees that there is a fair amount of variability in the data: the noise is not very tight around the (eye-balled) line of best fit.

The linear regression model is:

model <- lm(FE ~ MT, data=grades)   

The lm() summary is:

model 
summary(model)

Call:
lm(formula = FE ~ MT, data = grades)

Coefficients:
(Intercept)           MT  
    14.0097       0.6704  


Call:
lm(formula = FE ~ MT, data = grades)

Residuals:
    Min      1Q  Median      3Q     Max 
-76.035  -8.759   2.131  10.987  45.142 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.00968    5.01523   2.793   0.0057 ** 
MT           0.67036    0.07075   9.475   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.81 on 209 degrees of freedom
Multiple R-squared:  0.3005,    Adjusted R-squared:  0.2972 
F-statistic: 89.78 on 1 and 209 DF,  p-value: < 2.2e-16

She then plots the line of best fit and the residual plots as follows:

ggplot2::ggplot(model) + ggplot2::geom_point(ggplot2::aes(x=MT, y=FE)) +   
    ggplot2::geom_line(ggplot2::aes(x=MT, y=.fitted), color="blue" ) +
    ggplot2::theme_bw() + 
    ggplot2::xlab(c("Midterm Grade")) + 
    ggplot2::ylab(c("Final Exam Grade")) +  
    ggplot2::ggtitle(c("Line of Best Fit"))

ggplot2::ggplot(model) + ggplot2::geom_point(ggplot2::aes(x=MT, y=FE)) +   
    ggplot2::geom_line(ggplot2::aes(x=MT, y=.fitted), color="blue" ) + 
    ggplot2::geom_linerange(ggplot2::aes(x=MT, ymin=.fitted, ymax=FE), color="red") +
    ggplot2::theme_bw() + 
    ggplot2::xlab(c("Midterm Grade")) + 
    ggplot2::ylab(c("Final Exam Grade")) +  
    ggplot2::ggtitle(c("Residuals"))

Furthermore, she realizes that the final exam was harder than the students expected: the slope of the line of best fit is smaller than \(1\), and only 29% of the observations lie on or above the line FE=MT. Since most students could not match their mid-term exam performance, she suspects that they simply did not prepare for the exam seriously (and not that she made the exam too difficult, no matter what her ratings on RateMyProfessor.com suggest).

sum(grades$MT <= grades$FE)/nrow(grades)
[1] 0.2890995

plot(grades, xlim=c(0,100), ylim=c(0,100), 
     xlab=c("Midterm Grade"), ylab=c("Final Exam grade"), 
     main=c("Course Results"))
abline(a=0, b=1, col="red")

As Dr. Vanderwhede comes to terms with her disappointment, she takes a deeper look at the numbers, at some point sorting the dataset according to the mid-term exam grades.

s.grades <- grades[order(-grades$MT),]
head(s.grades,16)
     MT FE
122 100 92
188 100 94
116  99 91
28   97 51
44   97  3
61   97 69
125  97 92
143  97 85
179  97 88
6    96 85
47   94 97
54   94 92
74   94 55
97   94 73
139  94 92
162  94 74

It looks like good old Mary Sue (row number 47) performed better on the final than on the mid-term (where her performance was already superlative), scoring the highest grade. What a great student Mary Sue is!

plot(s.grades[,c("MT","FE")], xlim=c(0,100), ylim=c(0,100), 
     col=ifelse(row.names(s.grades)=="47",'red','black'),
     pch=ifelse(row.names(s.grades)=="47",22,1), bg='red',
     xlab=c("Midterm Grade"), ylab=c("Final Exam grade"), 
     main=c("Mary Sue!"))

She continues to toy with the spreadsheet until the phone rings. After a long and exhausting conversation with Dean Bitterman about teaching loads and State University’s reputation, Dr. Vanderwhede returns to the spreadsheet and notices in horror that she has accidentally deleted the final exam grades of all students with a mid-term grade greater than 93.

s.grades$FE.NA <- ifelse(s.grades$MT>93,NA,s.grades$FE)

What is she to do? Anyone with a modicum of technical savvy would advise her to either undo her changes or to close the file without saving the changes, but in full panic mode, the only solution that comes to her mind is to impute the missing values.

She knows that the missing final grades are MAR (and not MCAR since she remembers sorting the data along the MT values); she produces the imputations shown in Figure 8.4.

Figure 8.4: Imputed values for Dr. Vanderwhede’s dataset - original data, list-wise deletion, mean imputation, regression imputation, stochastic regression imputation.
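The code that produced these imputations is not shown in the text. The sketch below is one plausible way to compute them, under stated assumptions: the helper `impute.final()` and the column names `FE.NA.mean` and `FE.NA.sreg` are hypothetical (only `FE.NA` and `FE.NA.reg` appear in this section), and the helper is demonstrated on a small synthetic stand-in dataset rather than on the 211 actual grades.

```r
# Hypothetical helper: expects a data frame with columns MT (complete) and
# FE.NA (final exam grades, with NA for the deleted entries).
impute.final <- function(df) {
  miss <- is.na(df$FE.NA)

  # Mean imputation: each missing grade gets the mean of the observed grades.
  df$FE.NA.mean <- ifelse(miss, mean(df$FE.NA, na.rm = TRUE), df$FE.NA)

  # Regression imputation: by default, lm() drops the incomplete rows when
  # fitting; the fitted line then predicts the missing grades from MT.
  fit  <- lm(FE.NA ~ MT, data = df)
  pred <- predict(fit, newdata = df)
  df$FE.NA.reg <- ifelse(miss, pred, df$FE.NA)

  # Stochastic regression imputation: add normal noise to each prediction,
  # using the residual standard error of the fit as standard deviation.
  noise <- rnorm(nrow(df), mean = 0, sd = summary(fit)$sigma)
  df$FE.NA.sreg <- ifelse(miss, pred + noise, df$FE.NA)

  df
}

# Synthetic stand-in for the grades (invented values, not the real data).
set.seed(2)   # arbitrary seed
toy <- data.frame(MT = seq(30, 100, by = 2))
toy$FE <- pmin(100, pmax(0, round(14 + 0.67 * toy$MT + rnorm(nrow(toy), sd = 18))))
toy$FE.NA <- ifelse(toy$MT > 93, NA, toy$FE)

# List-wise deletion simply drops the incomplete units.
toy.lw <- toy[!is.na(toy$FE.NA), ]

# The three imputations, side by side with the true values.
toy.imp <- impute.final(toy)
toy.imp[is.na(toy.imp$FE.NA), ]
```

With the data of this example, the same helper could be applied as `impute.final(s.grades)` once the `FE.NA` column has been created; whether this reproduces the imputed values shown in this section exactly is not guaranteed. Note also that the stochastic version can produce grades outside the range from 0 to 100, which might need to be truncated in practice.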

She remembers what the data looked like originally, and concludes that the best imputation method is the stochastic regression model.

This conclusion only applies to this specific example, however. In general, that might not be the case due to various No Free Lunch results.

The main take-away from this example is that various imputation strategies lead to different outcomes, and perhaps more importantly, that even though the imputed data might “look” like the true data, we have no way to measure its departure from reality – any single imputed value is likely to be completely off.

Mathematically, this might not be problematic, as the average departure is likely to be relatively small, but in a business context or a personal one, this might create gigantic problems – how is Mary Sue likely to feel about Dr. Vanderwhede’s solution to her conundrum?

s.grades[row.names(s.grades)=="47",c("MT","FE","FE.NA.reg")]
   MT FE FE.NA.reg
47 94 97  77.54035

And how would Dean Bitterman react were he to find out about the imputation scenario from irate students? The solution has to be compatible with the ultimate data science objective: from Dr. Vanderwhede’s perspective, perhaps the only thing that matters is capturing the essence of the students’ performance, but from the student’s perspective…

Even though such questions are not quantitative in nature, their answer will impact any actionable solution.

8.4.3 Multiple Imputation

Another drawback of imputation is that it tends to increase the noise in the data, because the imputed data is treated as the actual data.

In multiple imputation, the impact of that noise can be reduced by consolidating the analysis outcome from multiple imputed datasets. Once an imputation strategy has been selected on the basis of the (assumed) missing value mechanism,

  1. the imputation process is repeated \(m\) times to produce \(m\) versions of the dataset (assuming a stochastic procedure – if the imputed dataset is always the same, this procedure is worthless);

  2. each of these datasets is analyzed, yielding \(m\) outcomes, and

  3. the \(m\) outcomes are pooled into a single result for which the mean, variance, and confidence intervals are known.
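The three steps can be sketched as follows, using stochastic regression imputation as the imputation strategy and the mean of the variable as the analysis outcome. The data is simulated and every name and parameter value is an assumption made for illustration. The pooling step uses the combining rules from [153]: the pooled estimate is the average of the \(m\) estimates, and the total variance is \(T = W + (1 + 1/m)B\), where \(W\) is the average within-imputation variance and \(B\) is the variance between the \(m\) estimates.

```r
set.seed(3)   # arbitrary seed

# Hypothetical data: y depends linearly on x, and y is missing when x is large.
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 2)
y.na <- ifelse(x > 8, NA, y)
miss <- is.na(y.na)

# Regression fitted on the complete units only.
fit  <- lm(y.na ~ x)
pred <- predict(fit, newdata = data.frame(x = x))
s    <- summary(fit)$sigma

m   <- 20
est <- numeric(m)   # analysis outcome for each imputed dataset
se2 <- numeric(m)   # estimated sampling variance of each outcome

for (j in seq_len(m)) {
  # Step 1: one stochastic regression imputation of the missing values.
  y.imp <- ifelse(miss, pred + rnorm(n, sd = s), y.na)
  # Step 2: the analysis (here, simply estimating the mean of y).
  est[j] <- mean(y.imp)
  se2[j] <- var(y.imp) / n
}

# Step 3: pool the m outcomes.
q.bar <- mean(est)                # pooled estimate
W     <- mean(se2)                # within-imputation variance
B     <- var(est)                 # between-imputation variance
T.var <- W + (1 + 1 / m) * B      # total variance

# Approximate 95% confidence interval, based on a t reference distribution.
df.mi <- (m - 1) * (1 + W / ((1 + 1 / m) * B))^2
ci <- q.bar + c(-1, 1) * qt(0.975, df.mi) * sqrt(T.var)

c(estimate = q.bar, std.error = sqrt(T.var), lower = ci[1], upper = ci[2])
```

This sketch keeps the regression coefficients fixed across the \(m\) imputations, which is a simplification: a more careful procedure would also reflect the uncertainty in the fitted coefficients, as dedicated packages (such as mice in R) aim to do.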

On the plus side, multiple imputation is easy to implement and flexible, as it can be used in most situations (MCAR, MAR, even NMAR in certain cases), and it accounts for uncertainty in the imputed values.

However, \(m\) may need to be quite large when the values are missing in large quantities from many of the dataset’s features, which can substantially slow down the analyses.

There may also be additional technical challenges when the output of the analyses is not a single value but some more complicated object. A generalization of multiple imputation was used by Transport Canada to predict the blood alcohol content (BAC) level in fatal traffic collisions that involved pedestrians [155].

References

[149] T. Orchard and M. Woodbury, A missing information principle: Theory and applications. University of California Press, 1972.
[150] S. Hagiwara, “Nonresponse error in survey sampling: Comparison of different imputation methods.” Honours Thesis, School of Mathematics and Statistics, Carleton University, 2012.
[153] D. B. Rubin, Multiple imputation for nonresponse in surveys. Wiley, 1987.
[154] P. Boily, “Principles of data collection,” Data Science Report Series, 2020.
[155]