8.3 Data Quality

Calvin’s Dad: OK Calvin. Let’s check over your math homework.
Calvin: Let’s not and say we did.
Calvin’s Dad: Your teacher says you need to spend more time on it. Have a seat.
Calvin: More time?! I already spent 10 whole minutes on it! 10 minutes shot! Wasted! Down the drain!
Calvin’s Dad: You’ve written here \(8+4=7\). Now you know that’s not right.
Calvin: So I was off a little bit. Sue me.
Calvin’s Dad: You can’t add things and come out with less than you started with!
Calvin: I can do that! It’s a free country! I’ve got my rights!
(B. Watterson, Calvin and Hobbes, 15-09-1990.)

The quality of the data has an important effect on the quality of the results; as the saying goes, “garbage in, garbage out.” Data is said to be sound when it has as few issues as possible with the following (a small validation sketch is given after the list):

  • validity – are observations sensible, given the data type, range, mandatory response, uniqueness, value, regular expressions, etc. (e.g. a value that is expected to be text is a number, a value that is expected to be positive is negative, etc.)?;

  • completeness – are there missing observations (more on this in a subsequent section)?;

  • accuracy and precision – are there measurement and/or data entry errors (e.g. an individual has \(3\) children but only \(2\) are recorded; see Figure 8.2, which links accuracy to bias and precision to the standard error)?;

  • consistency – are there conflicting observations (e.g. an individual has no children, but the age of one kid is recorded, etc.)?, and

  • uniformity – are units used uniformly throughout (e.g. an individual is 6ft tall, whereas another one is 145cm tall)?
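
These soundness criteria can often be translated into simple programmatic checks. The sketch below (Python, with pandas) is a minimal illustration on a made-up table; the column names, values, and cut-offs are assumptions chosen purely for the example.

```python
import pandas as pd

# A small, made-up dataset used only to illustrate the checks; the column
# names and values are assumptions, not taken from a real source.
df = pd.DataFrame({
    "height":     [183, 145, -170, 72, 160],   # one negative value, one entry in inches
    "unit":       ["cm", "cm", "cm", "in", "cm"],
    "n_children": [0, 2, 1, 3, None],
    "child1_age": [5, 4, 7, 2, None],          # row 0: age recorded although n_children is 0
})

# validity: values outside a sensible range (or with the wrong sign)
print(df[(df["height"] <= 0) | (df["height"] > 250)])

# completeness: missing observations
print(df[df["n_children"].isna()])

# consistency: conflicting observations (no children, yet a child's age is recorded)
print(df[(df["n_children"] == 0) & df["child1_age"].notna()])

# uniformity: units should be used uniformly throughout
print(df["unit"].value_counts())
```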


Figure 8.2: Accuracy as bias, precision as standard error [author unknown].

Finding an issue with data quality after the analyses are completed is a sure-fire way of losing the stakeholder’s or client’s trust – check early and often!

8.3.1 Common Error Sources

If the analysts have some control over the data collection and initial processing, regular data validation tests are easier to set up.

When analysts are dealing with legacy, inherited, or combined datasets, however, it can be difficult to recognize errors that arise from the following (a short screening sketch is given after the list):
  • missing data being given a code;

  • NA/blank entries being given a code;

  • data entry errors;

  • coding errors;

  • measurement errors;

  • duplicate entries;

  • heaping (see Figure 8.3 for an example);

  • etc.
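
Some of these error sources leave recognizable signatures that can be screened for mechanically, as in the short sketch below (Python/pandas, on made-up records); the sentinel codes used are assumptions, since conventions for coded-missing values vary from one dataset to another.

```python
import pandas as pd

# Hypothetical survey extract; the sentinel codes are assumptions, chosen
# because values such as 99, 999, or -1 are often used in legacy datasets
# to encode "missing" or "not applicable".
SENTINELS = {99, 999, -1}

df = pd.DataFrame({
    "respondent": [101, 102, 103, 103, 104],
    "age":        [34, 99, 27, 27, -1],
    "hours":      [7.5, 8.0, 999, 999, 6.0],
})

# coded-missing values: numeric entries matching common sentinel codes
print(df[df[["age", "hours"]].isin(SENTINELS).any(axis=1)])

# duplicate entries: identical rows that may have been entered twice
print(df[df.duplicated(keep=False)])
```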

 

Figure 8.3: An illustration of heaping behaviour: self-reported time spent working in a day [personal file]. The entries for 7, 7.5, and 8 hours are omitted. Note the rounding off at various multiples of 5 minutes.
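
Heaping can often be detected by measuring how many entries land exactly on “round” values. The sketch below uses made-up self-reported durations (in minutes); the multiples tested are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical self-reported work durations, in minutes (values are made up).
minutes = np.array([480, 450, 455, 432, 510, 540, 475, 465, 600, 495])

# share of entries landing exactly on "round" durations
for step in (5, 15, 30, 60):
    share = np.mean(minutes % step == 0)
    print(f"multiples of {step:>2} min: {share:.0%}")

# shares far above what unrounded reporting would produce (roughly 1/step of
# the integer-minute entries) suggest that respondents are rounding, i.e. heaping
```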

8.3.2 Detecting Invalid Entries

Potentially invalid entries can be detected with the help of a number of methods (a small sketch follows the list):

  • univariate descriptive statistics – \(z\)-score, count, range, mean, median, standard deviation, etc.;

  • multivariate descriptive statistics – \(n\)-way tables and logic checks, and

  • data visualization – scatterplot, histogram, joint histogram, etc. (see Data Visualization and Data Exploration for details).
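
As a rough illustration of the univariate methods, the sketch below computes descriptive statistics and a \(z\)-score screen on a made-up variable (the cut-offs are assumptions); a 2-way table check appears in the worked example further down.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the |z| > 2 cut-off is an assumption made because the
# sample is tiny (|z| > 3 is a common choice for larger samples).
age = pd.Series([34, 29, 31, 27, 104, 33], name="age")

# univariate descriptive statistics: count, mean, std, quartiles, min/max
print(age.describe())

# z-scores: flag observations far from the mean (the 104 stands out)
z = (age - age.mean()) / age.std()
print(age[np.abs(z) > 2])

# a naive range check alone would not flag 104, since it is a possible age
print(age[(age < 0) | (age > 120)])
```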

Importantly, univariate tests do not always tell the whole story.


Example: consider an artificial medical dataset consisting of 38 patients’ records, containing, among other fields, the sex and the pregnancy status of the patients.

A summary of the data of interest is provided by the frequency counts (1-way tables) below:

The analyst can quickly notice that some values are missing (in green) and that an entry has been miscoded as 99 (in yellow). Using only these univariate summaries, however, it is impossible to decide what to do with these invalid entries.

The 2-way frequency counts shed some light on the situation, and uncover other potential issues with the data.
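
In practice, such 1-way and 2-way frequency counts are straightforward to produce; the sketch below (Python/pandas) uses a small made-up stand-in that mimics the issues under discussion (a doubly-blank record, partially missing entries, a pregnancy status miscoded as 99), rather than the original 38 records.

```python
import numpy as np
import pandas as pd

# A made-up stand-in for the artificial 38-patient dataset; it does not
# reproduce the original records or counts, only the kinds of issues
# discussed in the text.
df = pd.DataFrame({
    "sex":      ["F", "M", "F", np.nan, "M", np.nan, "M", "F", "M"],
    "pregnant": ["Yes", "No", np.nan, np.nan, "99", "No", "Yes", "No", "No"],
})

# 1-way frequency counts, keeping missing values visible
print(df["sex"].value_counts(dropna=False))
print(df["pregnant"].value_counts(dropna=False))

# 2-way frequency counts; missing values are given an explicit label so that
# the doubly-missing record and the miscoded 99 (a male patient) stand out
print(pd.crosstab(df["sex"].fillna("(missing)"),
                  df["pregnant"].fillna("(missing)")))
```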

One of the green entries is actually blank along the two variables; depending on the other information, this entry could be a candidate for imputation or outright deletion (more on these concepts in the next section).

Three other observations are missing a value along exactly one variable, but the information provided by the other variables may be complete enough to warrant imputation. Of course, if more information is available about the patients, the analyst may be able to determine why the values were missing in the first place (although privacy concerns at the collection stage might muddy the waters).

The mis-coded information on the pregnancy status (99, in yellow) is linked to a male patient, and as such, re-coding it as ‘No’ is likely to be a reasonable decision (although not necessarily the correct one… data measurements are rarely as clear cut as we may think upon first reflection).

A similar reasoning process should make the analyst question the validity of the entry shaded in red – the entry might very well be correct, but it is important to at least inquire about this data point, as the answer could lead to an eventual re-framing of the definitions and questions used at the collection stage.
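
One way to act on this reasoning, still on the hypothetical stand-in from the previous sketch (not the original data): re-code the miscoded status for male patients, and flag the questionable combination for follow-up rather than silently correcting it.

```python
import numpy as np
import pandas as pd

# same hypothetical stand-in as in the previous sketch
df = pd.DataFrame({
    "sex":      ["F", "M", "F", np.nan, "M", np.nan, "M", "F", "M"],
    "pregnant": ["Yes", "No", np.nan, np.nan, "99", "No", "Yes", "No", "No"],
})

# re-code the miscoded pregnancy status (99) as "No" for male patients
df.loc[(df["sex"] == "M") & (df["pregnant"] == "99"), "pregnant"] = "No"

# do not silently "fix" the questionable combination (assumed here to be a
# male patient recorded as pregnant); flag it for follow-up instead
df["flag_review"] = (df["sex"] == "M") & (df["pregnant"] == "Yes")

print(df[["sex", "pregnant", "flag_review"]])
```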

In general, there is no universal or one-size-fits-all approach – a lot depends on the nature of the data. As always, domain expertise can provide valuable help and suggest fruitful exploration avenues.