8.2 General Principles

Dilbert: I didn’t have any accurate numbers, so I just made up this one. Studies have shown that accurate numbers aren’t any more useful than the ones you make up.
Pointy-Haired Boss: How many studies showed that?
Dilbert: [beat] Eighty-seven.
(S. Adams, Dilbert, 8 May 2008)

8.2.1 Approaches to Data Cleaning

We recognize two main philosophical approaches to data cleaning and validation:

  • methodical, and

  • narrative.

The methodical approach consists in running through a checklist of potential issues and flagging those that apply to the data.
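
As a rough illustration, a checklist of this kind might be sketched as a list of small, reusable check functions that are run one after the other. The toy records, the field names, the individual checks, and the convention that province codes are upper-case without stray spaces are all assumptions made for the example; they are not part of any particular project's checklist.

```python
from collections import Counter

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"id": 1, "age": "34", "province": "ON"},
    {"id": 2, "age": "", "province": "on"},
    {"id": 3, "age": "abc", "province": "QC"},
    {"id": 3, "age": "51", "province": "QC "},
]


def is_blank(value):
    """True if the value is None or an empty/whitespace-only string."""
    return value is None or str(value).strip() == ""


def check_missing(rows, field):
    """Return the indices of rows where the field is missing or blank."""
    return [i for i, r in enumerate(rows) if is_blank(r.get(field))]


def check_not_numeric(rows, field):
    """Return the indices of rows whose non-blank value cannot be read as a number."""
    flagged = []
    for i, r in enumerate(rows):
        value = r.get(field)
        if is_blank(value):
            continue  # blanks are reported by check_missing instead
        try:
            float(value)
        except (TypeError, ValueError):
            flagged.append(i)
    return flagged


def check_duplicate_keys(rows, field):
    """Return the indices of rows whose key value appears more than once."""
    counts = Counter(r.get(field) for r in rows)
    return [i for i, r in enumerate(rows) if counts[r.get(field)] > 1]


def check_untidy_text(rows, field):
    """Return the indices of rows with stray whitespace or non-upper-case text
    (assumed convention: codes are stored upper-case and trimmed)."""
    return [
        i for i, r in enumerate(rows)
        if isinstance(r.get(field), str) and r[field] != r[field].strip().upper()
    ]


# The checklist itself: (label, check function, field to inspect).
CHECKLIST = [
    ("missing age", check_missing, "age"),
    ("non-numeric age", check_not_numeric, "age"),
    ("duplicate id", check_duplicate_keys, "id"),
    ("untidy province", check_untidy_text, "province"),
]


def run_checklist(rows, checklist):
    """Run every check and keep only the issues that apply to the data."""
    report = {}
    for label, check, field in checklist:
        flagged = check(rows, field)
        if flagged:
            report[label] = flagged
    return report


report = run_checklist(records, CHECKLIST)
for label, flagged in report.items():
    print(f"{label}: rows {flagged}")
```

Because the checks only look at the form of the values, the same functions could in principle be pointed at other fields or other datasets by editing the checklist alone.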

The narrative approach, on the other hand, consists in exploring the dataset while searching for unlikely or irregular patterns.

Which approach the consultant or analyst follows depends on a number of factors, not the least of which is the client’s needs and views on the matter; it is important to discuss this point with the client.

8.2.2 Pros and Cons

The methodical approach focuses on syntax. The checklist is typically context-independent, so it (or a subset of it) can be reused from one project to the next, which makes data analysis pipelines easy to implement and automate. In the same vein, common errors are easily identified.

On the flip side, the checklist may be quite extensive and the entire process may prove time-consuming.

The biggest disadvantage of this approach is that it makes it difficult to identify new types of errors.

The narrative approach focuses on semantics; even false starts can improve the analyst’s understanding of the data before a switch to a more mechanical approach.

It is easy, however, to miss important sources of errors and invalid observations when the datasets have a large number of features.

There is an additional downside: domain expertise, coupled with the narrative approach, may bias the process by neglecting “uninteresting” areas of the dataset.

8.2.3 Tools and Methods

A non-exhaustive list of common data issues can be found in the Data Cleaning Bingo Card (see Figure 8.1).


Figure 8.1: Data cleaning bingo card [J. Schellinck].

Other methods include:

  • visualizations – which may help easily identify observations that need to be further examined;

  • data summaries – number of missing observations; 5-point summary, mean, standard deviation, skew, and kurtosis for numerical variables; distribution tables for categorical variables;

  • \(n\)-way tables – counts for joint distributions of categorical variables;

  • small multiples – tables/visualizations indexed along categorical variables; and

  • preliminary data analyses – which may provide “huh, that’s odd…” realizations.
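
A minimal sketch of three of the items above (data summaries, a two-way table, and a tabular version of small multiples) is given below, using only the Python standard library. The toy rows and the names `dept`, `status`, and `salary` are invented for illustration; skew and kurtosis are left out to keep the example short, and the quartiles follow the default method of `statistics.quantiles`, which is only one of several conventions.

```python
import statistics
from collections import Counter

# Hypothetical toy data; one salary is missing, one is suspiciously large,
# and one status value is misspelled.
rows = [
    {"dept": "A", "status": "active", "salary": 40.0},
    {"dept": "A", "status": "active", "salary": 50.0},
    {"dept": "A", "status": "inactive", "salary": 60.0},
    {"dept": "B", "status": "active", "salary": 70.0},
    {"dept": "B", "status": "actve", "salary": 700.0},
    {"dept": "B", "status": "active", "salary": None},
]


def summarize(values):
    """Count missing values and summarize the non-missing numerical ones."""
    present = sorted(v for v in values if v is not None)
    summary = {"n": len(values), "missing": len(values) - len(present)}
    if len(present) >= 2:  # quartiles and stdev need at least two points
        q1, median, q3 = statistics.quantiles(present, n=4)
        summary.update(
            min=present[0], q1=q1, median=median, q3=q3, max=present[-1],
            mean=statistics.mean(present), stdev=statistics.stdev(present),
        )
    return summary


# Data summary for one numerical variable.
overall = summarize([r["salary"] for r in rows])

# Two-way table: counts for the joint distribution of two categorical variables.
two_way = Counter((r["dept"], r["status"]) for r in rows)

# Small multiples, in table form: the same summary indexed by a categorical variable.
by_dept = {}
for r in rows:
    by_dept.setdefault(r["dept"], []).append(r["salary"])
per_dept = {dept: summarize(values) for dept, values in by_dept.items()}

print(overall)
for cell, count in sorted(two_way.items()):
    print(cell, count)
for dept, summary in per_dept.items():
    print(dept, summary)
```

In this toy example, the gap between the mean and the median points to the unusually large salary, while the two-way table isolates the misspelled status in a cell of its own; both would warrant a closer look.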

There is nothing wrong with running a number of analyses to flush out data issues, but remember to label these initial forays as preliminary analyses. From the client’s or stakeholder’s perspective, repeated analyses may create a sense of unease and distrust, even if they form a crucial part of the analytical process.

In our (admittedly biased and incomplete) experience,

  • computer scientists and programmers tend to naturally favour the methodical approach, while

  • mathematicians (and sometimes statisticians) tend to naturally favour the narrative approach,

although we have met plenty of individuals with unexpected backgrounds in both camps.

This is not the place for identity politics: data scientists, data analysts, and quantitative consultants alike need to be comfortable with both approaches.

As an analogy, the narrative approach is akin to working out a crossword puzzle with a pen, accepting that potentially erroneous answers will be put down once in a while in order to open up the grid (what artificial intelligence researchers call the “exploration” approach).

The methodical approach, on the other hand, is similar to working out the puzzle with a pencil and a dictionary, only putting down answers when their correctness is guaranteed (the “exploitation” approach of artificial intelligence).

More puzzles get solved when using the first approach, but missteps tend to be spectacular. Not as many puzzles get solved the second way, but the trade-off is that it leads to fewer mistakes.