8.1 Introduction

Martin K: Data is messy, Alison.
Alison M: Even after it’s been cleaned?
Martin K: Especially after it’s been cleaned.
(P. Boily, J. Schellinck, The Great Balancing Act [in progress]).

Data cleaning and data processing are essential aspects of quantitative analysis projects; analysts and consultants should be prepared to spend up to 80% of their time on data preparation, keeping in mind that:

  • processing should NEVER be done on the original dataset – make copies along the way;

  • ALL cleaning steps need to be documented;

  • if too much of the data requires cleaning up, the data collection procedure might need to be revisited; and

  • records should only be discarded as a last resort.
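
As a minimal sketch of how these guidelines might look in practice, the Python snippet below works on a copy of the data, records each cleaning step in a log, and flags a suspicious value as missing instead of discarding the record. The field name `height_cm`, the "non-positive heights are invalid" rule, and the helper `clean_records` are all hypothetical, chosen purely for illustration; real projects will need their own rules and will most likely rely on a dedicated data-analysis library.

```python
import copy


def clean_records(original, log):
    """Return a cleaned copy of `original`; the input itself is left untouched."""
    working = copy.deepcopy(original)  # process a copy, never the original dataset
    for i, record in enumerate(working):
        height = record.get("height_cm")
        # Hypothetical rule: a non-positive height is treated as missing.
        # The record is kept; only the offending value is blanked out.
        if height is not None and height <= 0:
            log.append(f"record {i}: height_cm={height} set to missing (non-positive)")
            record["height_cm"] = None
    return working


# Made-up toy data, for illustration only.
raw = [
    {"id": 1, "height_cm": 172.0},
    {"id": 2, "height_cm": -1.0},
    {"id": 3, "height_cm": 165.5},
]

cleaning_log = []  # every cleaning step is documented here
cleaned = clean_records(raw, cleaning_log)
```

After the call, `raw` still holds the value as collected, `cleaned` has the same number of records, and `cleaning_log` contains one entry describing what was changed and why.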

Cleaning and processing may also need to take place more than once, depending on the type of data collection (one pass, batch, or continuous), and it is essentially impossible to determine whether all data issues have been found and fixed.

Note: for this module, we are assuming that the datasets of interest contain only numerical and/or categorical observations. Additional steps must be taken when dealing with unstructured data, such as text or images (we’ll have more to say on this topic later).