11.6 Issues and Challenges
We all say we like data, but we don’t. We like getting insight out of data. That’s not quite the same as liking data itself. In fact, I dare say that I don’t quite care for data, and it sounds like I’m not alone. [Q.E. McCallum, Bad Data Handbook]
The data science landscape is littered with issues and challenges. We shall briefly discuss some of them in this section.
11.6.1 Bad Data
The main difficulties with data are that it is not always representative of the situation that we would like to model and that it might not be consistent (the collection and collation methods may have changed over time, say). There are other potential data issues:
the data might be formatted for human consumption, not machine readability;
the data might contain lies and mistakes;
the data might not reflect reality, and
there might be additional sources of bias and errors (not only imputation bias, but replacing extreme values with average values, proxy reporting, etc.).
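The bias introduced by replacing extreme values with average values can be seen in a small sketch. The measurements and the threshold of 50 below are hypothetical, chosen only for illustration; the point is simply that the "cleaned" data has a different centre and a much smaller spread than what was actually observed.

```python
from statistics import mean, stdev

# Hypothetical measurements; the last value is extreme but possibly genuine.
values = [10, 12, 11, 13, 12, 95]

def replace_extremes_with_mean(data, threshold):
    """Replace every value above `threshold` with the mean of the full dataset."""
    overall_mean = mean(data)
    return [overall_mean if x > threshold else x for x in data]

# The threshold is an assumption made for this example only.
cleaned = replace_extremes_with_mean(values, threshold=50)

print("mean before:", round(mean(values), 2), "after:", round(mean(cleaned), 2))
print("stdev before:", round(stdev(values), 2), "after:", round(stdev(cleaned), 2))
```

Both summaries shrink after the replacement, so any analysis run on the cleaned values would understate the variability of the original observations.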
Seeking perfection in the data beyond a “reasonable” threshold can hamper the efforts of analysts: different quality requirements exist for academic data, professional data, economic data, government data, military data, service data, commercial data, etc. It can be helpful to remember the engineering dictum: “close enough is good enough” (in terms of completeness, coherence, correctness, and accountability). The challenge lies in defining what is “close enough” for the application under consideration.
Even when all (most?) data issues have been mitigated, there remain a number of common data analysis pitfalls:
analyzing data without understanding the context;
using one and only one tool (by choice or by fiat) – neither the “cloud”, nor Big Data, nor Deep Learning, nor Artificial Intelligence will solve all of an organization’s problems;
analyzing data just for the sake of analysis;
having unrealistic expectations of data analysis/DS/ML/AI – in order to optimize the production of actionable insights from data, we must first recognize the methods’ domains of application and their limitations.
11.6.2 Overfitting and Underfitting
In a traditional statistical model, \(p\)-values and goodness-of-fit statistics are used to validate the model. But such statistics cannot always be computed for predictive data science models. Instead, we recognize a “good” model based on how well it performs on unseen data.
In practice, training sets and ML methods are used to search for rules and models that are generalizable to new data (or validation/testing sets).
Problems arise when knowledge that is gained from supervised learning does not generalize properly to new data. Ironically, this may occur if the rules or models fit the training set too well – in other words, the results are too specific to the training set (see Figure 11.37 for an illustration of overfitting and underfitting).
A simple example may elucidate further. Consider the following set of rules regarding hair colour among humans:
vague rule – some people have black hair, some have brown hair, some blond, and some red (this is obviously “true”, but too general to be useful for predictions);
reasonable rule – in populations of European descent, approximately 45% have black hair, 45% brown hair, 7% blond and 3% red;
overly specific rule – in every 10,000 individuals of European descent, we predict there are 46.32% with black hair, 47.27% with brown hair, 6.51% with blond hair, and 0.00% with red hair (this rule presumably emerges from redhead-free training data).
With the overly specific rule, we would predict that there are no redheads in populations of European descent, which is blatantly false. This rule is too specific to the particular training subset that was used to produce it.
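How easily can a redhead-free training subset arise? A rough sketch, assuming the 3% rate from the reasonable rule and independent random sampling (both are assumptions made for illustration; the sample composition below is hypothetical):

```python
from collections import Counter

def prob_no_redheads(sample_size, redhead_rate=0.03):
    """Probability that a simple random sample contains no redheads at all."""
    return (1 - redhead_rate) ** sample_size

def estimated_proportions(sample):
    """Proportion of each hair colour observed in a training sample."""
    counts = Counter(sample)
    return {colour: n / len(sample) for colour, n in counts.items()}

# Hypothetical training subset of 50 people that happens to contain no redheads.
training_sample = ["black"] * 23 + ["brown"] * 24 + ["blond"] * 3

rule = estimated_proportions(training_sample)
print(rule.get("red", 0.0))  # the learned rule gives red hair a share of 0

for n in (50, 500, 5000):
    print(n, prob_no_redheads(n))
```

Under these assumptions, a sample of 50 people misses redheads entirely a little more than one time in five, while the chance becomes negligible for samples in the thousands. A rule learned from the small subset inherits the quirk of that subset.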
More formally, underfitting and overfitting can be viewed as resulting from the level of model complexity (see Figure @(fig:uomc)).
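The link between model complexity and fit can be illustrated with a small experiment. The sketch below is not taken from the text: it uses a hand-written \(k\)-nearest-neighbours regressor on synthetic data (a noisy sine curve), where a small \(k\) gives a very flexible model and a large \(k\) gives a very rigid one. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN predictions on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

def make_data(n, rng):
    """Noisy observations of a sine curve (synthetic, for illustration only)."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

rng = random.Random(0)
train = make_data(60, rng)
test = make_data(60, rng)

# k = 1: most complex model; k = 60: least complex (predicts the overall mean).
for k in (1, 8, 60):
    print(k, round(mse(train, train, k), 3), round(mse(train, test, k), 3))
```

With \(k=1\) the model reproduces the training set exactly (zero training error) yet still makes errors on the test set, which is the signature of overfitting. With \(k=60\) every prediction is the mean of the training targets, so the model errs on both sets, which is underfitting. An intermediate value of \(k\) typically gives the best performance on unseen data.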
Underfitting can be overcome by using more complex models (or models that use more of a dataset’s variables). Overfitting, on the other hand, can be overcome in several ways:
using multiple training sets (ensemble learning approaches), with overlap being allowed – this has the effect of reducing the odds of finding spurious patterns based on quirks of the training data;
using larger training sets may also remove signal which is too specific to a small training set – a 70%-30% split is often suggested, and
using simpler models (or models that use a dataset with a reduced number of variables as input).
When using multiple training sets, the size of the dataset may also affect the suggested strategy: when faced with
small datasets (fewer than a few hundred observations, say, but that depends on numerous factors such as computer power and number of tasks), use 100-200 repetitions of a bootstrap procedure;
large datasets, use a few repetitions of a holdout split (70%-30%, say).
No matter which strategy is eventually selected, the machine learning approach requires ALL models to be evaluated on unseen data.
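The two resampling strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the data is synthetic, the model is a one-variable least-squares line, the bootstrap evaluates each fit on the rows left out of the resample, and the holdout uses a training fraction of 0.7. Function names are hypothetical.

```python
import random

def fit_line(data):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mse(model, data):
    """Mean squared error of the fitted line on `data`."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in data) / len(data)

def bootstrap_error(data, repetitions, rng):
    """Train on resamples drawn with replacement; test on the left-out rows."""
    errors = []
    for _ in range(repetitions):
        picked = [rng.randrange(len(data)) for _ in data]
        chosen = set(picked)
        unseen = [row for i, row in enumerate(data) if i not in chosen]
        if unseen:  # skip the (very unlikely) resample that covers every row
            errors.append(mse(fit_line([data[i] for i in picked]), unseen))
    return sum(errors) / len(errors)

def holdout_error(data, repetitions, rng, train_fraction=0.7):
    """Shuffle, train on the first part, test on the held-out remainder."""
    errors = []
    cut = int(train_fraction * len(data))
    for _ in range(repetitions):
        shuffled = rng.sample(data, len(data))
        errors.append(mse(fit_line(shuffled[:cut]), shuffled[cut:]))
    return sum(errors) / len(errors)

rng = random.Random(1)
# Synthetic data: y = 2x + 1 plus Gaussian noise with unit variance.
small = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(100)]
large = [(x / 10, 2 * (x / 10) + 1 + rng.gauss(0, 1)) for x in range(5000)]

print("bootstrap, small dataset:", round(bootstrap_error(small, 200, rng), 3))
print("holdout, large dataset:", round(holdout_error(large, 3, rng), 3))
```

In both functions the model is always scored on rows that played no part in fitting it, so the reported error estimates performance on unseen data rather than on the training set.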
11.6.3 Appropriateness and Transferability
Data science models will continue to be used heavily in the near future; while there are pros and cons to their use on ethical and other non-technical grounds, their applicability is also driven by technical considerations. DS/ML/AI methods are not appropriate if:
existing (legacy) datasets absolutely must be used instead of ideal/appropriate datasets;
the dataset has attributes that usefully predict a value of interest, but these attributes are not available when a prediction is required (e.g. the total time spent on a website may be predictive of a visitor’s future purchases, but the prediction must be made before the total time spent on the website is known);
class membership or numerical outcome is going to be predicted using an unsupervised learning algorithm (e.g. clustering loan default data might lead to a cluster that contains many defaulters – if new instances get added to this cluster, should they automatically be viewed as loan defaulters?).
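The second condition above (predictive attributes that are unavailable at prediction time) can be checked mechanically before any modelling starts. A minimal sketch, in which the feature names and the availability flags are hypothetical and would have to come from knowledge of how the data is collected:

```python
# Hypothetical feature catalogue for a web-purchase model: each entry records
# whether the attribute is already known when the prediction has to be made.
features = {
    "referrer_site": True,
    "device_type": True,
    "pages_viewed_so_far": True,
    "total_time_on_site": False,  # only known once the visit is over
}

def usable_features(catalogue):
    """Keep only the attributes available at prediction time."""
    return [name for name, available in catalogue.items() if available]

print(usable_features(features))
```

Training only on the retained attributes avoids building a model that looks accurate on historical data but cannot be applied when a prediction is actually required.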
Every model makes certain assumptions about what is and is not relevant to its workings, but there is a tendency to only gather data which is assumed to be relevant to a particular situation.
If the data is used in other contexts, or to make predictions depending on attributes for which no data is available, then there might be no way to validate the results.
This is not as esoteric a consideration as it might seem: over-generalizations and inaccurate predictions can lead to harmful results.
11.6.4 Myths and Mistakes
We end this chapter by briefly repeating various data science myths, originally found in :
DS is about algorithms
DS is about predictive accuracy
DS requires a data warehouse
DS requires a large quantity of data
DS requires only technical experts
as well as common data analysis mistakes [same source]:
selecting the wrong problem
getting by without understanding the metadata
not planning the data analysis process
insufficient business/domain knowledge
using incompatible data analysis tools
using tools that are too specific
favouring aggregates over individual results
running out of time
measuring results differently than the client
naïvely believing what one is told about the data
It remains the analyst’s/consultant’s responsibility to address these issues with the stakeholders/clients, and the earlier, the better. We cannot assume that everyone is on the same page – prod and ask, early and often.