11.6 Issues and Challenges

We all say we like data, but we don’t. We like getting insight out of data. That’s not quite the same as liking data itself. In fact, I dare say that I don’t quite care for data, and it sounds like I’m not alone. [Q.E. McCallum, Bad Data Handbook, [234]]

The data science landscape is littered with issues and challenges. We shall briefly discuss some of them in this section.

11.6.1 Bad Data

The main difficulty with data is that it is not always representative of the situation that we would like to model, and that it might not be consistent (the collection and collation methods may have changed over time, say). There are other potential data issues [234]:

  • the data might be formatted for human consumption, not machine readability (see the sketch after this list);

  • the data might contain lies and mistakes;

  • the data might not reflect reality, and

  • there might be additional sources of bias and errors (not only imputation bias, but replacing extreme values with average values, proxy reporting, etc.).
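
To make the first of these issues concrete, the short base R sketch below coerces human-formatted entries – thousands separators, currency symbols, “N/A” placeholders – into machine-readable numbers. The revenue column, its values, and the clean_revenue() helper are invented for the illustration.

```r
# Hypothetical data, formatted for human readers rather than for machines.
raw <- data.frame(
  revenue = c("1,200", "N/A", "$3.5K", "950"),
  stringsAsFactors = FALSE
)

clean_revenue <- function(x) {
  x <- gsub("[,$]", "", x)                   # drop thousands separators and currency symbols
  mult <- ifelse(grepl("K$", x), 1000, 1)    # "3.5K" is shorthand for 3500
  x <- sub("K$", "", x)
  suppressWarnings(as.numeric(x)) * mult     # non-numeric placeholders ("N/A") become NA
}

raw$revenue_clean <- clean_revenue(raw$revenue)
raw
```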

Seeking perfection in the data beyond a “reasonable” threshold181 can hamper the efforts of analysts: different quality requirements exist for academic data, professional data, economic data, government data, military data, service data, commercial data, etc. It can be helpful to remember the engineering dictum: “close enough is good enough” (in terms of completeness, coherence, correctness, and accountability). The challenge lies in defining what is “close enough” for the application under consideration.

Even when all (most?) data issues have been mitigated, there remain a number of common data analysis pitfalls:

  • analyzing data without understanding the context;

  • using one and only one tool (by choice or by fiat) – neither the “cloud”, nor Big Data, nor Deep Learning, nor Artificial Intelligence will solve all of an organization’s problems;

  • analyzing data just for the sake of analysis;

  • having unrealistic expectations of data analysis/DS/ML/AI – in order to optimize the production of actionable insights from data, we must first recognize the methods’ domains of application and their limitations.

11.6.2 Overfitting/Underfitting

In a traditional statistical model, \(p\)-values and goodness-of-fit statistics are used to validate the model. But such statistics cannot always be computed for predictive data science models; instead, we recognize a “good” model based on how well it performs on unseen data.

In practice, training sets and ML methods are used to search for rules and models that are generalizable to new data (or validation/testing sets).

Problems arise when knowledge gained from supervised learning does not generalize properly to new data. Ironically, this may occur when the rules or models fit the training set too well – in other words, when the results are too specific to the training set (see Figure 11.37 for an illustration of overfitting and underfitting).


Figure 11.37: Illustration of underfitting (left) and overfitting (right) for a classification task – the optimal classifier (middle) might reach a compromise between accuracy and simplicity.

A simple example may elucidate further. Consider the following set of rules regarding hair colour among humans:

  • vague rule – some people have black hair, some have brown hair, some blond, and some red (this is obviously “true”, but too general to be useful for predictions);

  • reasonable rule – in populations of European descent, approximately 45% have black hair, 45% brown hair, 7% blond and 3% red;

  • overly specific rule – in every 10,000 individuals of European descent, we predict there are 46.32% with black hair, 47.27% with brown hair, 6.51% with blond hair, and 0.00% with red hair (this rule presumably emerges from redhead-free training data).

With the overly specific rule, we would predict that there are no redheads in populations of European descent, which is blatantly false. This rule is too specific to the particular training subset that was used to produce it.182

More formally, underfitting and overfitting can be viewed as resulting from the level of model complexity (see Figure 11.38).


Figure 11.38: Underfitting and overfitting as a function of model complexity; prediction error on the training sample (blue) and testing sample (red). High prediction error rates for simple models are a manifestation of underfitting; a large gap between the prediction error rates on the training and testing samples for complex models is a manifestation of overfitting. Ideally, model complexity is chosen to reach the situation’s ‘sweet spot’; fishing for the ideal scenario might diminish explanatory power (based on [2]).
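
The behaviour sketched in Figure 11.38 can be reproduced on simulated data. In the base R sketch below (the data-generating process and the polynomial model family are assumptions made purely for illustration), models of increasing complexity are fit under a holdout split, and the training and testing errors are reported.

```r
set.seed(42)
n <- 100
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.4)                  # true signal plus noise
dat <- data.frame(x = x, y = y)
train <- sample(n, 70)                            # 70%/30% holdout split

degrees <- 1:12                                   # model complexity = polynomial degree
errs <- t(sapply(degrees, function(d) {
  fit  <- lm(y ~ poly(x, d), data = dat[train, ])
  pred <- predict(fit, newdata = dat)
  c(train = mean((dat$y[train] - pred[train])^2),     # training error keeps dropping...
    test  = mean((dat$y[-train] - pred[-train])^2))   # ...but testing error eventually rises
}))

round(data.frame(degree = degrees, errs), 3)
```

In a real analysis, the same comparison would of course be carried out with the project’s actual model families and resampling strategy, not this toy setup.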

Underfitting can be overcome by using more complex models (or models that use more of a dataset’s variables). Overfitting, on the other hand, can be overcome in several ways:

  • using multiple training sets (ensemble learning approaches), with overlap being allowed – this has the effect of reducing the odds of finding spurious patterns based on quirks of the training data (see the bagging sketch after this list);

  • using larger training sets may also remove signal that is too specific to a small training set – a 70%/30% split is often suggested, and

  • using simpler models (or models that use a dataset with a reduced number of variables as input).
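
As a rough illustration of the first of these strategies, the base R code below builds a small bagging ensemble on simulated data: a deliberately flexible learner is fit to many overlapping bootstrap training sets and the predictions are averaged. The data, the polynomial degree, and the number of resamples are arbitrary choices made for the sketch.

```r
set.seed(1)
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.4)
dat  <- data.frame(x = x, y = y)
grid <- data.frame(x = seq(-3, 3, length.out = 50))

B <- 100
preds <- replicate(B, {
  idx <- sample(n, replace = TRUE)                # overlapping bootstrap training set
  fit <- lm(y ~ poly(x, 8), data = dat[idx, ])    # a deliberately flexible learner
  predict(fit, newdata = grid)
})

bagged <- rowMeans(preds)      # averaging washes out quirks of any single training set
head(cbind(grid, bagged = bagged))
```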

When using multiple training sets, the size of the dataset may also affect the suggested strategy: when faced with

  • small datasets (less than a few hundred observations, say, but that depends on numerous factors such as computing power and the number of tasks), use 100-200 repetitions of a bootstrap procedure [3] (see the sketch after this list);

  • average-sized datasets (less than a few thousand observations), use a few repetitions of 10-fold cross-validation [3], [74] (see Figure 11.39 for an illustration);

  • large datasets, use a few repetitions of a holdout split (70%/30%, say).
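
For the small-dataset case, one possible (purely illustrative) version of the bootstrap strategy in base R is sketched below; the simulated data, the degree-3 model, and the 200 repetitions are assumptions for the example, not recommendations.

```r
set.seed(7)
n <- 80                                            # a small dataset
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.4)
dat <- data.frame(x = x, y = y)

boot_err <- replicate(200, {                       # 100-200 repetitions
  idx <- sample(n, replace = TRUE)                 # bootstrap training set
  oob <- setdiff(seq_len(n), idx)                  # observations left out of the resample
  fit <- lm(y ~ poly(x, 3), data = dat[idx, ])
  mean((dat$y[oob] - predict(fit, newdata = dat[oob, ]))^2)
})

c(mean = mean(boot_err), sd = sd(boot_err))        # distribution of the performance metric
```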

No matter which strategy is eventually selected, the machine learning approach requires ALL models to be evaluated on unseen data.


Figure 11.39: Schematic illustration of repeated cross-validation, with 8 replicates and 4 folds; \(8\times 4=32\) models from a given family are built on various training sets (each consisting of \(3/4\) of the available data – the training folds). Model family performance is evaluated on the respective holdout folds; the distribution of the performance metrics (in practice, some combination of the mean/median and standard deviation) can be used to compare various model families (based on [74]).
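
The scheme of Figure 11.39 could be implemented along the following lines in base R (simulated data and an arbitrary model family are assumed): 8 replicates of 4-fold cross-validation yield \(8\times 4=32\) holdout-fold error estimates, whose mean and standard deviation can then be compared across model families.

```r
set.seed(13)
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.4)
dat <- data.frame(x = x, y = y)

replicates <- 8
k <- 4
cv_err <- unlist(lapply(seq_len(replicates), function(r) {
  folds <- sample(rep(seq_len(k), length.out = n))          # random fold assignment
  sapply(seq_len(k), function(f) {
    fit <- lm(y ~ poly(x, 3), data = dat[folds != f, ])     # train on 3/4 of the data
    mean((dat$y[folds == f] - predict(fit, newdata = dat[folds == f, ]))^2)
  })
}))

length(cv_err)                                    # 8 x 4 = 32 holdout-fold error estimates
c(mean = mean(cv_err), sd = sd(cv_err))           # summary used to compare model families
```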

These issues will be revisited in Regression and Value Estimation and in Spotlight on Classification.

11.6.3 Appropriateness and Transferability

Data science models will continue to be used heavily in the near future; while there are pros and cons to their use on ethical and other non-technical grounds, their applicability is also driven by technical considerations. DS/ML/AI methods are not appropriate if:

  • existing (legacy) datasets absolutely must be used instead of ideal/appropriate datasets;183

  • the dataset has attributes that usefully predict a value of interest, but these attributes are not available when a prediction is required (e.g. the total time spent on a website may be predictive of a visitor’s future purchases, but the prediction must be made before the total time spent on the website is known);

  • class membership or numerical outcome is going to be predicted using an unsupervised learning algorithm (e.g. clustering loan default data might lead to a cluster that contains many defaulters – if new instances get added to this cluster, should they automatically be viewed as loan defaulters? See the sketch below).
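
The last of these scenarios can be made concrete with a small base R sketch on invented loan data: \(k\)-means finds a cluster that is rich in defaulters, and a new applicant can be assigned to the nearest centroid, but nothing in the unsupervised procedure justifies the leap from “resembles that cluster” to “will default”. The variables, their values, and the new_applicant record are hypothetical.

```r
set.seed(3)
loans <- data.frame(                               # invented data, two obvious groups
  income = c(rnorm(50, 30, 5), rnorm(50, 70, 10)),
  debt   = c(rnorm(50, 40, 5), rnorm(50, 15, 5)),
  defaulted = rep(c(1, 0), each = 50)              # known outcomes, unused by k-means
)

km <- kmeans(loans[, c("income", "debt")], centers = 2, nstart = 20)
tapply(loans$defaulted, km$cluster, mean)          # one cluster contains many defaulters

# A new applicant is assigned to the nearest centroid; treating that cluster
# label as a default prediction is an extra (and risky) inferential leap.
new_applicant <- c(income = 32, debt = 38)
which.min(colSums((t(km$centers) - new_applicant)^2))
```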

Every model makes certain assumptions about what is and is not relevant to its workings, but there is a tendency to only gather data which is assumed to be relevant to a particular situation.

If the data is used in other contexts, or to make predictions depending on attributes for which no data is available, then there might be no way to validate the results.184

This is not as esoteric a consideration as it might seem: over-generalizations and inaccurate predictions can lead to harmful results.

11.6.4 Myths and Mistakes

We end this chapter by briefly repeating various data science myths, originally found in [235]:

  1. DS is about algorithms

  2. DS is about predictive accuracy

  3. DS requires a data warehouse

  4. DS requires a large quantity of data

  5. DS requires only technical experts

as well as common data analysis mistakes (also from [235]):

  1. selecting the wrong problem

  2. getting by without metadata understanding

  3. not planning the data analysis process

  4. insufficient business/domain knowledge

  5. using incompatible data analysis tools

  6. using tools that are too specific

  7. favouring aggregates over individual results

  8. running out of time

  9. measuring results differently than the client

  10. naïvely believing what one is told about the data

It remains the analyst’s/consultant’s responsibility to address these issues with the stakeholders/clients, and the earlier, the better. We cannot assume that everyone is on the same page – prod and ask, early and often.

References

[2]
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2008.
[3]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
[74]
L. Torgo, Data Mining with R, 2nd ed. CRC Press, 2016.
[234]
Q. E. McCallum, Bad Data Handbook. O’Reilly, 2013.
[235]
A. K. Maheshwari, Business Intelligence and Data Mining. Business Expert Press, 2015.