11.4 Classification and Value Estimation

The diversity of problems that can be addressed by classification algorithms is significant, and covers many domains. It is difficult to comprehensively discuss all the methods in a single book. [C.C. Aggarwal [6]]

11.4.1 Overview

The principles underlying classification, regression and class probability estimation are well-known and straightforward. Classification is a supervised learning endeavour in which a sample set of data (the training set) is used to determine rules and patterns that divide the data into predetermined groups, or classes. The training data usually consists of a randomly selected subset of the labeled data.173

In the testing phase, the model is used to assign a class to observations in the testing set, for which the label is hidden from the model even though it is actually known.

The performance of the predictive model is then evaluated by comparing the predicted and actual values for the testing set observations (but never using the training set observations). A number of technical issues need to be addressed in order to achieve optimal performance, among them: determining which features to select for inclusion in the model and, perhaps more importantly, which algorithm to choose.

The mushroom (classification) model from Data Science and Machine Learning Tasks provides a clean example of a classification model, albeit one for which no details regarding the training data and the choice of algorithm were made available.

Applications

Classification and value estimation models are among the most frequently used of the data science models, and form the backbone of what is also known as predictive analytics. There are applications and uses in just about every field of human endeavour, such as:

  • medicine and health science – predicting which patient is at risk of suffering a second, and this time fatal, heart attack within 30 days based on health factors (blood pressure, age, sinus problems, etc.);

  • social policies – predicting the likelihood of requiring assisted housing in old age based on demographic information/survey answers;

  • marketing/business – predicting which customers are likely to cancel their membership to a gym based on demographics and usage;

  • in general, predicting that an object belongs to a particular class, or organizing and grouping instances into categories, or

  • enhancing the detection of relevant objects:

    • avoidance – “this object is an incoming vehicle”;

    • pursuit – “this borrower is unlikely to default on her mortgage”;

    • degree – “this dog is 90% likely to live until it’s 7 years old”;

  • economics – predicting the inflation rate for the coming two years based on a number of economic indicators.

Other examples may be found in [210][213].

Concrete Examples

Some concrete examples may provide a clearer picture of the types of supervised learning problems that quantitative consultants may be called upon to solve.

  • A motor insurance company has a fraud investigation department that studies up to 20% of all claims made, yet money is still getting lost on fraudulent claims. To help better direct the investigators, management would like to determine, using past data, if it is possible to predict

    • whether a claim is likely to be fraudulent?

    • whether a customer is likely to commit fraud in the near future?

    • whether an application for a policy is likely to result in a fraudulent claim?

    • the amount by which a claim will be reduced if it is fraudulent?

  • Customers who make a large number of calls to a mobile phone company’s customer service number have been identified as churn risks. The company is interested in reducing said churn. Can they predict

    • the overall lifetime value of a customer?

    • which customers are more likely to churn in the near future?

    • what retention offer a particular customer will best respond to?


In every classification scenario, the following questions must be answered before embarking on analysis:

  • What kind of data is required?

  • How much of it?

  • What would constitute a predictive success?

  • What are the risks associated with a predictive modeling approach?

These have no one-size-fits-all answers. In the absence of testing data, classification models cannot be used for predictive tasks, but may still be useful for descriptive tasks.

When testing data exists, the process is often quite similar, independently of the choice of the algorithm (see the classification pipeline shown in Figure 11.8).


Figure 11.8: A classification pipeline, including training set, testing set, performance evaluation, and (eventual) deployment.
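In R, this pipeline only takes a few lines; the snippet below is a minimal sketch, assuming a hypothetical labeled data frame my_data with a categorical response column class, and using a decision tree as the classifier (any of the algorithms discussed in the next section could be swapped in).

```r
# Minimal sketch of the classification pipeline of Figure 11.8, assuming a
# hypothetical labeled data frame `my_data` with categorical response `class`.
library(rpart)

set.seed(0)                                    # for reproducibility
n     <- nrow(my_data)
idx   <- sample(n, size = round(0.7 * n))      # 70/30 training/testing split
train <- my_data[idx, ]
test  <- my_data[-idx, ]

# 1. train a classifier on the training set (here, a decision tree)
model <- rpart(class ~ ., data = train, method = "class")

# 2. predict the (hidden) labels of the testing set observations
pred <- predict(model, newdata = test, type = "class")

# 3. evaluate the performance by comparing predicted and actual labels
conf.mat <- table(predicted = pred, actual = test$class)
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
```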

11.4.2 Classification Algorithms

The number of classification algorithms is truly staggering – it often seems as though new algorithms and variants are put forward on a monthly basis, depending on the task and on the type of data [6].

While some of them tend to be rather esoteric, there is a fairly small number of commonly-used workhorse algorithms/approaches that data scientists and consultants should at least have at their command (full descriptions are available in [2], [74], [137]):

  • logistic regression and linear regression are classical models which are often used by statisticians but rarely in a classification setting (the estimated coefficients are often used to determine the features’ importance); one of their strengths is that the machinery of standard statistical theory (hypothesis testing, confidence intervals, etc.) is still available to the analyst, but they are easily affected by variance inflation in the presence of predictor multicollinearity, and the stepwise variable selection process that is typically used is problematic – regularization methods would be better suited in general [214] (see Figure 11.9 for illustrations);

  • neural networks have become popular recently due to the advent of deep learning; they might provide the prototypical example of a black box algorithm as they are hard to interpret; another issue is that they require a fair amount of data to train properly – we will have more to say on the topic in a later chapter;

  • decision trees are perhaps the most common of all data science algorithms, but they tend to overfit the data when they are not pruned correctly, a process which often has to be done manually (see Figure 11.9 for an illustration) – we shall discuss the pros and cons of decision trees in general in Decision Trees;

  • naïve Bayes classifiers have enjoyed quite a lot of success in text mining applications (more specifically as the basis of powerful spam filters), but, embarrassingly, no one is quite sure why they should work as well as they do, given that one of their required assumptions (conditional independence of the features, given the class) is rarely met in practice (see Figure 11.9 for an illustration);

  • support vector machines attempt to separate the dataset by “fitting” as wide a “tube” as possible through the classes (subject to a number of penalty constraints); they have also enjoyed successes, most notably in the field of digital handwriting recognition, but their decision boundaries (the tubes in question) tend to be non-linear and quite difficult to interpret; nevertheless, they may help mitigate some of the difficulties associated with big data (see Figure 11.10 for an illustration);

  • nearest neighbours classifiers basically implement a voting procedure and require very few assumptions about the data, but they are not very stable, as adding training points may substantially modify the decision boundary (see Figures 11.9 and 11.10 for illustrations).

Boosting methods [3], [215] and Bayesian methods ([26], [216], [217]) are also becoming more popular.
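At the code level, these workhorse methods are largely interchangeable. The following R sketch fits several of them to a hypothetical training set train (two-level factor response class, numeric predictors) and predicts on a testing set test; the rpart, e1071, class, and nnet packages are assumed to be installed.

```r
# Fitting several workhorse classifiers on hypothetical `train`/`test` sets
# with a two-level factor response `class` and numeric predictors.
library(rpart)    # decision trees
library(e1071)    # naive Bayes, support vector machines
library(class)    # k nearest neighbours
library(nnet)     # feed-forward neural networks (single hidden layer)

fit.logit <- glm(class ~ ., data = train, family = binomial)    # logistic regression
fit.tree  <- rpart(class ~ ., data = train, method = "class")   # decision tree
fit.nb    <- naiveBayes(class ~ ., data = train)                # naive Bayes
fit.svm   <- svm(class ~ ., data = train)                       # support vector machine
fit.nn    <- nnet(class ~ ., data = train, size = 5)            # neural network, 5 hidden units

# predictions on the testing set (interfaces vary slightly across packages)
p.logit <- ifelse(predict(fit.logit, test, type = "response") > 0.5,
                  levels(train$class)[2], levels(train$class)[1])
p.tree  <- predict(fit.tree, test, type = "class")
p.nb    <- predict(fit.nb, test)
p.svm   <- predict(fit.svm, test)
p.nn    <- predict(fit.nn, test, type = "class")
p.knn   <- knn(train[, -1], test[, -1], cl = train$class, k = 15)  # 15NN; assumes `class` is column 1
```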


Figure 11.9: Illustrations of various classifiers (linear regression, top left; optimal Bayes, top right; 1NN and 15NN, middle left and right, respectively, on an artificial dataset (from [2]); decision tree depicting the chances of survival for various disasters (fictional, based on [218])). Note that linear regression is more stable, simpler to describe, but less accurate than \(k\)NN and optimal Bayes.


Figure 11.10: Illustration of a \(k\) nearest neighbour classifier (left) and a support vector machine classifier (right, based on [137]). What is the 6NN prediction for the location marked by a question mark? What about the 19NN prediction?

11.4.3 Decision Trees

In order to highlight the relative simplicity of most classification algorithms, we will discuss the workings of ID3, a historically significant decision tree algorithm.174

Classification trees are perhaps the most intuitive of all supervised methods: classification is achieved by following a path up the tree, from its root, through its branches, and ending at its leaves (although the tree is typically depicted with its root at the top and its leaves at the bottom).

To make a prediction for a new instance, it suffices to follow the path down the tree, reading the prediction directly once a leaf is reached. It sounds simple enough in theory, but in practice, creating the tree and traversing it might be time-consuming if there are too many variables in the dataset (due to the criterion that is used to determine how the branches split).

Prediction accuracy can be a concern for trees whose growth is left unchecked. In practice, the criterion of purity at the leaf level (that is to say, all instances in a leaf belonging to the same class) is linked to poor prediction rates for new instances. Other criteria are often used to prune trees, which may lead to impure leaves.

How do we grow such trees?

For predictive purposes, we need a training set and a testing set upon which to evaluate the tree’s performance. Ross Quinlan’s Iterative Dichotomizer 3 (a precursor to the widely-used C4.5 and C5.0) follows a simple procedure:

  1. split the training data (parent) set into (children) subsets, using the different levels of a particular attribute;

  2. compute the information gain for each subset;

  3. select the most advantageous split, and

  4. repeat for each node until some leaf criterion is met.

Entropy is a measure of disorder in a set \(S\). Let \(p_i\) be the proportion of observations in \(S\) belonging to category \(i\), for \(i=1,\ldots, n\). The entropy of \(S\) is given by \[E(S)=-\sum_{i=1}^n p_i\log p_i\] (logarithms are taken in base 2, here and in the example below). If the parent set \(S\) consisting of \(m\) records is split into \(k\) children sets \(C_1,\ldots, C_k\) containing \(q_1, \ldots, q_k\) records, respectively, then the information gained from the split is \[\textrm{IG}(S;C_1,\ldots,C_k)=E(S)-\frac{1}{m}\sum_{j=1}^kq_jE(C_j).\] The second term in the information gain equation is a weighted average of the entropies of the children sets.

If the split leads to little disorder in the children, then \(\textrm{IG}(S;C_1,\ldots,C_k)\) is high; if the split leads to similar disorder in both children and parent, then \(\textrm{IG}(S;C_1,\ldots,C_k)\) is low.

Consider, as in Figure 11.11, two splits shown for a parent set with 30 observations separated into 2 classes: \(\circ\) and \(\star\).


Figure 11.11: Picking the optimal information gain split (taken from [137]).

Visually, it appears as though the binary split does a better job of separating the classes. Numerically, the entropy of the parent set \(S\) is \[\begin{aligned} E(S)&=-p_{\circ}\log p_{\circ} - p_{\star}\log p_{\star}\\&=-\frac{16}{30}\log \frac{16}{30} - \frac{14}{30}\log \frac{14}{30} \approx 0.99.\end{aligned}\] For the binary split (on the left), leading to the children sets \(L\) (left) and \(R\) (right), the respective entropies are \[E(L)=-\frac{12}{13}\log \frac{12}{13} - \frac{1}{13}\log \frac{1}{13} \approx 0.39\] and \[E(R)=-\frac{4}{17}\log \frac{4}{17} - \frac{13}{17}\log \frac{13}{17} \approx 0.79,\] so that the information gained by that split is \[\textrm{IG}(S;C_L,C_R)\approx 0.99-\frac{1}{30}\left[13\cdot 0.39+17\cdot 0.79\right]\approx 0.37.\]

On its own, this value is not substantially meaningful – it is only in comparison to the information gained from other splits that it becomes useful.

A similar computation for the ternary split leads to \(\textrm{IG}(S;C_1,C_2,C_3)\approx 0.13\), which is indeed smaller than the information gained by the binary split – of these two options, ID3 would select the first as being most advantageous.
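These computations are easily reproduced in R; the sketch below implements the entropy and information gain formulas (with base-2 logarithms) and verifies the binary split of Figure 11.11. The slight discrepancy with the value obtained above is due to the rounding of the intermediate entropies.

```r
# Entropy of a set, given its vector of class counts.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                      # convention: 0 * log(0) = 0
  -sum(p * log2(p))
}

# Information gain of a split of a parent set into children sets;
# `children` is a list of class-count vectors, one per child.
info.gain <- function(parent, children) {
  m <- sum(parent)
  entropy(parent) - sum(sapply(children, function(q) sum(q) * entropy(q))) / m
}

parent <- c(16, 14)                            # 16 circles, 14 stars
info.gain(parent, list(c(12, 1), c(4, 13)))    # binary split of Figure 11.11: ~0.38
```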

Decision trees have numerous strengths: they

  • are easy to interpret, providing, as they do, a white box model – predictions can always be explained by following the appropriate paths;

  • can handle numerical and categorical data simultaneously, without first having to “binarise” the data;

  • can be used with incomplete datasets, if needed (although there is still some value in imputing missing observations);

  • allow for built-in feature selection as less relevant features do not tend to be used as splitting features;

  • make no assumption about independence of observations, underlying distributions, multicollinearity, etc., and can thus be used without the need to verify assumptions;

  • lend themselves to statistical validation (in the form of cross-validation), and

  • are in line with human decision-making approaches, especially when such decisions are taken deliberately.

On the other hand, they are

  • not usually as accurate as other more complex algorithms, nor as robust, since small changes in the training data can lead to a completely different tree, with a completely different set of predictions (this can become problematic when presenting the results to a client whose understanding of these matters is slight);

  • particularly vulnerable to overfitting in the absence of pruning – and pruning procedures are typically fairly convoluted (some algorithms automate this process, using statistical tests to determine when a tree’s “full” growth has been achieved), and

  • biased towards categorical features with a high number of levels, which may give such variables undue importance in the classification process.

Information gain tends to grow small trees in its pursuit of pure leaves, but it is not the only splitting metric in use (Gini impurity, variance reduction, etc.).

Notes

ID3 is a precursor of C4.5, perhaps the most popular decision tree algorithm on the market. There are other tree algorithms, such as C5.0, CHAID, MARS, conditional inference trees, CART, etc., each with its own strengths and weaknesses.

Regression trees are grown in a similar fashion, but with a numerical response variable (predicted inflation rate, say), which introduces some complications [2], [3].

Decision trees can also be combined together using boosting algorithms (such as AdaBoost) or random forests, providing a type of voting procedure also known as ensemble learning – an individual tree might make middling predictions, but a large number of judiciously selected trees are likely to make good predictions, on average [2], [3], [215] – we will re-visit these concepts in a later chapter.

Additionally:

  • since classification is linked to probability estimation, approaches that extend the basic ideas of regression models could prove fruitful;

  • rare occurrences are often more interesting and more difficult to predict and identify than regular instances – historical data at Fukushima’s nuclear reactor prior to the 2011 meltdown could not have been used to learn about meltdowns, for obvious reasons;175

  • with big datasets, algorithms must also consider efficiency – thankfully, decision trees are easily parallelizable.

11.4.4 Performance Evaluation

As a consequence of the (so-called) No-Free-Lunch Theorem, no single classifier can be the best performer for every problem. Model selection must take into account:

  • the nature of the available data;

  • the relative frequencies of the classification sub-groups;

  • the stated classification goals;

  • how easily the model lends itself to interpretation and statistical analysis;

  • how much data preparation is required;

  • whether it can accommodate various data types and missing observations;

  • whether it performs well with large datasets, and

  • whether it is robust against small data departures from theoretical assumptions.

Past success is not a guarantee of future success – it is the analyst’s responsibility to try a variety of models. But how can the “best” model be selected?

When a classifier attempts to determine what kind of music a new customer would prefer, there is next to no cost in making a mistake; if, on the other hand, the classifier attempts to determine the presence or absence of cancerous cells in lung tissue, mistakes are more consequential. Several metrics can be used to assess a classifier’s performance, depending on the context.

Binary classifiers (presented in the abstract in Figure 11.12) are simpler and have been studied far longer than multi-level classifiers; consequently, a larger body of evaluation metrics is available for these classifiers.

In the medical literature, for instance, \(\textrm{TP}\), \(\textrm{TN}\), \(\textrm{FP}\) and \(\textrm{FN}\) stand for True Positives, True Negatives, False Positives, and False Negatives, respectively.


Figure 11.12: A general binary classifier.

A perfect classifier would be one for which \(\textrm{FP}=\textrm{FN}=0\), but in practice, this rarely (if ever) happens.

Traditional performance metrics include (writing \(\textrm{AP}=\textrm{TP}+\textrm{FN}\) and \(\textrm{AN}=\textrm{TN}+\textrm{FP}\) for the number of actual positives and negatives, \(\textrm{PP}=\textrm{TP}+\textrm{FP}\) and \(\textrm{PN}=\textrm{TN}+\textrm{FN}\) for the number of predicted positives and negatives, and \(\textrm{T}=\textrm{AP}+\textrm{AN}\) for the total number of observations; a short R sketch computing several of these follows the list):

  • sensitivity: \(\textrm{TP}/\textrm{AP}\)

  • specificity: \(\textrm{TN}/\textrm{AN}\)

  • precision: \(\textrm{TP}/\textrm{PP}\)

  • negative predictive value: \(\textrm{TN}/\textrm{PN}\)

  • false positive rate: \(\textrm{FP}/\textrm{AN}\)

  • false discovery rate: \(1-\textrm{TP}/\textrm{PP}\)

  • false negative rate: \(\textrm{FN}/\textrm{AP}\)

  • accuracy: \((\textrm{TP}+\textrm{TN})/\textrm{T}\)

  • \(F_1-\)score: \(2\textrm{TP}/(2\textrm{TP}+\textrm{FP}+\textrm{FN})\)

  • MCC: \(\frac{\textrm{TP}\cdot \textrm{TN}-\textrm{FP}\cdot \textrm{FN}}{\sqrt{\textrm{AP}\cdot \textrm{AN}\cdot \textrm{PP} \cdot \textrm{PN}}}\)

  • informedness/ROC: \(\textrm{TP}/\textrm{AP}+\textrm{TN}/\textrm{AN}-1\)

  • markedness: \(\textrm{TP}/\textrm{PP}+\textrm{TN}/\textrm{PN}-1\)
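Most of these metrics are simple arithmetic functions of the confusion matrix entries, as the following R sketch (using hypothetical counts) illustrates.

```r
# Selected binary classification metrics, computed from hypothetical counts.
TP <- 40; FN <- 10                 # actual positives:  AP = TP + FN
TN <- 35; FP <- 15                 # actual negatives:  AN = TN + FP

AP  <- TP + FN; AN <- TN + FP      # actual positives/negatives
PP  <- TP + FP; PN <- TN + FN      # predicted positives/negatives
Tot <- AP + AN                     # total number of observations

sensitivity  <- TP / AP
specificity  <- TN / AN
precision    <- TP / PP
accuracy     <- (TP + TN) / Tot
F1           <- 2 * TP / (2 * TP + FP + FN)
MCC          <- (TP * TN - FP * FN) / sqrt(AP * AN * PP * PN)
informedness <- TP / AP + TN / AN - 1
markedness   <- TP / PP + TN / PN - 1
```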

The confusion matrices of two artificial binary classifiers for a testing set are shown in Figure 11.13.


Figure 11.13: Performance metrics for two (artificial) binary classifiers.

Both classifiers have an accuracy of \(80\%\), but while the second classifier sometimes makes a wrong prediction for \(A\), it never does so for \(B\), whereas the first classifier makes erroneous predictions for both \(A\) and \(B\).

On the other hand, the second classifier mistakenly predicts occurrence \(A\) as \(B\) 16 times while the first one only does so 6 times. So which one is best?

The performance metrics alone do not suffice to answer the question: the cost associated with making a mistake must also be factored in. Furthermore, it could be preferable to select performance evaluation metrics that generalize more readily to multi-level classifiers (see Figure 11.14 for examples of associated confusion matrices).


Figure 11.14: Performance metrics for (artificial) multi-level classifiers: ternary - left; senary - right (personal files).

The accuracy is the proportion of correct predictions amid all the observations; its value ranges from 0% to 100%. The higher the accuracy, the better the match, and yet, a predictive model with high accuracy may nevertheless be useless due to the Accuracy Paradox (see the rare occurrence footnote).

The Matthews Correlation Coefficient (MCC), on the other hand, is a statistic which is of use even when the classes are of very different sizes. As a correlation coefficient between the actual and predicted classifications, its range varies from \(−1\) to \(1\).

If \(\textrm{MCC}=1\), the predicted and actual responses are identical, while if \(\textrm{MCC}=0\), the classifier performs no better than a random prediction (“flip of a coin”).

It is also possible to introduce two non-traditional performance metrics (which are nevertheless well-known statistical quantities) to describe how accurately a classifier preserves the classification distribution (rather than how it behaves on an observation-by-observation basis):

  • Pearson’s \(\chi^2\): \(\frac{1}{\textrm{T}}\left((\textrm{PP}-\textrm{AP})^2/\textrm{PP} + (\textrm{PN}-\textrm{AN})^2/\textrm{PN}\right)\)

  • Hist: \(\frac{1}{\textrm{T}}\left(\left|\textrm{PP}-\textrm{AP}\right|+\left|\textrm{PN}-\textrm{AN}\right|\right)\)

Note, however, that these are non-standard performance metrics. For a given number of levels, the smaller these quantities, the more similar the actual and predicted distributions.
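In the notation of the previous sketch, these two quantities are one-liners in R (using the same hypothetical counts).

```r
# Distribution-similarity metrics, with the same hypothetical counts as above
# (TP = 40, FN = 10, TN = 35, FP = 15).
AP <- 50; AN <- 50; PP <- 55; PN <- 45; Tot <- AP + AN

pearson.chi2 <- ((PP - AP)^2 / PP + (PN - AN)^2 / PN) / Tot
hist.metric  <- (abs(PP - AP) + abs(PN - AN)) / Tot
```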

For numerical targets \(y\) with predictions \(\hat{y}\), the confusion matrix is not defined, but a number of classical performance evaluation metrics can be used on the testing set (a short R sketch computing them appears after Figure 11.15): the

  • mean squared and mean absolute errors \[\textrm{MSE}=\textrm{mean}\left\{(y_i-\hat{y}_i)^2\right\}, \quad \textrm{MAE}=\textrm{mean}\{|y_i-\hat{y}_i|\};\]

  • normalized mean squared/mean absolute errors \[\begin{aligned} \textrm{NMSE}&=\frac{\textrm{mean}\left\{(y_i-\hat{y}_i)^2\right\}}{\textrm{mean}\left\{(y_i-\overline{y})^2\right\}}, \\ \textrm{NMAE}&=\frac{\textrm{mean}\left\{|y_i-\hat{y}_i|\right\}}{\textrm{mean}\left\{|y_i-\overline{y}|\right\}};\end{aligned}\]

  • mean absolute percentage error \[\textrm{MAPE}=\textrm{mean}\left\{\frac{|y_i-\hat{y}_i|}{y_i}\right\};\]

  • correlation \(\rho_{y,\hat{y}}\), which is based on the notion that, for good models, the predicted values and the actual values should congregate around the line \(y=\hat{y}\) (see Figure 11.15 for an illustration).


Figure 11.15: Predicted and actual numerical responses (personal file).
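These value estimation metrics are equally simple to compute; the R sketch below does so for short hypothetical vectors of actual (y) and predicted (y.hat) testing-set responses.

```r
# Value estimation metrics for hypothetical actual (`y`) and predicted
# (`y.hat`) testing-set responses; `y` must be non-zero for the MAPE.
set.seed(0)
y     <- c(3.1, 4.8, 2.2, 5.0, 3.7, 4.1)       # hypothetical actual values
y.hat <- y + rnorm(length(y), sd = 0.5)        # hypothetical predictions

MSE  <- mean((y - y.hat)^2)
MAE  <- mean(abs(y - y.hat))
NMSE <- mean((y - y.hat)^2) / mean((y - mean(y))^2)
NMAE <- mean(abs(y - y.hat)) / mean(abs(y - mean(y)))
MAPE <- mean(abs(y - y.hat) / y)
rho  <- cor(y, y.hat)                          # correlation between actual and predicted
```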

As is the case for classification, an isolated value estimation performance metric does not provide enough of a rationale for model validation/selection. One possible exception: normalized evaluation metrics do provide some information about the relative quality of performance (see [2], [3]).

11.4.5 Case Study: Minnesota Tax Audit

Large gaps between revenue owed (in theory) and revenue collected (in practice) are problematic for governments. Revenue agencies implement various fraud detection strategies (such as audit reviews) to bridge that gap.

Since business audits are rather costly, there is a definite need for algorithms that can predict whether an audit is likely to be successful or a waste of resources.

In Data Mining Based Tax Audit Selection: A Case Study of a Pilot Project at the Minnesota Department of Revenue [127], Hsu et al. study the Minnesota Department of Revenue’s (DOR) tax audit selection process with the help of classification algorithms.

Objective

The U.S. Internal Revenue Service (IRS) estimated that there were large gaps between revenue owed and revenue collected for 2001 and for 2006. Using DOR data, the authors sought to increase efficiency in the audit selection process and to reduce the gap between revenue owed and revenue collected.

Methodology

The authors took the following steps:

  1. data selection and separation: experts selected several hundred cases to audit and divided them into training, testing and validating sets;

  2. classification modeling using MultiBoosting, Naïve Bayes, C4.5 decision trees, multilayer perceptrons, support vector machines, etc;

  3. evaluation of all models was achieved by testing the model on the testing set – models originally performed poorly on the testing set until the size of the business being audited was recognized to have an effect, leading to two separate tasks (large and small businesses);

  4. model selection and validation was done by comparing the estimated accuracy between different classification model predictions and the actual field audits. Ultimately, MultiBoosting with Naïve Bayes was selected as the final model; the combination also suggested some improvements to increase audit efficiency.

Data

The data consisted of selected tax audit cases from 2004 to 2007, collected by the audit experts, which were split into training, testing and validation sets:

  • the training data set consisted of Audit Plan General (APGEN) Use Tax audits and their results for the years 2004-2006;

  • the testing data consisted of APGEN Use Tax audits conducted in 2007 and was used to test or evaluate models (for Large and Smaller businesses) built on the training dataset,

  • while validation was assessed by actually conducting field audits on predictions made by models built on 2007 Use Tax return data processed in 2008.

None of the sets had records in common (see Figure 11.16).


Figure 11.16: Data sources for APGEN mining [127]. Note the 6 final sets which feed the Data Analysis component.

Strengths and Limitations of Algorithms

  • The Naïve Bayes classification scheme assumes independence of the features, which rarely occurs in real-world situations. This approach is also known to potentially introduce bias to classification schemes. In spite of this, classification models built using Naïve Bayes have a successful track record.

  • MultiBoosting is an ensemble technique that uses committees (i.e., groups of classification models) and “group wisdom” to make predictions; it differs from other ensemble techniques in that it forms a committee of sub-committees (i.e., a group of groups of classification models), which has a tendency to reduce both the bias and the variance of predictions (see [6], [215] for more information on these topics).

Procedures

Classification schemes need a response variable for prediction: audits which yielded more than $500 per year in revenues during the audit period were classified as Good; the others were Bad. The various models were tested and evaluated by comparing the performances of the manual audits (which yield the actual revenue) and the classification models (the predicted classification).

The procedure for manual audit selection in the early stages of the study required:

  1. DOR experts selecting several thousand potential cases through a query;

  2. DOR experts further selecting several hundreds of these cases to audit;

  3. DOR auditors actually auditing the cases, and

  4. calculating audit accuracy and return on investment (ROI) using the audit results.

Once the ROIs were available, data mining started in earnest. The steps involved were:

  1. Splitting the data into training, testing, and validating sets.

  2. Cleaning the training data by removing “bad” cases.

  3. Building (and revising) classification models on the training dataset. The first iteration of this step introduced a separation of models for larger businesses and relatively smaller businesses according to their average annual withholding amounts (the threshold value that was used is not revealed in [127]).

  4. Selecting separate modeling features for the APGEN Large and Small training sets. The feature selection process is shown in Figure 11.17.

  5. Building classification models on the training dataset for the two separate classes of businesses (using C4.5, Naïve Bayes, multilayer perceptrons, support vector machines, etc.), and assessing the classifiers using precision and recall with improved estimated ROI: \[\text{Efficiency} = \text{ROI} = \frac{\text{Total revenue generated}}{\text{Total collection cost}}\]


Figure 11.17: Feature selection process [127]. Note the involvement of domain experts.

Results, Evaluation and Validation

The models that were eventually selected were combinations of MultiBoosting and Naïve Bayes (C4.5 produced interpretable results, but its performance was shaky). For APGEN Large (2007), experts had put forward 878 cases for audit (495 of which proved successful), while the classification model suggested 534 audits (386 of which proved successful). The theoretical best process would find 495 successful audits in 495 audits performed, while the manual audit selection process needed 878 audits in order to reach the same number of successful audits.

For APGEN Small (2007), 473 cases were recommended for audit by experts (only 99 of which proved successful); in contrast, 47 out of the 140 cases selected by the classification model were successful. The theoretical best process would find 99 successful audits in 99 audits performed, while the manual audit selection process needed 473 audits in order to reach the same number of successful audits.

In both cases, the classification model improves on the manual audit process: roughly 685 data mining audits would be required to reach 495 successful audits of APGEN Large (2007), and 295 to reach 99 successful audits of APGEN Small (2007), as can be seen in Figure 11.18.


Figure 11.18: Audit resource deployment efficiency [127]. Top: APGEN Large (2007). Bottom: APGEN Small (2007). In both cases, the Data Mining approach was more efficient (the slope of the Data Mining vector is ‘closer’ to the Theoretical Best vector than is the Manual Audit vector).

Figure 11.19 presents the confusion matrices for the classification model on both the APGEN Large and Small 2007 datasets.


Figure 11.19: Confusion matrices for audit evaluation [127]. Top: APGEN Large (2007). Bottom: APGEN Small (2007). \(R\) stands for revenues, \(C\) for collection costs.

The revenue \(R\) and collection cost \(C\) entries can be read as follows: the 47 successful audits which were correctly identified by the model for APGEN Small (2007) correspond to cases consuming 9.9% of collection costs but generating 42.5% of the revenues. Similarly, the 281 bad audits correctly predicted by the model represent notable collection cost savings. These are associated with 59.4% of collection costs but they generate only 11.1% of the revenues.

Once the testing phase of the study was completed, the DOR validated the data mining-based approach by using the models to select cases for actual field audits in a real audit project. The prior success rate of audits for APGEN Use tax data was 39% while the model was predicting a success rate of 56%; the actual field success rate was 51%.

Take-Aways

A substantial number of models were churned out before the team made a final selection. Past performance of a model family in a previous project can be used as a guide, but it provides no guarantee regarding its performance on the current data – remember the No Free Lunch (NFL) Theorem [219]: nothing works best all the time!

There is a definite iterative feel to this project: the feature selection process could very well require a number of visits to domain experts before the feature set yields promising results. This is a valuable reminder that the data analysis team should seek out individuals with a good understanding of both the data and the context. Another consequence of the NFL is that domain-specific knowledge has to be integrated in the model in order to beat random classifiers, on average [220].

Finally, this project provides an excellent illustration that even slight improvements over the current approach can find a useful place in an organization – data science is not solely about Big Data and disruption!

11.4.6 Toy Example: Kyphosis Dataset

As a basic illustration of these concepts, consider the following example. Kyphosis is a medical condition related to an excessive convex curvature of the spine. Corrective spinal surgery is at times performed on children.

A dataset of 81 observations and 4 attributes has been collected (we have no information on how the data was collected and how representative it is likely to be, but those details can be gathered from [221]).

The attributes are:

  • kyphosis (absent or present after presentation)

  • age (at time of operation, in months)

  • number (of vertebrae involved)

  • start (topmost vertebra operated on)

The natural question of interest for this dataset is:

“how do the three explanatory attributes impact the operation’s success?”

We use the rpart implementation of Classification and Regression Tree (CART) in R to generate a decision tree. Strictly speaking, this is not a predictive supervised task as we treat the entire dataset as a training set for the time being – there are no hold-out testing observations.
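A tree along the lines of Figure 11.20 can be grown as follows (a minimal sketch: the kyphosis data frame ships with rpart, but the control parameters and plotting choices used to produce the figure may differ; the rpart.plot package is assumed to be installed).

```r
# Growing a classification tree on the kyphosis data (included with rpart);
# the settings used to produce Figure 11.20 may differ from the defaults.
library(rpart)
library(rpart.plot)       # assumed installed, for a nicer tree display

data(kyphosis)
tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

rpart.plot(tree)          # visualize the tree
summary(tree)             # splits and variable importance
```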

The results are shown in Figure 11.20. Interestingly, it would appear that the variable number does not play a role in determining the success of the operation (for the observations in the dataset).


Figure 11.20: Kyphosis decision tree visualization. Only two features are used to construct the tree. We also note that the leaves are not pure – there are blue and red instances in 3 of the 5 classification regions.

Furthermore, the decision tree visualization certainly indicates that its leaves are not pure (see Figure 11.21). Some additional work suggests that the tree is somewhat overgrown and that it could benefit from being pruned after the first branching point.


Figure 11.21: Pruning a decision tree – the original tree (left) is more accurate/more complex than the pruned tree (right).
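In rpart, pruning is typically driven by the complexity parameter cp; continuing the previous sketch, a minimal illustration (the threshold below is arbitrary, not necessarily the one used to produce Figure 11.21):

```r
# Pruning the kyphosis tree via rpart's complexity parameter (continuing the
# previous sketch); the cp threshold below is illustrative only.
printcp(tree)                       # cross-validated error for each value of cp
pruned <- prune(tree, cp = 0.05)    # snip splits whose complexity improvement is below cp
rpart.plot(pruned)
```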

At any rate, it remains meaningless to discuss the performance of the tree for predictive purposes if we are not using a holdout testing sample (to say nothing of the hope of generalizing to a larger population).

To that end, we trained a model on 50 randomly selected observations and evaluated the performance on the remaining 31 observations (the structure of the tree is not really important at this stage). The results are shown in Figure 11.22. Is the model “good”?
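The hold-out experiment can be replicated along the following lines (a sketch only: the seed is arbitrary, so the resulting confusion matrix will not match Figure 11.22 exactly).

```r
# Hold-out evaluation on the kyphosis data: 50 training / 31 testing observations.
library(rpart)
data(kyphosis)

set.seed(1234)                                    # arbitrary seed; results will vary
idx   <- sample(nrow(kyphosis), 50)
train <- kyphosis[idx, ]
test  <- kyphosis[-idx, ]

tree <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class")
pred <- predict(tree, newdata = test, type = "class")

(conf <- table(predicted = pred, actual = test$Kyphosis))
sum(diag(conf)) / sum(conf)                       # accuracy on the testing set
```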


Figure 11.22: Kyphosis decision tree – performance evaluation. The accuracy and \(F_1\) score are good, but the false discovery rate and false negative rate are not so great. This tree is good at predicting successful surgeries, but not fantastic at predicting failed surgeries. Is it still useful?

It is difficult to answer this question in the machine learning sense without being able to compare its performance metrics with those of other models (or families of models).176

In Model Selection, we will briefly discuss how to estimate a model’s true predictive error rate through cross-validation. We will also discuss a number of other issues that can arise when ML/AI methods are not used correctly.

We show how to obtain these decision trees via R in Classification: Kyphosis Dataset.

References

[2]
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2008.
[3]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
[6]
C. C. Aggarwal, Ed., Data Classification: Algorithms and Applications. CRC Press, 2015.
[26]
E. Ghashim and P. Boily, “A Soft Introduction to Bayesian Data Analysis,” Data Science Report Series, 2020.
[74]
L. Torgo, Data Mining with R, 2nd ed. CRC Press, 2016.
[127]
K.-W. Hsu, N. Pathak, J. Srivastava, G. Tschida, and E. Bjorklund, “Data Mining Based Tax Audit Selection: A Case Study of a Pilot Project at the Minnesota Department of Revenue,” in Real World Data Mining Applications, Cham: Springer International Publishing, 2015, pp. 221–245. doi: 10.1007/978-3-319-07812-0_12.
[137]
F. Provost and T. Fawcett, Data Science for Business. O’Reilly, 2015.
[210]
B. Kitts et al., “Click fraud detection: Adversarial pattern recognition over 5 years at Microsoft,” in Annals of information systems (special issue on data mining in real-world applications), Springer, 2015, pp. 181–201. doi: 10.1007/978-3-319-07812-0.
[213]
B. Kitts, “Product targeting from rare events: Five years of one-to-one marketing at CPI,” Marketing Science Conference, 2005.
[214]
T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The LASSO and Generalizations. CRC Press, 2015.
[215]
O. Leduc and P. Boily, “Boosting with AdaBoost and gradient boosting,” Data Action Lab Blog, 2019.
[216]
C. F. Robert, Le choix bayésien - principes et pratique. Springer-Verlag France, 2006.
[217]
B. Efron, Large Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, 2010.
[218]
A. Ng and K. Soo, Eds., Surviving a Disaster, in Numsense! algobeans, 2016.
[219]
D. H. Wolpert, “The lack of a priori distinctions between learning algorithms,” Neural Computation, vol. 8, no. 7, pp. 1341–1390, 1996, doi: 10.1162/neco.1996.8.7.1341.
[220]
D. H. Wolpert and W. G. Macready, “Coevolutionary free lunches,” IEEE Transactions on Evolutionary Computation, vol. 9, no. 6, pp. 721–735, 2005, doi: 10.1109/TEVC.2005.856205.
[221]
J. Chambers and T. Hastie, Statistical Models in S. Wadsworth; Brooks/Cole, 1992.