11.2 Statistical Learning
We learn from failure, not from success! [B. Stoker, Dracula]
As humans, we learn (at all stages) by first taking in our environment, and then by:
answering questions about it;
creating categories; and
classifying and grouping its various objects and attributes.
In a way, the main concept of DS/ML/AI is to try to teach our machines (and thus, ultimately, ourselves) to glean insight from data, and to do so properly and efficiently, free of biases and preconceived notions – in other words, can we design algorithms that can learn?165
In that context, the simplest DS/ML/AI method is exploring the data (or a representative sample) to:
provide a summary through basic statistics – mean, mode, histograms, etc.;
make its multi-dimensional structure evident through data visualization; and
look for consistency, considering what is in there and what is missing.
11.2.1 Types of Learning
In the data science context, more sophisticated approaches traditionally fall into a supervised or an unsupervised learning framework.
Supervised learning is akin to “learning with a teacher.” Typical tasks include classification, regression, rankings, and recommendations.
In supervised learning, algorithms use labeled training data to build (or train) a predictive model (i.e. “students give an answer to each exam question based on what they learned from worked-out examples provided by the teacher/textbook”); each algorithm’s performance is evaluated using test data for which the label is known but not used in the prediction (i.e. “the teacher provides the correct answers and marks the exam questions using the key”).
In supervised learning, there are fixed targets against which to train the model (such as age categories, or plant species) – the categories (and their number) are known prior to the analysis.
Unsupervised learning, on the other hand, is akin to “self-learning by grouping similar exercises together as a study guide.” Typical tasks include clustering, association rules discovery, link profiling, and anomaly detection. Unsupervised algorithms use unlabeled data to find natural patterns in the data (i.e. “the teacher is not involved in the discovery process”); the drawback is that accuracy cannot be evaluated with the same degree of satisfaction (i.e. “students might end up with different groupings”).
In unsupervised learning, we don’t know what the target is, or even if there is one – we are simply looking for natural groups in the data (such as junior students who like literature, have longish hair, and know how to cook vs. students who are on a sports team and have siblings vs. financial professionals with a penchant for superhero movies, craft beer, and Hello Kitty backpacks vs. … ).
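The contrast between the two frameworks can be sketched in a few lines of code. In the toy example below (pure Python; the one-dimensional “measurements” and the labels “small”/“large” are invented for illustration), a nearest-neighbour rule learns from labelled examples, while a simple two-centre grouping procedure finds the same structure without ever seeing a label.

```python
def nearest_label(x, training):
    """Supervised: predict the label of the closest labelled example."""
    closest = min(training, key=lambda pair: abs(pair[0] - x))
    return closest[1]

def two_groups(points):
    """Unsupervised: split unlabelled points into two natural groups by
    repeatedly re-assigning each point to the nearer of two centres."""
    lo, hi = min(points), max(points)
    for _ in range(10):                      # a few refinement passes
        a = [p for p in points if abs(p - lo) <= abs(p - hi)]
        b = [p for p in points if abs(p - lo) > abs(p - hi)]
        lo, hi = sum(a) / len(a), sum(b) / len(b)
    return sorted(a), sorted(b)

# Supervised: labelled training data ("the teacher provides the answers").
train = [(1.0, "small"), (1.2, "small"), (5.8, "large"), (6.1, "large")]
print(nearest_label(1.1, train))   # -> small
print(nearest_label(6.0, train))   # -> large

# Unsupervised: the same measurements, labels withheld.
print(two_groups([1.0, 1.2, 5.8, 6.1]))  # -> ([1.0, 1.2], [5.8, 6.1])
```

The supervised routine needs the labels to make a prediction; the unsupervised one recovers the two groups on its own, but nothing in its output says what the groups *mean* – which is exactly the evaluation drawback mentioned above.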
Other Learning Frameworks
Some data science techniques fit into both camps; others can be either supervised or unsupervised, depending on how they are applied, but there are other conceptual approaches, especially for AI tasks:
semi-supervised learning, in which some data points have labels but most do not, which often occurs when labelling data is costly (“the teacher provides worked-out examples and a list of unsolved problems to try out; the students try to find similar groups of unsolved problems and compare them with the solved problems to find close matches”), and
reinforcement learning, where an agent attempts to collect as much (short-term) reward as possible while minimizing (long-term) regret (“embarking on a Ph.D. with an advisor… with all the highs and the lows and maybe a diploma at the end of the process?”).
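The reward/regret trade-off of reinforcement learning can be illustrated with the classic two-armed bandit: an agent repeatedly pulls one of two slot-machine arms, balancing exploration (trying a random arm) against exploitation (pulling the arm with the best average payout so far). The payout probabilities and the ε-greedy strategy below are invented for illustration; they are a sketch of the idea, not of any particular algorithm from the text.

```python
import random

def epsilon_greedy_bandit(probs, steps=2000, eps=0.1, seed=42):
    """Pull one arm per step: with probability eps explore (random arm),
    otherwise exploit (arm with the highest observed average reward)."""
    rng = random.Random(seed)
    counts = [0] * len(probs)     # pulls per arm
    totals = [0.0] * len(probs)   # cumulative reward per arm
    reward = 0.0
    for _ in range(steps):
        if rng.random() < eps or 0 in counts:
            arm = rng.randrange(len(probs))   # explore
        else:
            arm = max(range(len(probs)),
                      key=lambda i: totals[i] / counts[i])  # exploit
        r = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward, counts

reward, counts = epsilon_greedy_bandit([0.3, 0.8])
print(reward, counts)  # the better arm (index 1) is pulled far more often
```

Short-term reward comes from exploiting the best arm found so far; long-term regret is minimized by exploring enough to be confident it really is the best arm.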
11.2.2 Data Science and Machine Learning Tasks
Outside of academia, DS/ML/AI methods are only really interesting when they help users ask and answer useful questions. Compare, for instance:
Analytics – “How many clicks did this link get?”
Data Science – “Based on the previous history of clicks on links of this publisher’s site, can I predict how many people from Manitoba will read this specific page in the next three hours?” or “Is there a relationship between the history of clicks on links and the number of people from Manitoba who will read this specific page?”
Quantitative Methods – “We have no similar pages whose history could be consulted to make a prediction, but we have reasons to believe that the number of hits will be strongly correlated with the temperature in Winnipeg. Using the weather forecast over the next week, can we predict how many people will access the specific page during that period?”
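The third scenario can be sketched concretely: fit a straight line (ordinary least squares) relating temperature to page hits, then predict hits from a forecast. All the numbers below are invented for illustration – the point is only the shape of the reasoning, a first-principles relationship rather than a history of similar pages.

```python
# Hypothetical past observations: Winnipeg temperatures (°C) and page hits.
temps = [-10.0, -5.0, 0.0, 5.0, 10.0]
hits  = [520.0, 480.0, 455.0, 410.0, 380.0]

# Ordinary least squares for a line hits = intercept + slope * temp.
n = len(temps)
mean_t, mean_h = sum(temps) / n, sum(hits) / n
slope = (sum((t - mean_t) * (h - mean_h) for t, h in zip(temps, hits))
         / sum((t - mean_t) ** 2 for t in temps))
intercept = mean_h - slope * mean_t

forecast = 3.0  # tomorrow's forecast (°C)
predicted_hits = intercept + slope * forecast
print(round(predicted_hits))  # -> 428
```

With these made-up numbers the fit is exact (slope −7, intercept 449), so a 3°C forecast predicts 428 hits; real data would of course scatter around any such line.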
Data science and machine learning models are usually predictive (not explanatory): they show connections, and exploit correlations to make predictions, but they don’t reveal why such connections exist.
Quantitative methods, on the other hand, usually assume a certain level of causal understanding based on various first principles. That distinction is not always understood properly by clients and consultants alike. Common data science tasks (with representative questions) include:
classification and probability estimation – which undergraduates are likely to succeed at the graduate level?
value estimation – how much is a given client going to spend at a restaurant?
similarity matching – which prospective clients are most similar to a company’s established best clients?
clustering – do signals from a sensor form natural groups?
association rules discovery – what books are commonly purchased together at an online retailer?
profiling and behaviour description – what is the typical cell phone usage of a given customer segment?
link prediction – J. and K. have 20 friends in common: perhaps they’d be great friends?
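One of these tasks – association rules discovery – is easy to prototype by counting item pairs that co-occur in the same “basket.” The baskets and genre names below are fabricated for illustration; real rule mining (e.g. the Apriori approach) adds confidence and lift measures on top of this support count.

```python
from itertools import combinations
from collections import Counter

# Made-up purchase baskets at a hypothetical online book retailer.
baskets = [
    {"sci-fi", "fantasy", "cookbook"},
    {"sci-fi", "fantasy"},
    {"cookbook", "gardening"},
    {"sci-fi", "fantasy", "gardening"},
]

# Count every unordered pair of items bought together in a basket.
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)
# Support = fraction of baskets containing the pair.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
print(max(support, key=support.get))  # -> ('fantasy', 'sci-fi'), support 0.75
```

Here “fantasy and sci-fi are commonly purchased together” emerges purely from co-occurrence counts – no labels, and no explanation of *why* the pairing occurs.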
A classic example is provided by the UCI Machine Learning Repository Mushroom Dataset. Consider Amanita muscaria (commonly known as the fly agaric), a specimen of which is shown below.
Is it edible, or poisonous? There is a simple way to get an answer – eat it, wait, and see: if you do not die or get sick upon ingestion, it was edible; otherwise it was poisonous.
This test is unappealing for various reasons, however. Apart from the obvious risk of death, we might not learn much from the experiment; it is possible that this specific specimen was poisonous due to some mutation or some other factor (or that you had a pre-existing condition which combined with the fungus to cause you discomfort, etc.), and that fly agaric is actually edible in general (unlikely, but not impossible).
A predictive model, built from the features of a vast collection of mushroom species and specimens (including whether each was poisonous or edible), could help shed light on the matter: what do poisonous mushrooms have in common? What properties do edible mushrooms share?166
For instance, let’s say that Amanita muscaria has the following features:
gill size: narrow;
cap color: red.
We do not know a priori whether it is poisonous or edible. Is the available information sufficient to answer the question? Not on its own, no.167
But we could use past data, with correct edible or poisonous labels and the same set of predictors, to build various supervised classification models to attempt to answer the question. One such simple model, a decision tree, is shown below.
The model prediction for Amanita muscaria follows the decision path shown in Figure 11.3:
some mushroom odors (musty, spicy, etc.) are associated with poisonous mushrooms, some (almond, anise) with edible mushrooms, but there are mushrooms with no specific odor in either category – for mushrooms with ‘no odor’ (as is the case with Amanita muscaria), odor does not provide enough information for proper classification and we need to incorporate additional features into the decision path;
among mushrooms with no specific odor, some spore colours (black, etc.) are associated with edible mushrooms, others with poisonous mushrooms, but there are mushrooms with ‘white’ spores in either category – the combination ‘no odor and white spores’ does not provide enough information to classify Amanita muscaria and we need to incorporate additional features into the decision path;
among mushrooms of no specific odor with white spores, some habitats (grasses, paths, wastes) are associated with edible mushrooms, but there are mushrooms in either category that are found in the ‘woods’ – the combination ‘no odor, white spores, found in the woods’ does not provide enough information to classify Amanita muscaria and we need to incorporate additional features into the decision path;
among white-spored forest mushrooms with no specific odor, a ‘broad’ gill size is associated with edible mushrooms, whereas a ‘narrow’ gill size is associated with poisonous mushrooms – as Amanita muscaria is a narrow-gilled, white-spored forest mushroom with no specific odor, the decision path predicts that it is poisonous.
Note that cap color does not affect the decision path.168
The decision tree model does not explain why this particular combination of features is associated with poisonous mushrooms – the decision path is not causal.
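The decision path described above can be transcribed as nested rules. This is only a sketch: the category lists come from the bullets above, not from the full tree learned on the UCI data, and any combination not covered by those bullets is left as “unknown.”

```python
def classify_mushroom(odor, spore_color, habitat, gill_size):
    """Decision path from the text: odor -> spore colour -> habitat -> gill size.
    Only the branches mentioned above are filled in (illustrative sketch)."""
    if odor in {"musty", "spicy"}:
        return "poisonous"
    if odor in {"almond", "anise"}:
        return "edible"
    # no specific odor: fall through to spore colour
    if spore_color != "white":
        return "edible" if spore_color == "black" else "unknown"
    # white spores: fall through to habitat
    if habitat in {"grasses", "paths", "wastes"}:
        return "edible"
    if habitat == "woods":
        # white-spored forest mushrooms with no odor: gill size decides
        return "edible" if gill_size == "broad" else "poisonous"
    return "unknown"

# Amanita muscaria: no odor, white spores, found in the woods, narrow gills.
print(classify_mushroom("none", "white", "woods", "narrow"))  # -> poisonous
```

Notice that cap color does not even appear as an argument the rules consult – mirroring the observation that it plays no role in the decision path – and that nothing in the nested conditions says *why* the final combination is dangerous.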
At this point, a number of questions naturally arise:
Would you have trusted an edible prediction?
How are the features measured?
What is the true cost of making a mistake?
Is the data on which the model is built representative?
What data is required to build trustworthy models?
What do we need to know about the model in order to trust it?