7.5 Getting Insight From Data
With all of the appropriate context now in mind, we can finally turn to the main attraction, data analysis proper. Let us start this section with a few definitions, in order to distinguish between some of the common categories of data analysis.
What is Data Analysis?
We view finding patterns in data as being data analysis’s main goal. Alternatively, we could describe the data analysis process as using data to:
answer specific questions;
help in the decision-making process;
create models of the data;
describe or explain the situation or system under investigation,
etc.
While some practitioners include other analytical-like activities, such as testing (scientific) hypotheses, or carrying out calculations on data, we think of those as separate activities.
What is Data Science?
One of the challenges of working in the data science field is that nearly all quantitative work can be described as data science (often to a ridiculous extent).
Our simple definition paraphrases T. Kwartler: data science is the collection of processes by which we extract useful and actionable insights from data. Robinson [135] further suggests that these insights usually come via visualization and (manual) inferential analysis.
The noted data scientist H. Mason thinks of the discipline as “the working intersection of statistics, engineering, computer science, domain expertise, and ‘hacking’” [136].
What is Machine Learning?
Starting in the 1940s, researchers began to take seriously the idea that machines could be taught to learn, adapt and respond to novel situations.
A wide variety of techniques, accompanied by a great deal of theoretical underpinning, were created in an effort to achieve this goal.
Machine learning is typically used to obtain “predictions” (or “advice”), while reducing the operator’s analytical, inferential and decisional workload (although it is still present to some extent) [135].
What is Artificial/Augmented Intelligence?
The science fiction answer is that artificial intelligence is non-human intelligence that has been engineered rather than one that has evolved naturally. Practically speaking, this translates to “computers carrying out tasks that only humans can do”.
A.I. attempts to remove the need for oversight, allowing for automatic “actions” to be taken by a completely unattended system.
These goals are laudable in an academic setting, but we believe that stakeholders (and humans, in general) should not seek to abdicate all of their agency in the decision-making process. As such, we follow the lead of various thinkers and suggest further splitting A.I. into general A.I. (which would operate independently of human intelligence) and augmented intelligence (which enhances human intelligence).
These approaches can be further broken down into four key buckets (see Figure 7.9), moving roughly from low value/low difficulty propositions (left) to high value/high difficulty propositions (right).
For instance, a shoe store could conduct the following analyses:
Descriptive: sales report;
Diagnostic: why did the sales take a large dip?
Predictive: what is the sales forecast for next quarter?
Prescriptive: how should we change the product mix to reach our target sales goal?
7.5.1 Asking the Right Questions
Definitions aside, however, data analysis, data science, machine learning, and artificial intelligence are about asking questions and providing answers to these questions. We might ask various types of questions, depending on the situation.
Our position is that, from a quantitative perspective, there are only really three types of questions:
analytics questions;
data science questions, and
quantitative methods questions.
Analytics questions could be something as simple as:
how many clicks did a specific link on my website get?
Data science questions tend to be more complex – we might ask something along the lines of:
if we know, historically, when or how often people click on links, can we predict how many people from Winnipeg will access a specific page on our website within the next three hours?
Whereas analytics-type questions are typically answered by counting things, data science-like questions are answered by using historical patterns to make predictions.
Quantitative methods questions might, in our view, be answered by making predictions but not necessarily based on historical data. We could build a model from first principles – the “physics” of the situation, as it were – to attempt to figure out what might happen.
For instance, if we thought there was a correlation between the temperature in Winnipeg and whether or not people click on the links in our website, then we might build a model that predicts “how many people from Winnipeg will access a page in the next week?”, say, by trying to predict the weather instead,^{126} which is not necessarily an easy task.
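To make the distinction concrete, here is a minimal Python sketch (with invented click data) contrasting an analytics-type count with a naive data science-like forecast; a real predictive model would, of course, be more sophisticated than a moving average.

```python
# Toy click log: (page, city) pairs -- hypothetical data for illustration.
clicks = [
    ("pricing", "Winnipeg"), ("pricing", "Ottawa"),
    ("pricing", "Winnipeg"), ("blog", "Winnipeg"),
]

# Analytics-type question: how many clicks did the pricing page get?
n_pricing = sum(1 for page, _ in clicks if page == "pricing")

# Data science-like question: given historical hourly counts, predict the
# next hour -- here with a naive moving average as a stand-in for a model.
hourly_counts = [12, 15, 11, 14, 13]
forecast = sum(hourly_counts[-3:]) / 3

print(n_pricing)   # counting answers the analytics question
print(forecast)    # a (crude) prediction answers the data science question
```

The point is not the particular model, but that the second question cannot be answered by counting alone.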
Analytics models do not usually predict or explain anything – they just report on the data, which is itself meant to represent the situation. A data mining or a data science model tends to be predictive, but not necessarily explanatory – it shows the existence of connections, of correlations, of links, but without explaining why the connections exist.
In a quantitative method model, we may start by assuming that we know what the links are, what the connections are – which presumably means that we have an idea as to why these connections exist^{127} – and then we try to explore the consequences of the existence of these connections and these links.
This leads to a singular realization that we share with new data scientists and analysts (potentially the single most important piece of advice they will receive in their quantitative career, and we are only half-joking when we say it):
not every situation calls for analytics, data science, statistical analysis, quantitative methods, machine learning, or A.I.
Take the time to identify instances where more is asked out of the data than what it can actually yield, and be prepared to warn stakeholders, as early as possible, when such a situation is encountered.
If we cannot ask the right questions of the data, of the client, of the situation, and so on, any associated project is doomed to fail from the very beginning. Without questions to answer, analysts are wasting their time, running analyses for the sake of analysis – the finish line cannot be reached if there is no finish line.
In order to help clients/stakeholders, data analysts and scientists need:
questions to answer;
questions that can be answered by the types of methods and skills at their disposal, and
answers that will be recognized as answers.
“How many clicks did this link get?” is a question that is easily answerable if we have a dataset of links and clicks, but it might not be a question that the client cares to see answered.
Data analysts and scientists often find themselves in a situation where they ask the types of questions that can be answered with the available data, but the answers might not actually prove useful.
From a data science perspective, the right question is one that leads to actionable insights. And it might mean that old data is discarded and new data is collected in order to answer it. Analysts should beware: given the sometimes onerous price tag associated with data collection, it is not altogether surprising that there will sometimes be pressure from above to keep working with the available data.
Stay strong – analysis on the wrong dataset is the wrong analysis!
The Wrong Questions
Wrong questions might be:
questions that are too broad or too narrow;
questions that no amount of data could ever answer,
questions for which data cannot reasonably be obtained, etc.
One of the issues with “wrong” questions is that they do not necessarily “break the pipeline”:
in the best-case scenario, stakeholders, clients, colleagues will still recognize the answers as irrelevant;
in the worst-case scenario, policies will erroneously be implemented (or decisions made) on the basis of answers that have not been identified as misleading and/or useless.
Framing Questions
In general, data science questions are used to:
solve problems (fix pressing issues, understand why something is or isn’t happening, etc.);
create meaningful change (create new standards in the company, etc.),
support gut feelings (approve or disprove blind intuition).
One thing to note is that individuals prefer to answer a question quickly, especially in their area of expertise. It is also strongly suggested that analysts avoid glancing at the data before they settle on the question(s), to avoid “begging the question”. Finally, note that just as we can be blinded by love, we can also be blinded by solutions: the right solution to the right question is not necessarily the “sexiest” solution.
The website kdnuggets.com suggests the following roadmap to framing questions:
Understand the problem (opportunity vs problem)
What initial assumptions do I have about the situation?
How will the results be used?
What are the risks and/or benefits of answering this question?
What stakeholder questions might arise based on the answer(s)?
Do I have access to the data necessary for answering this question?
How will I measure my “success” criteria?
Example: Should I buy a house? But this is a bit vague; perhaps, instead, the question could be: should I buy a single house in Scotland? [based on an example by M. Kashef]
Answer: Let’s use the roadmap.
Understand the problem. I’ve been renting for two years and feel like I’m throwing my money away. I want a chance to invest in my own space instead of someone else’s.
What initial assumptions do I have about the situation? It’s going to be expensive but worth it – it’ll be an investment that appreciates over time.
How will the results be used? Either to buy a house or rent a bit longer to save more for a larger down payment.
What are the risks and/or benefits of answering this question? Risk: I could put myself under immense debt and become “house poor”. Benefits: I could get into the market just in time to make a fortune, and I won’t have to live under the uncertainty from my landlord possibly selling his home.
What stakeholder questions might arise based on the answer(s)? Would this new home be in an area that’s safe for kids? Will it be close to my workplace?
Do I have access to the data necessary to answer this question? Yes, through my real estate agent and online real estate brokerages, I can keep my finger on the pulse of the market.
How will I measure my “success” criteria? If I manage to buy a forever home within my $600k budget, say.
Additional Considerations
Specific questions are preferred over vague questions; questions that encourage qualification/quantification are preferred over Yes/No questions.
Here are some examples of Yes/No questions, which should be avoided [Healthy Families BC]:
Is our revenue increasing over time? Has it increased year-over-year?
Are most of our customers from this demographic?
Does this project have valuable ambitions for the broader department?
How great is our hardworking customer success team?
How often do you triple check your work?
Consider using the following questions, instead:
What’s the distribution of our revenues over the past three months?
Where are our top 5 high-spending cohorts from?
What are the different benefits of pursuing this project?
What are three good and bad traits of our customer success team?
What kind of quality assurance testing do you carry out on your deliverables?
Question Audit Checklist [The Head Game]:
Did I avoid creating any yes/no questions?
Would anyone in my team/department understand the question irrespective of their backgrounds?
Does the question need more than one sentence to express?
Is the question ‘balanced’ (a scope neither so broad that the question can never truly be answered, nor so narrow that the resulting impact is minimal)?
Is the question being skewed to what may be easier to answer for my/my team’s particular skillset(s)?
Exercises
Are the following examples of good questions? Are they vague or specific? What are the ranges of answers we could expect? How would you improve them?
How does rain affect goal percentage at a soccer match?
Did the Toronto Maple Leafs beat the Edmonton Oilers?
Did you like watching the Tokyo Olympics?
What types of recovery drinks do hockey players drink?
How many medals will Canada win at the Paris 2024 Olympics?
Should we fund the Canadian Basketball team more than the Canadian Hockey team?
7.5.2 Structuring and Organizing Data
Let us now resume the discussion that was cut short in What Is Data? and From Objects and Attributes to Datasets.
Data Sources
We cannot have insights from data without data. As with many of the points we have made, this may seem trivially obvious, but there are many aspects of data acquisition, structuring, and organization that have a sizable impact on what insights can be squeezed from data.
More specifically, there are a number of questions that can be considered:
why do we collect data?
what can we do with data?
where does data come from?
what does “a collection” of data look like?
how can we describe data?
do we need to distinguish between data, information, knowledge?^{128}
Historically, data has had three functions:
record keeping – people/societal management;
science – new general knowledge, and
intelligence – business, military, police, social, domestic, personal.
Traditionally, each of these functions has
used different sources of information;
collected different types of data, and
had different data cultures and terminologies.
As data science is an interdisciplinary field, it should come as no surprise that we may run into all of them on the same project (see Figure 7.10).
Ultimately, data is generated from making observations about and taking measurements of the world. In the process of doing so, we are already imposing particular conceptualizations and assumptions on our raw experience.
More concretely, data comes from a variety of sources, including:
records of activity;
(scientific) observations;
sensors and monitoring, and
more frequently lately, from computers themselves.
As discussed in Section The Analog/Digital Data Dichotomy, although data may be collected and recorded by hand, it is fast becoming a mostly digital phenomenon.
Computer science (and information science) has its own theoretical, fundamental viewpoint about data and information, operating over data in a fundamental sense – 1s and 0s that represent numbers, letters, etc. Pragmatically, the resulting data is now stored on computers, and is accessible through our worldwide computer network.
While data is necessarily a representation of something else, analysts should endeavour to remember that the data itself still has physical properties: it takes up physical space and requires energy to work with.
In keeping with this physical nature, data also has a shelf life – it ages over time. We use the phrase “rotten data” or “decaying data” in one of two senses:
literally, as the data storage medium might decay, but also
metaphorically, as when it no longer accurately represents the relevant objects and relationships (or even when those objects no longer exist in the same way) – compare with “analytical decay” (see Model Assessment and Life After Analysis).
Useful data must stay ‘fresh’ and ‘current’, and avoid going ‘stale’ – but that is both context- and model-dependent!
Before the Data
The various data-using disciplines share some core (systems) concepts and elements, which should resonate with the systems modeling framework previously discussed in Conceptual Frameworks for Data Work:
all objects have attributes, whether concrete or abstract;
for multiple objects, there are relationships between these objects/attributes, and
all these elements evolve over time.
The fundamental relationships include:
part–whole;
is–a;
is–a–type–of;
cardinality (one-to-one, one-to-many, many-to-many),
etc.,
while object-specific relationships include:
ownership;
social relationship;
becomes;
leads-to,
etc.
Objects and Attributes
We can examine concretely the ways in which objects have properties, relationships and behaviours, and how these are captured and turned into data through observations and measurements, via the apple and sandwich example of What Is Data?.
There, we made observations of an apple instance, labeled the type of observation we made, and provided a value describing the observation. We can further use these labels when observing other apple instances, and associate new values for these new apple instances.
Regarding the fundamental and object-specific relationships, we might be able to see that:
an apple is a type of fruit;
a sandwich is part of a meal;
this apple is owned by Jen;
this sandwich becomes fuel,
etc.
It is worth noting that while this all seems tediously obvious to adult humans, it is not so from the perspective of a toddler, or an artificial intelligence. Explicitly, “understanding” requires a basic grasp of:
categories;
instances;
types of attributes;
values of attributes, and
which of these are important or relevant to a specific situation or in general terms.
From Attributes to Datasets
Were we to run around in an apple orchard, measuring and jotting down the height, width and colour of 83 different apples completely haphazardly on a piece of paper, the resulting data would be of limited value; although it would technically have been recorded, it would be lacking in structure.
We would not be able to tell which values were heights and which were widths, nor which colours or which widths were associated with which heights, and vice-versa. Structuring the data using lists, tables, or even tree structures allows analysts to record and preserve a number of important relationships:
those between object types and instances, property/attribute types (sometimes also called fields, features or dimensions), and values;
those between one attribute value and another value (i.e., both of these values are connected to this object instance);
those between attribute types, in the case of hierarchical data, and
those between the objects themselves (e.g., this car is owned by this person).
Tables, also called flat files, are likely the most familiar strategy for structuring data in order to preserve and indicate relationships. In the digital age, however, we are developing increasingly sophisticated strategies to store the structure of relationships in the data, and finding new ways to work with these increasingly complex relationship structures.
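The orchard example above can be sketched as a small structured table; here is a minimal illustration in Python, with made-up measurement values:

```python
# A flat table (list of records): each value stays tied to its apple
# instance and attribute type, unlike haphazard jottings on paper.
apples = [
    {"id": 1, "height_cm": 7.1, "width_cm": 7.8, "colour": "red"},
    {"id": 2, "height_cm": 6.4, "width_cm": 7.0, "colour": "green"},
    {"id": 3, "height_cm": 7.5, "width_cm": 8.1, "colour": "red"},
]

# Because the structure is preserved, we can still ask which widths go
# with which heights, or filter by attribute value:
red_heights = [a["height_cm"] for a in apples if a["colour"] == "red"]
print(red_heights)
```

Every record carries the same labeled fields, which is precisely what the haphazard paper jottings lacked.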
Formally, a data model is an abstract (logical) description of both the dataset structure and the system, constructed in terms that can be implemented in data management software. In a sense, data models lie halfway between conceptual models and database implementations. The data proper relates to instances; the model to object types.
Ontologies provide an alternative representation of the system: simply put, they are structured, machine-readable collections of facts about a domain.^{129} In a sense, an ontology is an attempt to get closer to the level of detail of a full conceptual model, while keeping the whole machine-readable (see Figure 7.11 for an example).
Every time we move from a conceptual model to a specific type of model (a data model, a knowledge model), we lose some information. One way to preserve as much context as possible in these new models is to also provide rich metadata – data about the data! Metadata is crucial when it comes to successfully working with and across datasets. Ontologies can also play a role here, but that is a topic for another day.
Typically, data is stored in a database. A major motivator for some of the new developments in types of databases and other data storage strategies is the increasing availability of unstructured and (so-called) ‘BLOB’ data.
Structured data is labeled, organized, and discrete, with a predefined and constrained form. With that definition, for instance, data that is collected via an e-form that only uses drop-down menus is structured.
Unstructured data, by comparison, is not organized, and does not appear in a specific predefined data structure – the classical example is text in a document. The text may have to conform to specific syntactic and semantic rules to be understandable, but in terms of storage (where spelling mistakes and meaning are irrelevant), it is highly unstructured, since any one entry is likely to differ completely from another in length, content, etc.
The acronym “BLOB” stands for Binary Large Object data, such as images, audio files, or general multimedia files. Some of these files can be structured-like (all pictures taken from a single camera, say), but they are usually quite unstructured, especially in multimedia modes.
Not every type of database is wellsuited to all data types. Let us look at four currently popular database options in terms of fundamental data and knowledge modeling and structuring strategies:
key-value pairs (e.g., JSON);
triples (e.g., the Resource Description Framework – RDF);
graph databases, and
relational databases.
Key-Value Stores
In these, all data is simply stored as a giant list of keys and values, where the ‘key’ is a name or a label (possibly of an object) and the ‘value’ is a value associated with this key; triple stores operate on the same principle, but data is stored according to ‘subject – predicate – object’.
The following examples illustrate these concepts:
The apple type – apple colour key-value store might contain “Granny Smith – green” and “Red Delicious – red”.
The person – shoe size key-value store might contain “Jen Schellinck – women's size 7” and “Colin Henein – men's size 10”.
Other key-value stores: word – definition, report name – report (document file), url – webpage.
Triple stores just add a verb to the mix: person – is – age might contain “Elowyn – is – 18”, “Llewellyn – is – 8”, and “Gwynneth – is – 4”, while object – is-colour – colour might contain “apple – is-colour – red” and “apple – is-colour – green”.
Both strategies result in a large amount of flexibility when it comes to the ‘design’ of the data storage, and not much needs to be known about the data structure prior to implementation. Additionally, missing values do not take up any space in such stores.
In terms of their implementation, the devil is in the details; note that their extreme flexibility can also be a flaw [139], and it can be difficult to query them and find the data of interest.
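A rough Python sketch of the two strategies, using the examples above (real key-value and triple stores add persistence, indexing, and query languages on top of this idea):

```python
# Key-value store: apple type -> colour.
kv_store = {"Granny Smith": "green", "Red Delicious": "red"}

# Triple store: (subject, predicate, object) facts.
triples = [
    ("Elowyn", "is", 18),
    ("apple", "is-colour", "red"),
    ("apple", "is-colour", "green"),
]

# Lookups: direct by key for the key-value store, or by
# scanning/filtering the facts for the triple store.
colour = kv_store["Granny Smith"]
apple_colours = [o for s, p, o in triples if s == "apple" and p == "is-colour"]
print(colour, apple_colours)
```

Note how neither structure requires a schema to be fixed in advance, and how a missing fact simply does not appear.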
Graph Databases
In graph databases, the emphasis is placed on the relationships between different types of objects, rather than between an object and the properties of that object:
the objects are represented by nodes;
the relationships between these objects are represented by edges, and
objects can have a relationship with other objects of the same type (such as person is-a-sibling-of person).
They are fast and intuitive when using relation-based data, and might in fact be the only reasonable option in that case, as traditional databases may slow to a crawl. But they are probably too specialized for non-relation-based data, and they are not yet widely supported.
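A toy illustration, in Python, of the node-and-edge view (the node identifiers and relationships here are invented):

```python
# Nodes carry labels/attributes; edges carry typed relationships.
nodes = {
    "p1": {"type": "person", "name": "Jen"},
    "p2": {"type": "person", "name": "Colin"},
    "a1": {"type": "apple", "colour": "red"},
}
edges = [
    ("p1", "is-a-sibling-of", "p2"),   # same-type relationship
    ("p1", "owns", "a1"),              # cross-type relationship
]

# Relationship-first query: to whom/what is p1 related, and how?
p1_relations = [(rel, dst) for src, rel, dst in edges if src == "p1"]
print(p1_relations)
```

A genuine graph database would index both endpoints of every edge so that such traversals stay fast as the graph grows.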
Relational Databases
In relational databases, data is stored in a series of tables. Broadly speaking, each table represents a type of object and some properties related to this type of object; special columns in tables connect object instances across tables (the entity-relationship diagram (ERD) of Figure 7.4 is an example of a relational database model).
For instance, a person lives in a house, which has a particular address. Sometimes that property of the house will be stored in the table that stores information about individuals; in other cases, it will make more sense to store information about the house in its own table.
The form of a relational database is driven by the cardinality of the relationships (one-to-one, one-to-many, or many-to-many). These concepts are illustrated in the cheat sheet found in Figure 7.12.
Relational databases are widely supported and well understood, and they work well for many types of systems and use cases. Note, however, that it is difficult to modify them once they have been implemented and that, despite their name, they do not really handle relationships all that well.
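Since SQLite comes up later in this section, here is a minimal sketch of a relational design using Python's built-in sqlite3 module; the person/house schema and values are invented for illustration:

```python
import sqlite3

# Two tables linked by a foreign key, as in the person/house example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE house  (id INTEGER PRIMARY KEY, address TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT,
                         house_id INTEGER REFERENCES house(id));
    INSERT INTO house  VALUES (1, '12 Oak St');
    INSERT INTO person VALUES (1, 'Jen', 1);
""")

# The cross-table relationship is resolved with a join.
row = con.execute("""
    SELECT person.name, house.address
    FROM person JOIN house ON person.house_id = house.id
""").fetchone()
print(row)
con.close()
```

Whether the address lives in its own table or in the person table is exactly the kind of design decision the cardinality of the relationship should drive.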
Spreadsheets
We have said very little about keeping data in a single giant table (spreadsheet, flat file), or in multiple spreadsheets (we purposely kept this option out of the original list of modeling and structuring strategies).
On the positive side, spreadsheets are very efficient when working with:
static data (e.g., it is only collected once), or
data about one particular type of object (e.g., scientific studies).
Most implementations of analytical algorithms require the data to be found in one location (such as an R data frame). Since the data will eventually need to be exported to a flat file anyway, why not remove the middle step and work with spreadsheets in the first place?
The problem is that it is hard to manage data integrity with spreadsheets over the long term when data is collected (and processed) continuously. Furthermore, flat files are not ideal when working with systems involving many different types of objects and their relationships, and they are not optimized for querying operations.
For small datasets or quick-and-dirty work, flat files are often a reasonable option, but analysts should look for alternatives when working on large-scale projects.
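A quick-and-dirty flat-file workflow might look like the following Python sketch (the file contents are invented):

```python
import csv
import io

# A small, static flat file, read all at once -- the sweet spot for
# spreadsheets/flat files described above.
flat = io.StringIO("id,height_cm,colour\n1,7.1,red\n2,6.4,green\n")

rows = list(csv.DictReader(flat))
heights = [float(r["height_cm"]) for r in rows]
print(len(rows), heights)
```

Nothing here enforces integrity (types, allowed values, uniqueness of ids), which is precisely what becomes painful once collection is continuous.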
All in all, we have provided very little in the way of concrete information on the topic of databases and data stores. Be aware that, time and time again, projects have sunk when this aspect of the process has not been taken seriously. Simply put, serious analyses cannot be conducted properly without the right data infrastructure.
Implementing a Model
In order to implement the data/knowledge model, data engineers and database specialists need access to data storage and management software. Gaining this access might be challenging for individuals or small teams as the required software traditionally runs on servers.
A server allows multiple users to access the database simultaneously, from different client programs. The other side of the coin is that servers make it difficult to ‘play’ with the database.
User-friendly embedded database software (as opposed to client-server database engines), such as SQLite, can help overcome some of these obstacles. Data management software lets human agents interact easily with their data – in a nutshell, it provides a human–data interface, through which
data can be added to a data collection;
subsets can be extracted from a data collection based on certain filters/criteria, and
data can be deleted from (or edited in) a data collection.
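The three interactions above can be sketched with SQLite's embedded engine (the table and values are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clicks (page TEXT, city TEXT)")

# Add data to the collection:
con.executemany("INSERT INTO clicks VALUES (?, ?)",
                [("pricing", "Winnipeg"), ("blog", "Ottawa")])

# Extract a subset based on a filter/criterion:
winnipeg = con.execute(
    "SELECT page FROM clicks WHERE city = 'Winnipeg'").fetchall()

# Delete data from the collection:
con.execute("DELETE FROM clicks WHERE city = 'Ottawa'")
remaining = con.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
print(winnipeg, remaining)
con.close()
```

Because SQLite is embedded (a file or in-memory store, no server), it is easy to ‘play’ with in exactly the way the text describes.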
But tempora mutantur, nos et mutamur in illis^{130} – whereas we used to speak of:
databases and database management systems;
data warehouses (data management system designed to enable analytics);
data marts (used to retrieve clientfacing data, usually oriented to a specific business line or team);
Structured Query Language (SQL, a commonly used programming language that helps manage (and perform operations on) relational databases),
we now speak of (see [141]):
data lakes (centralized repository in which to store structured/unstructured data alike);
data pools (a small collection of shared data that aspires to be a data lake, someday);
data swamps (unstructured, ungoverned, and out of control data lake in which data is hard to find/use and is consumed out of context, due to a lack of process, standards and governance);
database graveyards (where databases go to die?),
and data might be stored in non-traditional data structures, such as NoSQL databases. Popular NoSQL database software includes ArangoDB, MongoDB, Redis, Amazon DynamoDB, OrientDB, Azure CosmosDB, Aerospike, etc.
Once a logical data model is complete, we need only:
instantiate it in the chosen software;
load the data, and
query the data.
Traditional relational databases use SQL; other types of databases either use other query languages (AQL, semantic engines, etc.) or rely on bespoke (tailored) computer programs (e.g. written in R, Python, etc.).
Once a data collection has been created, it must be managed, so that the data remains accurate, precise, consistent, and complete. Databases decay, after all; if a data lake turns into a data swamp, it will be difficult to squeeze usefulness out of it!
Data and Information Architectures
There is no single correct structure for a given collection of data (or dataset).
Rather, consideration must be given to:
the type of relationships that exist in the data/system (and are thought to be important);
the types of analysis that will be carried out, and
the data engineering requirements relating to the time and effort required to extract and work with the data.
The chosen structure, which stores and organizes the data, is called the data architecture. Designing a specific architecture for a data collection is a necessary part of the data analysis process. The data architecture is typically embedded in the larger data pipeline infrastructure described in Automated Data Pipelines.
As another example, automated data pipelines in the service delivery context are usually implemented with nine components (five stages and four transitions, as in Figure 7.13):
data collection;
data storage;
data preparation;
data analysis, and
data presentation.
Note that model validation could be added as a sixth stage, to combat model “drift”.
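The five stages can be caricatured as a chain of functions; in this Python sketch, the toy data and the stage implementations are invented stand-ins:

```python
def collect():                 # data collection
    return ["3", "5", "bad", "4"]

def store(raw):                # data storage (here, just an in-memory list)
    return list(raw)

def prepare(stored):           # data preparation: clean and convert
    return [int(x) for x in stored if x.isdigit()]

def analyze(clean):            # data analysis: a summary statistic
    return sum(clean) / len(clean)

def present(result):           # data presentation
    return f"mean = {result:.1f}"

report = present(analyze(prepare(store(collect()))))
print(report)
```

The four function-call boundaries play the role of the four transitions; a validation stage would slot in between analysis and presentation.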
By analogy with the human body, the data storage component, which houses the data and its architecture, is the “heart” of the pipeline (the engine that makes the pipeline go), whereas the data analysis component is its “brain.”^{131}
Most analysts are familiar with the mathematical and statistical models that are implemented in the data analysis component. Data models, by contrast, tend to be constructed separately from the analytical models, at the data storage stage. This separation can be problematic if the analytical model is not compatible with the data model. Consider, for example, an analyst who needs a flat file (with variables represented as columns) to feed into an algorithm implemented in R, say. If the data comes from forms with various fields stored in a relational database, the discrepancy could create difficulties on the data preparation side of the process.
Building both the analytical model and the data model off of a common conceptual model might help the data science team avoid such quandaries.
In essence, the task is to structure and organize both data and knowledge so that it can be:
stored in a useful manner;
added to easily;
usefully and efficiently extracted from that store (the “extracttransformload” (ETL) paradigm), and
operated over by humans and computers alike (programs, bots, A.I.) with minimal external modification.
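A bare-bones illustration of the “extract-transform-load” (ETL) paradigm in Python (the source records and the transformation are invented):

```python
# Source store: records as they were captured (e.g., from forms).
source = [{"name": "Jen", "shoe": "women's size 7"},
          {"name": "Colin", "shoe": "men's size 10"}]

def extract(store):
    return iter(store)                    # read records out of the store

def transform(records):
    for r in records:                     # reshape: pull the size number out
        yield (r["name"], int(r["shoe"].split()[-1]))

def load(rows, target):
    target.extend(rows)                   # write to the analysis-ready table
    return target

table = load(transform(extract(source)), [])
print(table)
```

Real ETL tooling adds scheduling, error handling, and provenance tracking, but the three-step shape is the same.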
7.5.3 Basic Data Analysis Techniques
Business Intelligence (BI) has evolved over the years:
we started to recognize that data could be used to gain a competitive advantage at the end of the 19th century;
the 1950s saw the first business database for decision support;
in the 1980s and 1990s, computers and data became increasingly available (data warehouses, data mining);
in the 2000s, the trend was to take business analytics out of the hands of data miners (and other specialists) and into the hands of domain experts,
now, big data and specialized techniques have arrived on the scene, as have data visualization, dashboards, and software-as-a-service.
Historically, BI has been one of the streams contributing to modernday data science, via:
system of interest: the commercial realm, specifically, the market of interest;
sources of data: transaction data, financial data, sales data, organizational data;
goals: provide awareness of competitors, consumers and internal activity and use this to support decision making,
culture and preferred techniques: data marts, key performance indicators, consumer behaviour, slicing and dicing, business ‘facts’.
But no matter the realm in which we work, the ultimate goal remains the same: obtaining actionable insight into the system of interest. This can be achieved in a number of ways. Traditionally, analysts hope to do so by seeking:
patterns – predictable, repeating regularities;
structure – the organization of elements in a system, and
generalization – the creation of general or abstract concepts from specific instances (see Figure 7.14).
The underlying analytical hope is to find patterns or structure in the data from which actionable insights arise.
While finding patterns and structure can be interesting in its own right (in fact, this is the ultimate reward for many scientists), in the data science context it is how these discoveries are used that trumps all.
Variable Types
In the example of a conceptual model shown in Figure 7.5, we have identified different types of variables. In an experimental setting, we typically encounter:
control/extraneous variables – we do our best to keep these controlled and unchanging while other variables are changed;
independent variables – we control their values as we suspect they influence the dependent variables,
dependent variables – we do not control their values; they are generated in some way during the experiment, and presumed dependent on the other factors.
For instance, we could be interested in the plant height (dependent) given the mean number of sunlight hours (independent), given the region of the country in which each test site is located (control).
Data Types
These variables need not be of the same type. In a typical dataset, we may encounter:
numerical data – integers or numerics, such as \(1\), \(7\), \(34.654\), \(0.000004\), etc.;
text data – strings of text, which may be restricted to a certain number of characters, such as “Welcome to the park”, “AAAAA”, “345”, “45.678”, etc.;
categorical data – variables taking a fixed number of values, which may be numeric or represented by strings, but for which there is no specific or inherent ordering, such as (‘red’,‘blue’,‘green’), (‘1’,‘2’,‘3’), etc.,
ordinal data – categorical data with an inherent ordering; unlike integer data, the spacing between values is not well-defined (very cold, cold, tepid, warm, super hot).
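In R, the categorical/ordinal distinction is mirrored by factors and ordered factors; a small sketch (the variable names are ours):

```r
# categorical: a factor with no inherent ordering
cols <- factor(c("red", "blue", "green", "red"))
levels(cols)          # levels are sorted alphabetically by default

# ordinal: an ordered factor with an explicit level ordering
temp <- factor(c("cold", "warm", "tepid"),
               levels = c("very cold", "cold", "tepid", "warm", "super hot"),
               ordered = TRUE)
temp[1] < temp[2]     # TRUE: comparisons respect the stated ordering
```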
We shall use the following artificial dataset to illustrate some of the concepts.
set.seed(0)
n.sample = 165
colour = factor(c("red","blue","green"))
p.colour = c(40,15,5)
year = factor(c(2012,2013))
p.year = c(60,40)
quarter = factor(c("Q1","Q2","Q3","Q4"))
p.quarter = c(20,25,30,35)
signal.mean = c(14,2,123)
p.signal.mean = c(5,3,1)
signal.sd = c(2,8,15)
p.signal.sd = c(2,3,4)
s.colour <- sample(length(colour), n.sample, prob=p.colour, replace=TRUE)
s.year <- sample(length(year), n.sample, prob=p.year, replace=TRUE)
s.quarter <- sample(length(quarter), n.sample, prob=p.quarter, replace=TRUE)
s.mean <- sample(length(signal.mean), n.sample, prob=p.signal.mean, replace=TRUE)
s.sd <- sample(length(signal.sd), n.sample, prob=p.signal.sd, replace=TRUE)
signal <- rnorm(n.sample, signal.mean[s.mean], signal.sd[s.sd])
new_data <- data.frame(colour[s.colour], year[s.year], quarter[s.quarter], signal)
colnames(new_data) <- c("colour","year","quarter","signal")
new_data |>
  dplyr::slice_head(n = 10) |>
  knitr::kable(
    caption = "The first ten rows of `new_data`"
  )
| colour | year | quarter | signal |
|--------|------|---------|------------|
| blue | 2013 | Q2 | 22.9981796 |
| red | 2012 | Q1 | 12.4557784 |
| red | 2012 | Q4 | 9.9353103 |
| red | 2012 | Q3 | 15.0472412 |
| blue | 2013 | Q2 | 6.1420338 |
| red | 2012 | Q4 | 13.4976708 |
| blue | 2013 | Q3 | 2.5600524 |
| green | 2013 | Q3 | 23.6368155 |
| red | 2013 | Q4 | 0.8701391 |
| red | 2012 | Q3 | 3.4207423 |
We can transform categorical data into numeric data by generating frequency counts of the different values/levels of the categorical variable; regular analysis techniques could then be used on the now numeric variable.^{132}
| Var1 | Freq |
|-------|------|
| blue | 41 |
| green | 10 |
| red | 114 |
Categorical data plays a special role in data analysis:
in data science, categorical variables come with a predefined set of values;
in experimental science, a factor is an independent variable with its levels being defined (it may also be viewed as a category of treatment),
in business analytics, these are called dimensions (with members).
However they are labeled, these variables can be used to subset or roll up/summarize the data.
Hierarchical / Nested / Multilevel Data
When a categorical variable has multiple levels of abstraction, each level can be viewed as a new categorical variable with predefined relationships to the more detailed levels.
This is commonly the case with time and space variables – we can ‘zoom’ in or out as needed, which allows us to discuss the granularity of the data, i.e., its ‘maximum zoom factor’.
For instance, observations could be recorded hourly, and then further processed (mean value, total, etc.) at the daily level, the monthly level, the quarterly level, the yearly level, etc., as seen below.
Let us start with the number of observations by year and quarter:
| year | quarter | n |
|------|---------|----|
| 2012 | Q1 | 21 |
| 2012 | Q2 | 17 |
| 2012 | Q3 | 30 |
| 2012 | Q4 | 37 |
| 2013 | Q1 | 14 |
| 2013 | Q2 | 11 |
| 2013 | Q3 | 20 |
| 2013 | Q4 | 15 |
We can also roll it up to the number of observations by year:
| year | n |
|------|-----|
| 2012 | 105 |
| 2013 | 60 |
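Counts like these can be produced with `dplyr::count()`; a sketch on a small stand-in for `new_data` (the values are illustrative):

```r
library(dplyr)

# a stand-in with the same structure as `new_data`
toy <- data.frame(
  year    = factor(c(2012, 2012, 2013, 2013, 2013)),
  quarter = factor(c("Q1", "Q3", "Q1", "Q2", "Q2"))
)

toy |> count(year, quarter)   # observations by year and quarter
toy |> count(year)            # rolled up to the yearly level
```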
Data Summarizing
The summary statistics of variables can help analysts gain basic univariate insights into the dataset (and hopefully, into the system with which it is associated).
These data summaries do not typically provide the full picture and connections/links between different variables are often missed altogether. Still, they often give analysts a reasonable sense for the data, at least for a first pass.
Common summary statistics include:
min – smallest value taken by a variable;
max – largest value taken by a variable;
median – “middle” value taken by a variable;
mean – average value taken by a variable;
mode – most frequent value taken by a variable;
# of obs – number of observations for a variable;
missing values – # of missing observations for a variable;
# of invalid entries – number of invalid entries for a variable;
unique values – unique values taken by a variable;
quartiles, deciles, centiles;
range, variance, standard deviation;
skew, kurtosis,
total, proportion, etc.
We can also perform operations over subsets of the data – typically over its columns, in effect compressing or ‘rolling up’ multiple data values into a single representative value, as below, say.
We start by creating a mode function (there isn’t one in base R):
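The code for this function did not survive typesetting; a minimal version consistent with the calls below (which use the name `mode.R`) might be:

```r
# sample mode: the most frequent value taken by a vector
# (base R's mode() returns the storage mode, not the statistical mode)
mode.R <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

mode.R(c("red", "blue", "red", "green"))   # "red"
```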
The data can be summarized using:
new_data |>
  summarise(n = n(), signal.mean = mean(signal), signal.sd = sd(signal),
            colour.mode = mode.R(colour))
| n | signal.mean | signal.sd | colour.mode |
|-----|-------------|-----------|-------------|
| 165 | 20.70894 | 38.39866 | red |
Typical rollup functions include the ‘mean’, ‘sum’, ‘count’, and ‘variance’, but these do not always give sensible outcomes: if the variable measures a proportion, say, the sum of that variable over all observations is a meaningless quantity on its own.
We can apply the same rollup function to many different columns, thus providing a mapping (list) of columns to values (as long as the computations all make sense – this might mean that all variables need to be of the same type in some cases).
We can map the mode to some dataset variables:
new_data |>
  summarise(year.mode = mode.R(year), quarter.mode = mode.R(quarter),
            colour.mode = mode.R(colour))
| year.mode | quarter.mode | colour.mode |
|-----------|--------------|-------------|
| 2012 | Q4 | red |
Datasets can also be summarized via contingency and pivot tables. A contingency table is used to examine the relationship between two categorical variables – specifically the frequency of one variable relative to a second variable (this is also known as crosstabulation).
Here is a contingency table, by colour and year:

| | 2012 | 2013 |
|-------|----|----|
| blue | 21 | 20 |
| green | 6 | 4 |
| red | 78 | 36 |
A contingency table, by colour and quarter:
| | Q1 | Q2 | Q3 | Q4 |
|-------|----|----|----|----|
| blue | 5 | 8 | 16 | 12 |
| green | 2 | 0 | 5 | 3 |
| red | 28 | 20 | 29 | 37 |
A contingency table, by year and quarter:
| | Q1 | Q2 | Q3 | Q4 |
|------|----|----|----|----|
| 2012 | 21 | 17 | 30 | 37 |
| 2013 | 14 | 11 | 20 | 15 |
A pivot table, on the other hand, is a table generated in a software application by applying operations (e.g. ‘sum’, ‘count’, ‘mean’) to variables, possibly based on another (categorical) variable, as below.
Here is a pivot table, signal characteristics by colour:
| colour | n | signal.mean | signal.sd |
|--------|-----|-------------|-----------|
| blue | 41 | 25.58772 | 40.64504 |
| green | 10 | 30.79947 | 49.71225 |
| red | 114 | 18.06916 | 36.51887 |
Contingency tables are a special instance of pivot tables (where the rollup function is ‘count’).
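A pivot table like the one above can be reproduced with grouped summaries in `dplyr`; a sketch on a small stand-in dataset (values are illustrative):

```r
library(dplyr)

toy <- data.frame(
  colour = factor(c("red", "red", "blue", "green")),
  signal = c(10.2, 25.9, 30.1, 12.4)
)

# one row of summary statistics per colour
# (groups with a single observation get an NA standard deviation)
toy |>
  group_by(colour) |>
  summarise(n = n(), signal.mean = mean(signal), signal.sd = sd(signal))
```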
Analysis Through Visualization
Consider the broad definition of analysis as:
identifying patterns or structure, and
adding meaning to these patterns or structure by interpreting them in the context of the system.
There are two general options to achieve this:
use analytical methods of varying degrees of sophistication, and/or
visualize the data and use the brain’s analytic (perceptual) power to reach meaningful conclusions about these patterns.
At this point, we will only list some simple visualization methods that are often (but not always) used to reveal patterns:
scatter plots are probably best suited for two numeric variables;
line charts, for a numeric variable and an ordinal variable;
bar charts, for one categorical and one numeric variable, or for multiple/nested categorical variables,
boxplots, histograms, bubble charts, small multiples, etc.
An in-depth discussion of data visualization is given in Data Visualization; best practices and a more complete catalogue are provided in [143].
7.5.4 Common Statistical Procedures in R
The underlying goal of statistical analysis is to reach an understanding of the data. In this section, we show how some of the most common basic statistical concepts that can help analysts reach that goal are implemented in R; a more thorough treatment of probability and statistics notions can be found in Math & Stats Overview.
Once the data is properly organized and visual exploration has begun in earnest, the typical next step is to describe the distribution of each variable numerically, followed by an exploration of the relationships among selected variables.
The objective is to answer questions such as:
What kind of mileage are cars getting these days? Specifically, what’s the distribution of miles per gallon (mean, standard deviation, median, range, and so on) in a survey of automobile makes and models?
After a new drug trial, what is the outcome (no improvement, some improvement, marked improvement) for drug versus placebo groups? Does the sex of the participants have an impact on the outcome?
What is the correlation between income and life expectancy? Is it significantly different from zero?
Are you more likely to receive imprisonment for a crime in different regions of Canada? Are the differences between regions statistically significant?
Basic Statistics
When it comes to calculating descriptive statistics, R can basically do it all.
We start with functions that are included in the base installation; we will then look at extensions that are available through user-contributed packages.
For illustrative purposes, we will use several of the variables from the Motor Trend Car Road Tests (mtcars) dataset provided in the base installation: we will focus on miles per gallon (mpg), horsepower (hp), and weight (wt):
| | mpg | hp | wt |
|-------------------|------|-----|-------|
| Mazda RX4 | 21.0 | 110 | 2.620 |
| Mazda RX4 Wag | 21.0 | 110 | 2.875 |
| Datsun 710 | 22.8 | 93 | 2.320 |
| Hornet 4 Drive | 21.4 | 110 | 3.215 |
| Hornet Sportabout | 18.7 | 175 | 3.440 |
| Valiant | 18.1 | 105 | 3.460 |
Let us first take a look at descriptive statistics for all 32 models.
In the base installation, we can use the summary() function to obtain descriptive statistics.
| mpg | hp | wt |
|----------------|-----------------|----------------|
| Min. : 10.40 | Min. : 52.0 | Min. : 1.513 |
| 1st Qu.: 15.43 | 1st Qu.: 96.5 | 1st Qu.: 2.581 |
| Median : 19.20 | Median : 123.0 | Median : 3.325 |
| Mean : 20.09 | Mean : 146.7 | Mean : 3.217 |
| 3rd Qu.: 22.80 | 3rd Qu.: 180.0 | 3rd Qu.: 3.610 |
| Max. : 33.90 | Max. : 335.0 | Max. : 5.424 |
The summary() function provides the minimum, maximum, quartiles, and mean for numerical variables, and the respective frequencies for factors and logical vectors.
In base R, the functions apply() or sapply() can be used to provide any descriptive statistics. The format in use is

> `sapply(x, FUN, options)`

where \(x\) is the data frame (or matrix) and FUN is an arbitrary function. If options are present, they are passed to FUN.
Typical functions that can be used include mean(), sd(), var(), min(), max(), median(), length(), range(), quantile(), and fivenum().
The next example provides several descriptive statistics using sapply(), including the skew and the kurtosis.
mystats <- function(x, na.omit=FALSE){
  if (na.omit)
    x <- x[!is.na(x)]
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x-m)^3/s^3)/n
  kurt <- sum((x-m)^4/s^4)/n - 3
  return(c(n=n, mean=m, stdev=s,
           skew=skew, kurtosis=kurt))
}
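The per-variable table below is then produced by applying mystats() to each column with sapply(); the variable selection `myvars` is our assumption, matching the three variables of interest:

```r
# mystats() as defined above (repeated here so the snippet is self-contained)
mystats <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]
  m <- mean(x); n <- length(x); s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt)
}

myvars <- c("mpg", "hp", "wt")
sapply(mtcars[myvars], mystats)   # one column of statistics per variable
```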
| | mpg | hp | wt |
|----------|-----------|-------------|------------|
| n | 32.000000 | 32.0000000 | 32.0000000 |
| mean | 20.090625 | 146.6875000 | 3.2172500 |
| stdev | 6.026948 | 68.5628685 | 0.9784574 |
| skew | 0.610655 | 0.7260237 | 0.4231465 |
| kurtosis | -0.372766 | -0.1355511 | -0.0227108 |
For cars in this sample, the mean mpg is 20.1, with a standard deviation of 6.0. The distribution is skewed to the right (\(+0.61\)) and is somewhat flatter than a normal distribution (\(-0.37\)). This is most evident if you graph the data.
To omit missing values for the computations, we would use the option na.omit=TRUE.
First, we create a version of mtcars with some missing values:
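The code that created this version was lost in typesetting; something along these lines matches the setup (which rows are set to NA is our guess – the output below shows two missing mpg values):

```r
# two (arbitrarily chosen) mpg entries set to missing; n drops from 32 to 30
mtcars.na <- mtcars
mtcars.na$mpg[c(5, 10)] <- NA

sum(is.na(mtcars.na$mpg))   # 2
```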
| | mpg | hp | wt |
|----------|------------|-------------|------------|
| n | 30.0000000 | 32.0000000 | 32.0000000 |
| mean | 20.2400000 | 146.6875000 | 3.2172500 |
| stdev | 6.1461847 | 68.5628685 | 0.9784574 |
| skew | 0.5660728 | 0.7260237 | 0.4231465 |
| kurtosis | -0.4870340 | -0.1355511 | -0.0227108 |
Notice the difference in the mpg summary.
The same table can be obtained using the dplyr package functions instead (skewness() and kurtosis() are available in the e1071 package).
mpg = dplyr::summarise(mtcars, n=n(), mean=mean(mpg),
        stdev=sd(mpg), skew=e1071::skewness(mpg), kurt=e1071::kurtosis(mpg))
hp = dplyr::summarise(mtcars, n=n(), mean=mean(hp),
        stdev=sd(hp), skew=e1071::skewness(hp), kurt=e1071::kurtosis(hp))
wt = dplyr::summarise(mtcars, n=n(), mean=mean(wt),
        stdev=sd(wt), skew=e1071::skewness(wt), kurt=e1071::kurtosis(wt))
pivot = t(rbind(mpg,hp,wt))
colnames(pivot) <- c("mpg","hp","wt")
| | mpg | hp | wt |
|-------|-----------|-------------|------------|
| n | 32.000000 | 32.0000000 | 32.0000000 |
| mean | 20.090625 | 146.6875000 | 3.2172500 |
| stdev | 6.026948 | 68.5628685 | 0.9784574 |
| skew | 0.610655 | 0.7260237 | 0.4231465 |
| kurt | -0.372766 | -0.1355511 | -0.0227108 |
Hmisc and pastecs
Several packages offer functions for descriptive statistics, including Hmisc and pastecs.
Hmisc’s describe() function returns the number of variables and observations, the number of missing and unique values, the mean, quantiles, and the five highest and lowest values.
mtcars[myvars]
3 Variables 32 Observations

mpg
n missing distinct Info Mean Gmd .05 .10
32 0 25 0.999 20.09 6.796 12.00 14.34
.25 .50 .75 .90 .95
15.43 19.20 22.80 30.09 31.30
lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9

hp
n missing distinct Info Mean Gmd .05 .10
32 0 22 0.997 146.7 77.04 63.65 66.00
.25 .50 .75 .90 .95
96.50 123.00 180.00 243.50 253.55
lowest : 52 62 65 66 91, highest: 215 230 245 264 335

wt
n missing distinct Info Mean Gmd .05 .10
32 0 29 0.999 3.217 1.089 1.736 1.956
.25 .50 .75 .90 .95
2.581 3.325 3.610 4.048 5.293
lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424

The pastecs package includes the function stat.desc() that provides a wide range of descriptive statistics:
> `stat.desc(x, basic=TRUE, desc=TRUE, norm=FALSE, p=0.95)`
where \(x\) is a data frame or a time series. If basic=TRUE (the default), the number of values, null values, missing values, minimum, maximum, range, and sum are provided.
If desc=TRUE (also the default), the median, mean, standard error of the mean, 95% confidence interval for the mean, variance, standard deviation, and coefficient of variation are also provided.
Finally, if norm=TRUE (not the default), normal distribution statistics are returned, including skewness and kurtosis (with statistical significance) and the Shapiro–Wilk test of normality.
The p option is used to calculate the confidence interval for the mean (0.95 by default).
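The table below comes from a call of the form (a sketch, with `myvars` as before):

```r
library(pastecs)

myvars <- c("mpg", "hp", "wt")
stat.desc(mtcars[myvars])   # basic and descriptive statistics, one column per variable
```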
| | mpg | hp | wt |
|--------------|------------|--------------|-------------|
| nbr.val | 32.0000000 | 32.0000000 | 32.0000000 |
| nbr.null | 0.0000000 | 0.0000000 | 0.0000000 |
| nbr.na | 0.0000000 | 0.0000000 | 0.0000000 |
| min | 10.4000000 | 52.0000000 | 1.5130000 |
| max | 33.9000000 | 335.0000000 | 5.4240000 |
| range | 23.5000000 | 283.0000000 | 3.9110000 |
| sum | 642.9000000 | 4694.0000000 | 102.9520000 |
| median | 19.2000000 | 123.0000000 | 3.3250000 |
| mean | 20.0906250 | 146.6875000 | 3.2172500 |
| SE.mean | 1.0654240 | 12.1203173 | 0.1729685 |
| CI.mean.0.95 | 2.1729465 | 24.7195501 | 0.3527715 |
| var | 36.3241028 | 4700.8669355 | 0.9573790 |
| std.dev | 6.0269481 | 68.5628685 | 0.9784574 |
| coef.var | 0.2999881 | 0.4674077 | 0.3041285 |
We will take this opportunity to caution users against relying too heavily on one (or multiple) specific packages.
Correlations
Correlation coefficients are used to describe relationships among quantitative variables. The sign \(\pm\) indicates the direction of the relationship (positive or inverse), and the magnitude indicates the strength of the relationship (ranging from 0 for no linear relationship to 1 for a perfect linear relationship).
In this section, we look at a variety of correlation coefficients, as well as tests of significance. We will use the state.x77 dataset available in the base R installation. It provides data on the population, income, illiteracy rate, life expectancy, murder rate, and high school graduation rate for the 50 US states in 1977. There are also temperature and land-area measures, but we will not be using them. In addition to the base installation, we will be using the psych and ggm packages.
R can produce a variety of correlation coefficients, including Pearson, Spearman, Kendall, partial, polychoric, and polyserial:
the Pearson product-moment coefficient assesses the degree of linear relationship between two quantitative variables;
Spearman’s rank-order coefficient assesses the degree of relationship between two rank-ordered variables,
Kendall’s tau coefficient is a nonparametric measure of rank correlation.
The cor() function produces all three correlation coefficients, whereas the cov() function provides covariances. There are many options, but a simplified format for producing correlations is
> `cor(x, use= , method= )`
where \(x\) is a matrix or a data frame, and use specifies the handling of missing data; its options are:
all.obs (assumes no missing data);
everything (any correlation involving a case with missing values will be set to missing);
complete.obs (listwise deletion), and
pairwise.complete.obs (pairwise deletion).
The method specifies the type of correlation; its options are pearson, spearman, and kendall.
The default options are use="everything" and method="pearson".
For the built-in dataset state.x77, which contains socio-demographic information about the 50 U.S. states from 1977, we find the following correlations:
| | Population | Income | Illiteracy | Life Exp | Murder | HS Grad |
|------------|------------|------------|------------|------------|------------|------------|
| Population | 1.0000000 | 0.2082276 | 0.1076224 | -0.0680520 | 0.3436428 | -0.0984897 |
| Income | 0.2082276 | 1.0000000 | -0.4370752 | 0.3402553 | -0.2300776 | 0.6199323 |
| Illiteracy | 0.1076224 | -0.4370752 | 1.0000000 | -0.5884779 | 0.7029752 | -0.6571886 |
| Life Exp | -0.0680520 | 0.3402553 | -0.5884779 | 1.0000000 | -0.7808458 | 0.5822162 |
| Murder | 0.3436428 | -0.2300776 | 0.7029752 | -0.7808458 | 1.0000000 | -0.4879710 |
| HS Grad | -0.0984897 | 0.6199323 | -0.6571886 | 0.5822162 | -0.4879710 | 1.0000000 |
This produces the Pearson product-moment correlation coefficients. We can see, for example, that a strong positive correlation exists between Income and HS Grad rate and that a strong negative correlation exists between Illiteracy and Life Exp.
A partial correlation is a correlation between two quantitative variables, controlling for one or more other quantitative variables; the pcor() function in the ggm package provides partial correlation coefficients (again, this package is not installed by default, so it must be installed before first use).
The format is
> `pcor(u, S)`
where \(u\) is a vector of integers, with the first two entries representing the indices of the variables to be correlated, and the remaining numbers being the indices of the conditioning variables (that is, the variables being partialled out), and where \(S\) is the covariance matrix among the variables.
colnames(states)
ggm::pcor(c(1,5,2,3,6), cov(states))
ggm::pcor(c(1,5,2,3), cov(states))
ggm::pcor(c(1,5,2), cov(states))
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder"
[6] "HS Grad"
[1] 0.3462724
[1] 0.3621683
[1] 0.4113621
In this case, 0.346 is the correlation between population (variable 1) and murder rate (variable 5), controlling for the influence of income, illiteracy rate, and high school graduation rate (variables 2, 3, and 6 respectively).
The use of partial correlations is common in the social sciences.
Simple Linear Regression
In many ways, regression analysis is at the heart of statistics. It is a broad term for a set of methodologies used to predict a response variable (also called a dependent, criterion, or outcome variable) from one or more predictor variables (also called independent or explanatory variables).
In general, regression analysis can be used to:
identify the explanatory variables that are related to a response variable;
describe the form of the relationships involved, and
provide an equation for predicting the response variable from the explanatory variables.
For example, an exercise physiologist might use regression analysis to develop an equation for predicting the expected number of calories a person will burn while exercising on a treadmill.
In this example, the response variable is the number of calories burned (calculated from the amount of oxygen consumed), say, and the predictor variables might include:
duration of exercise (minutes);
percentage of time spent at their target heart rate;
average speed (mph);
age (years);
gender, and
body mass index (BMI).
From a practical point of view, regression analysis would help answer questions such as:
How many calories can a 30-year-old man with a BMI of 28.7 expect to burn if he walks for 45 minutes at an average speed of 4 miles per hour and stays within his target heart rate 80% of the time?
What’s the minimum number of variables needed in order to accurately predict the number of calories a person will burn when walking?
R has powerful and comprehensive features for fitting regression models – the abundance of options can be confusing.
The basic function for fitting a linear model is lm(). The format is

> `myfit <- lm(formula, data)`

where formula describes the model to be fit and data is the data frame containing the data to be used in fitting the model. The resulting object (myfit, in this case) is a list that contains extensive information about the fitted model.
The formula is typically written as \[Y \sim X_{1}+X_{2}+\cdots+X_{k}\] where the \(\sim\) separates the response variable on the left from the predictor variables on the right, and the predictor variables are separated by \(+\) signs.
In addition to lm(), there are several functions that are useful when generating regression models.
| Function | Action |
|----------------|--------|
| summary() | Displays detailed results for the fitted model |
| coefficients() | Lists the model parameters (intercept and slopes) for the fitted model |
| confint() | Provides confidence intervals for the model parameters (95% by default) |
| residuals() | Lists the residual values in a fitted model |
| anova() | Generates an ANOVA table for a fitted model, or an ANOVA table comparing two or more fitted models |
| plot() | Generates diagnostic plots for evaluating the fit of a model |
| fitted() | Extracts the fitted values for the dataset |
| predict() | Uses a fitted model to predict response values for a new dataset |
Each of these functions is applied to the object returned by lm() in order to generate additional information based on the fitted model.
Example: the women dataset in the base installation provides the heights and weights for a set of 15 women aged 30 to 39. Assume that we are interested in predicting the weight of an individual from her height.^{133}
The linear regression on the data is obtained as follows:
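The output reproduced below comes from a call of the form (the object name `fit` is ours):

```r
# fit the simple linear model weight ~ height on the built-in women dataset
fit <- lm(weight ~ height, data = women)
summary(fit)
```

fitted(fit) and residuals(fit) then produce the fitted values and residuals listed further down.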
Call:
lm(formula = weight ~ height, data = women)
Residuals:
    Min      1Q  Median      3Q     Max
-1.7333 -1.1333 -0.3833  0.7417  3.1167

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991,  Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF,  p-value: 1.091e-14
From the output, you see that the prediction equation is \[\widehat{\text{Weight}} = -87.52 + 3.45\times \text{Height}.\]
Because a height of 0 is impossible, there is no sense in trying to give a physical interpretation to the intercept – it merely becomes an adjustment constant (in other words, \(0\) is not in the domain of the model).
From the Pr(>|t|) column, we see that the regression coefficient (3.45) is significantly different from zero (\(p < 0.001\)), which indicates that there is an expected increase of 3.45 pounds of weight for every 1-inch increase in height. The multiple R-squared coefficient (0.991) indicates that the model accounts for 99.1% of the variance in weights.
The individual weights (in pounds) are:
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
and their fitted values (and residuals) are
1 2 3 4 5 6 7 8
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333
9 10 11 12 13 14 15
140.1833 143.6333 147.0833 150.5333 153.9833 157.4333 160.8833
          1           2           3           4           5           6
 2.41666667  0.96666667  0.51666667  0.06666667 -0.38333333 -0.83333333
          7           8           9          10          11          12
-1.28333333 -1.73333333 -1.18333333 -1.63333333 -1.08333333 -0.53333333
         13          14          15
 0.01666667  1.56666667  3.11666667
Bootstrapping
Bootstrapping is a powerful and elegant approach to estimating the sampling distribution of specific statistics. It can be implemented in many situations where asymptotic results are difficult to find or otherwise unsatisfactory.
Bootstrapping proceeds using three steps:
resample the dataset (with replacement) many times over (typically on the order of 10,000);
calculate the desired statistic from each resampled dataset,
use the distribution of the resampled statistics to estimate the standard error of the statistic (normal approximation method) or construct a confidence interval using quantiles of that distribution (percentile method).
There are several ways to bootstrap in R. As an example, say that we want to estimate the standard error and 95% confidence interval for the coefficient of variation (CV), defined as \(\sigma/\mu\), for a random variable \(X\).
We will illustrate the procedure with generated values of \(X\sim \mathcal{N}(1,1)\):
[1] 1.014057
The user must provide code to calculate the statistic of interest as a function.
The replicate() function is the base R tool for repeating function calls. Within that function, we nest a call to cvfun() and a call to sample the data with replacement using the sample() function.
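Assembling the pieces, a minimal sketch (the sample size, number of resamples, seed, and names other than cvfun(), replicate(), and sample() are our assumptions):

```r
set.seed(1)                          # for reproducibility; the seed is our choice

# generated values of X ~ N(1, 1)
x <- rnorm(1000, mean = 1, sd = 1)

# statistic of interest: the coefficient of variation sd/mean
cvfun <- function(v) sd(v) / mean(v)
cvfun(x)

# steps 1 & 2: resample with replacement, recompute the CV each time
boot.cv <- replicate(10000, cvfun(sample(x, replace = TRUE)))

# step 3: standard error (normal approximation) and percentile interval
sd(boot.cv)
quantile(boot.cv, c(0.025, 0.975))
```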
We can also compute quantiles, as below:
2.5% 97.5%
0.9432266 1.0917185
This seems reasonable, as we would expect the CVs to be centered around 1.
The percentile interval is easy to calculate from the observed bootstrapped statistics. If the distribution of the bootstrap samples is approximately normally distributed, a \(t\) interval could be created by calculating the standard deviation of the bootstrap samples and finding the appropriate multiplier for the confidence interval. Plotting the bootstrap sample estimates is helpful to determine the form of the bootstrap distribution.
The framework can also be extended to include nonlinear models, correlated variables, probability estimation, and/or multivariate models; any book on statistical analysis contains at least one chapter or two on the topic (see [32], [144], for instance).
We will not pursue the topic further except to say that regression analysis is one of the arrows that every data scientist should have in their quiver.
7.5.5 Quantitative Methods
We provided a list of quantitative methods in Data Collection, Storage, Processing, and Modeling; we finish this section by expanding on a few of them.
Classification and Supervised Learning Tasks
Classification is one of the cornerstones of machine learning. Instead of trying to predict the numerical value of a response variable (as in regression), a classifier uses historical data [This training data usually consists of a randomly selected subset of the labeled (response) data.] to identify general patterns that could lead to observations belonging to one of several predefined categories.
For instance, if a car insurance company only has resources to investigate up to 20% of all filed claims, it could be useful for them to predict:
whether a claim is likely to be fraudulent;
whether a customer is likely to commit fraud in the near future;
whether an application for a policy is likely to result in a fraudulent claim,
the amount by which a claim will be reduced if it is fraudulent, etc.
Analysts and machine learning practitioners use a variety of different techniques to carry this process out (see Figure 7.16 for an illustration, and Machine Learning 101 and [2], [5], [6], in general, for more details), but the general steps always remain the same:
use training data to teach the classifier;
test/validate the classifier using holdout data,
if it passes the test, use the classifier to classify novel instances.
Some classifiers (such as deep learning neural nets) are ‘black boxes’: they might be very good at classification, but they are not explainable.
In some instances, that is an acceptable side effect of the process, in others, it might not be – if an individual is refused refugee status, say, they might rightly want to know why.
Unsupervised Learning Techniques
The hope of artificial intelligence is that intelligent behaviours will eventually be able to be automated. For the time being, however, that is still very much a work in progress.
But one of the challenges in that process is that not every intelligent behaviour arises from a supervised process.
Classification, for instance, is the prototypical supervised task: can we learn from historical/training examples? It seems like a decent approach to learning: evidence should drive the process.
But there are limitations to such an approach: it is difficult to make a conceptual leap solely on the basis of training data [if our experience in learning is anything to go by…], if only because the training data might not be representative of the system, or because the learner’s target task is too narrow.
In unsupervised learning, we learn without examples, based solely on what is found in the data. There is no specific question to answer (in the classification sense), other than “what can we learn from the data?”
Typical unsupervised learning tasks include:
clustering (novel categories);
association rules mining,
recommender systems, etc.
For instance, an online bookstore might want to make recommendations to customers concerning additional items to browse (and hopefully purchase) based on their buying patterns in prior transactions, the similarity between books, and the similarity between customer segments:
But what are those patterns?
How do we measure similarity?
What are the customer segments?
Can any of that information be used to create promotional bundles?
The lack of a specific target makes unsupervised learning much more difficult than supervised learning, as do the challenges of validating the results. This contributes to the proliferation of clustering algorithms and cluster quality metrics.
More general information and details on clustering can be found in Machine Learning 101 and in [4], [5], [145].
Other Machine Learning Tasks
These tasks barely scratch the surface of the machine learning ecosystem. Other common tasks include [137]:
profiling and behaviour description;
link prediction;
data reduction;
influence/causal modeling, etc.
to say nothing of more sophisticated learning frameworks (semi-supervised, reinforcement [146], deep learning [147], etc.).
Time Series Analysis and Process Monitoring
Processes are often subject to variability:
variability due to the cumulative effect of many small, essentially unavoidable causes (a process that operates with only such common causes is said to be in (statistical) control);
variability due to special causes, such as improperly adjusted machines, poorly trained operators, defective materials, etc. (the variability is typically much larger for special causes, and such processes are said to be out of (statistical) control).
The aim of statistical process monitoring (SPM) is to identify the occurrence of special causes. This is often done via time series analysis.
Consider \(n\) observations \(\{x_1,\ldots,x_n\}\) arising from some collection of processes. In practice, the index \(i\) is often a time index or a location index, i.e., the \(x_i\) are observed in sequence or in regions.^{134}
The processes that generate the observations could change from one time/location to the next due to:
external factors (war, pandemic, regime change, election results, etc.), or
internal factors (policy changes, modification of manufacturing process, etc.).
In such cases, the mean and standard deviation alone might not provide a useful summary of the situation.
To get a sense of what is going on with the data (and the associated system), it could prove preferable to plot the data in the order that it has been collected (or according to geographical regions, or both).
The horizontal coordinate would then represent:
the time of collection \(t\) (order, day, week, quarter, year, etc.), or
the location \(i\) (country, province, city, branch, etc.).
The vertical coordinate represents the observations of interest \(x_t\) or \(x_i\) (see Figure @(fig:stockprices) for an example).
In process monitoring terms, we may be able to identify potential special causes by identifying trend breaks, cycle discontinuities, or level shifts in time series.
For instance, consider the three time series of Figure 7.18.
Is any action required?
in the first example (left), there are occasional drops in sales from one year to the next, but the upward trend is clear – we see the importance of considering the full time series; if only the last two points are presented to stockholders, say, they might conclude that action is needed, whereas the whole series paints a more positive outlook;
in the second case (middle), there is a cyclic effect, with increases from Q1 to Q2 and from Q2 to Q3, but decreases from Q3 to Q4 and from Q4 to Q1. Overall, we also see an upward trend – the presence of regular patterns is a positive development;
finally, in the last example (right), something clearly happened after the tenth week, causing a level shift in the trend. Whether it is due to internal or external factors depends on the context, which we do not have at our disposal, but some action certainly seems to be needed.
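A level shift of that kind can often be flagged with a Shewhart-style control chart: estimate the process mean and standard deviation from an in-control baseline period, and flag observations falling outside the \(\mu \pm 3\sigma\) limits. The sketch below uses invented weekly data with a shift planted after week 10.

```python
import numpy as np

# Invented weekly measurements: in control around 10 for the first
# ten weeks, then a planted level shift up to 13.
rng = np.random.default_rng(1)
series = np.concatenate([
    10 + 0.5 * rng.standard_normal(10),
    13 + 0.5 * rng.standard_normal(10),
])

# Control limits estimated from the in-control baseline period.
baseline = series[:10]
mu, sigma = baseline.mean(), baseline.std(ddof=1)
lower, upper = mu - 3 * sigma, mu + 3 * sigma

# Observations outside the limits point to potential special causes.
out_of_control = np.where((series < lower) | (series > upper))[0]
```

Here the flagged indices fall after the tenth week, consistent with the planted shift; in practice, choosing the baseline period (and trusting that it really is in control) is itself part of the modelling problem.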
We might also be interested in using historical data to forecast the future behaviour of the variable. This is similar to the familiar analysis goals of:
finding patterns in the data, and
creating a (mathematical) model that captures the essence of these patterns.
Time series patterns can be quite complex and must often be broken down into multiple component models (trend, seasonal, irregular, etc.).
Typically, this decomposition is achieved with dedicated methods (moving averages, exponential smoothing, seasonal-trend decomposition, etc.), but it is not a simple topic, in general. Thankfully, there are software libraries that can help.
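As a small illustration of such a decomposition, the sketch below recovers the trend of an invented quarterly series with a centred moving average, and estimates the seasonal component from the quarterly means of the detrended values – a crude, by-hand version of what libraries such as statsmodels automate.

```python
import numpy as np

# Invented quarterly series: linear trend + period-4 seasonal
# cycle + noise.
rng = np.random.default_rng(2)
t = np.arange(40)
series = (0.5 * t + 3 * np.sin(2 * np.pi * t / 4)
          + 0.3 * rng.standard_normal(40))

period = 4
# Centred "2x4" moving average: it spans exactly one full seasonal
# cycle, so the seasonal component cancels and the trend remains.
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = np.convolve(series, weights, mode="valid")  # loses 2 points per end

# Seasonal component: average the detrended values quarter by quarter.
detrended = series[period // 2 : -(period // 2)] - trend
quarters = (np.arange(len(detrended)) + period // 2) % period
seasonal = np.array([detrended[quarters == q].mean()
                     for q in range(period)])
```

The estimated seasonal effects come out close to the planted ones (roughly \(+3\) for quarter 1 and \(-3\) for quarter 3, with quarters indexed from 0); the irregular component is whatever remains after subtracting trend and seasonality.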
Anomaly Detection
The special points from process monitoring are anomalous in the sense that something unexpected happens there, something that changes the nature of the data pre- and post-break.
In a more general context, anomalous observations are those that are atypical or unlikely.
From an analytical perspective, anomaly detection can be approached using supervised, unsupervised, or conventional statistical methods.
The discipline is rich and vibrant (and the search for anomalies can end up being an arms race against the “bad guys”), but it is definitely one for which analysts should heed contextual understanding – blind analysis leads to blind alleys! A more thorough treatment is provided in [148].
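As a small statistical example, a common approach flags observations whose robust z-score exceeds a threshold; the median and median absolute deviation (MAD) are used instead of the mean and standard deviation because they are themselves barely affected by the anomalies. The readings below are invented.

```python
import numpy as np

# Invented sensor readings with two planted anomalies (25.0 and -3.0).
readings = np.array([9.8, 10.1, 10.0, 9.9, 25.0,
                     10.2, 9.7, 10.0, -3.0, 10.1])

median = np.median(readings)
mad = np.median(np.abs(readings - median))

# 0.6745 rescales the MAD to a normal-distribution sigma, so the
# usual z-score thresholds remain interpretable.
robust_z = 0.6745 * (readings - median) / mad

anomalies = np.where(np.abs(robust_z) > 3.5)[0]
print(anomalies)  # → [4 8]
```

Both planted anomalies are recovered; a mean-and-standard-deviation version of the same test could miss them, since the anomalies inflate the standard deviation used to judge them.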
There is a lot more to say on the topic of data analysis – we will delve into various topics in detail in subsequent modules.