7.5 Getting Insight From Data

With all of the appropriate context now in mind, we can finally turn to the main attraction, data analysis proper. Let us start this section with a few definitions, in order to distinguish between some of the common categories of data analysis.

What is Data Analysis?

We view finding patterns in data as being data analysis’s main goal. Alternatively, we could describe the data analysis process as using data to:

  • answer specific questions;

  • help in the decision-making process;

  • create models of the data;

  • describe or explain the situation or system under investigation,

  • etc.

While some practitioners include other analytical-like activities, such as testing (scientific) hypotheses, or carrying out calculations on data, we think of those as separate activities.

What is Data Science?

One of the challenges of working in the data science field is that nearly all quantitative work can be described as data science (often to a ridiculous extent).

Our simple definition paraphrases T. Kwartler: data science is the collection of processes by which we extract useful and actionable insights from data. Robinson [135] further suggests that these insights usually come via visualization and (manual) inferential analysis.

The noted data scientist H. Mason thinks of the discipline as “the working intersection of statistics, engineering, computer science, domain expertise, and ‘hacking’” [136].

What is Machine Learning?

Starting in the 1940s, researchers began to take seriously the idea that machines could be taught to learn, adapt and respond to novel situations.

A wide variety of techniques, accompanied by a great deal of theoretical underpinning, were created in an effort to achieve this goal.

Machine learning is typically used to obtain “predictions” (or “advice”), while reducing the operator’s analytical, inferential and decisional workload (although it is still present to some extent) [135].

What is Artificial/Augmented Intelligence?

The science fiction answer is that artificial intelligence is non-human intelligence that has been engineered rather than one that has evolved naturally. Practically speaking, this translates to “computers carrying out tasks that only humans can do”.

A.I. attempts to remove the need for oversight, allowing for automatic “actions” to be taken by a completely unattended system.

These goals are laudable in an academic setting, but we believe that stakeholders (and humans, in general) should not seek to abdicate all of their agency in the decision-making process. As such, we follow the lead of various thinkers and suggest further splitting A.I. into general A.I. (which would operate independently of human intelligence) and augmented intelligence (which enhances human intelligence).


These approaches can be further broken down into four key buckets (see Figure 7.9), moving roughly from low-value/low-difficulty propositions (left) to high-value/high-difficulty propositions (right).


Figure 7.9: Analysis/data science buckets [Marwan Kashef, datascience2go].

For instance, a shoe store could conduct the following analyses:

  • Descriptive – a sales report;

  • Diagnostic – why did sales take a large dip?

  • Predictive – what is the sales forecast for next quarter?

  • Prescriptive – how should we change the product mix to reach our target sales goal?

7.5.1 Asking the Right Questions

Definitions aside, however, data analysis, data science, machine learning, and artificial intelligence are about asking questions and providing answers to these questions. We might ask various types of questions, depending on the situation.

Our position is that, from a quantitative perspective, there are only really three types of questions:

  • analytics questions;

  • data science questions, and

  • quantitative methods questions.

Analytics questions could be something as simple as:

how many clicks did a specific link on my website get?

Data science questions tend to be more complex – we might ask something along the lines of:

if we know, historically, when or how often people click on links, can we predict how many people from Winnipeg will access a specific page on our website within the next three hours?

Whereas analytics-type questions are typically answered by counting things, data science-like questions are answered by using historical patterns to make predictions.
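As a tiny illustration of the “counting” flavour of analytics, the sketch below assumes a hypothetical clicks data frame with one row per recorded click; answering the analytics question then amounts to filtering and counting rows.

# hypothetical log of link clicks, one row per click
clicks <- data.frame(
  link = c("home", "pricing", "pricing", "blog", "pricing"),
  timestamp = as.POSIXct("2023-06-01 10:00:00") + c(0, 60, 300, 3600, 7200)
)

# "how many clicks did the 'pricing' link get?"
clicks |>
  dplyr::filter(link == "pricing") |>
  dplyr::summarise(n.clicks = dplyr::n())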

Quantitative methods questions might, in our view, be answered by making predictions but not necessarily based on historical data. We could build a model from first principles – the “physics” of the situation, as it were – to attempt to figure out what might happen.

For instance, if we thought there was a correlation between the temperature in Winnipeg and whether or not people click on the links in our website, then we might build a model that predicts “how many people from Winnipeg will access a page in the next week?”, say, by trying to predict the weather instead,126 which is not necessarily an easy task.

Analytics models do not usually predict or explain anything – they just report on the data, which is itself meant to represent the situation. A data mining or a data science model tends to be predictive, but not necessarily explanatory – it shows the existence of connections, of correlations, of links, but without explaining why the connections exist.

In a quantitative method model, we may start by assuming that we know what the links are, what the connections are – which presumably means that we have an idea as to why these connections exist127 – and then we try to explore the consequences of the existence of these connections and these links.

This leads to a singular realization that we share with new data scientists and analysts (potentially the single most important piece of advice they will receive in their quantitative career, and we are only half-joking when we say it):

not every situation calls for analytics, data science, statistical analysis, quantitative methods, machine learning, or A.I.

Take the time to identify instances where more is asked out of the data than what it can actually yield, and be prepared to warn stakeholders, as early as possible, when such a situation is encountered.

If we cannot ask the right questions of the data, of the client, of the situation, and so on, any associated project is doomed to fail from the very beginning. Without questions to answer, analysts are wasting their time, running analyses for the sake of analysis – the finish line cannot be reached if there is no finish line.

In order to help clients/stakeholders, data analysts and scientists need:

  • questions to answer;

  • questions that can be answered by the types of methods and skills at their disposal, and

  • answers that will be recognized as answers.

“How many clicks did this link get?” is a question that is easily answerable if we have a dataset of links and clicks, but it might not be a question that the client cares to see answered.

Data analysts and scientists often find themselves in a situation where they will ask the types of questions that can be answered with the available data, but the answers might not prove actually useful.

From a data science perspective, the right question is one that leads to actionable insights. And it might mean that old data is discarded and new data is collected in order to answer it. Analysts should beware: given the sometimes onerous price tag associated with data collection, it is not altogether surprising that there will sometimes be pressure from above to keep working with the available data.

Stay strong – analysis on the wrong dataset is the wrong analysis!

The Wrong Questions

Wrong questions might be:

  • questions that are too broad or too narrow;

  • questions that no amount of data could ever answer,

  • questions for which data cannot reasonably be obtained, etc.

One of the issues with “wrong” questions is that they do not necessarily “break the pipeline”:

  • in the best-case scenario, stakeholders, clients, colleagues will still recognize the answers as irrelevant.

  • in the worst-case scenario, policies will erroneously be implemented (or decisions made) on the basis of answers that have not been identified as misleading and/or useless.

Framing Questions

In general, data science questions are used to:

  • solve problems (fix pressing issues, understand why something is or isn’t happening, etc.);

  • create meaningful change (create new standards in the company, etc.),

  • support gut feelings (confirm or disprove blind intuition).

One thing to note is that individuals tend to answer questions quickly, especially in their area of expertise. It is also strongly suggested that analysts avoid glancing at the data before they settle on the question(s), so as not to “beg the question”. Finally, note that just as we can be blinded by love, we can also be blinded by solutions: the right solution to the right question is not necessarily the “sexiest” one.

The website kdnuggets.com suggests the following roadmap to framing questions:

  • Understand the problem (opportunity vs problem)

  • What initial assumptions do I have about the situation?

  • How will the results be used?

  • What are the risks and/or benefits of answering this question?

  • What stakeholder questions might arise based on the answer(s)?

  • Do I have access to the data necessary for answering this question?

  • How will I measure my “success” criteria?

Example: Should I buy a house? But this is a bit vague; perhaps, instead, the question could be: should I buy a single house in Scotland? [based on an example by M. Kashef]

Answer: Let’s use the roadmap.

  • Understand the problem. I’ve been renting for two years and feel like I’m throwing my money away. I want a chance to invest in my own space instead of someone else’s.

  • What initial assumptions do I have about the situation? It’s going to be expensive but worth it – it’ll be an investment that appreciates over time.

  • How will the results be used? Either to buy a house or rent a bit longer to save more for a larger down payment.

  • What are the risks and/or benefits of answering this question? Risk: I could put myself under immense debt and become “house poor”. Benefits: I could get into the market just in time to make a fortune, and I won’t have to live under the uncertainty from my landlord possibly selling his home.

  • What stakeholder questions might arise based on the answer(s)? Would this new home be in an area that’s safe for kids? Will it be close to my workplace?

  • Do I have access to the data necessary to answer this question? Yes, through my real estate agent and online real estate brokerages, I can keep my finger on the pulse of the market.

  • How will I measure my “success” criteria? If I manage to buy a forever home within my $600k budget, say.

Additional Considerations

Specific questions are preferred over vague questions; questions that encourage qualification/quantification are preferred over Yes/No questions.

Here are some examples of Yes/No (and other leading) questions, which should be avoided [Healthy Families BC]:

  • Is our revenue increasing over time? Has it increased year-over-year?

  • Are most of our customers from this demographic?

  • Does this project have valuable ambitions for the broader department?

  • How great is our hard-working customer success team?

  • How often do you triple check your work?

Consider using the following questions, instead:

  • What’s the distribution of our revenues over the past three months?

  • Where are our top 5 high-spending cohorts from?

  • What are the different benefits of pursuing this project?

  • What are three good and bad traits of our customer success team?

  • What kind of quality assurance testing do you carry out on your deliverables?

Question Audit Checklist [The Head Game]:

  1. Did I avoid creating any yes/no questions?

  2. Would anyone in my team/department understand the question irrespective of their backgrounds?

  3. Does the question need more than one sentence to express?

  4. Is the question ‘balanced’, i.e., is its scope neither so broad that the question can never truly be answered, nor so narrow that the resulting impact would be minimal?

  5. Is the question being skewed toward what is easier to answer given my/my team’s particular skillset(s)?

Exercises

Are the following examples of good questions? Are they vague or specific? What are the ranges of answers we could expect? How would you improve them?

  1. How does rain affect goal percentage at a soccer match?

  2. Did the Toronto Maple Leafs beat the Edmonton Oilers?

  3. Did you like watching the Tokyo Olympics?

  4. What types of recovery drinks do hockey players drink?

  5. How many medals will Canada win at the Paris 2024 Olympics?

  6. Should we fund the Canadian Basketball team more than the Canadian Hockey team?

7.5.2 Structuring and Organizing Data

Let us now resume the discussion that was cut short in What Is Data? and From Objects and Attributes to Datasets.

Data Sources

We cannot have insights from data without data. As with many of the points we have made, this may seem trivially obvious, but there are many aspects of data acquisition, structuring, and organization that have a sizable impact on what insights can be squeezed from data.

More specifically, there are a number of questions that can be considered:

  • why do we collect data?

  • what can we do with data?

  • where does data come from?

  • what does “a collection” of data look like?

  • how can we describe data?

  • do we need to distinguish between data, information, knowledge?128

 
Historically, data has had three functions:

  • record keeping – people/societal management;

  • science – new general knowledge, and

  • intelligence – business, military, police, social, domestic, personal.

Traditionally, each of these functions has

  • used different sources of information;

  • collected different types of data, and

  • had different data cultures and terminologies.

As data science is an interdisciplinary field, it should come as no surprise that we may run into all of them on the same project (see Figure 7.10).


Figure 7.10: Different data cultures and terms.

Ultimately, data is generated from making observations about and taking measurements of the world. In the process of doing so, we are already imposing particular conceptualizations and assumptions on our raw experience.

More concretely, data comes from a variety of sources, including:

  • records of activity;

  • (scientific) observations;

  • sensors and monitoring, and

  • more frequently lately, from computers themselves.

As discussed in Section The Analog/Digital Data Dichotomy, although data may be collected and recorded by hand, it is fast becoming a mostly digital phenomenon.

Computer science (and information science) has its own theoretical viewpoint about data and information, operating over data at the most fundamental level – 1s and 0s that represent numbers, letters, etc. Pragmatically, the resulting data is now stored on computers, and is accessible through our world-wide computer network.

While data is necessarily a representation of something else, analysts should endeavour to remember that the data itself still has physical properties: it takes up physical space and requires energy to work with.

In keeping with this physical nature, data also has a shelf life – it ages over time. We use the phrase “rotten data” or “decaying data” in one of two senses:

  • literally, as the data storage medium might decay, but also

  • metaphorically, as when it no longer accurately represents the relevant objects and relationships (or even when those objects no longer exist in the same way) – compare with “analytical decay” (see Model Assessment and Life After Analysis).

Useful data must stay ‘fresh’ and ‘current’, and avoid going ‘stale’ – but that is both context- and model-dependent!

Before the Data

The various data-using disciplines share some core (systems) concepts and elements, which should resonate with the systems modeling framework previously discussed in Conceptual Frameworks for Data Work:

  • all objects have attributes, whether concrete or abstract;

  • for multiple objects, there are relationships between these objects/attributes, and

  • all these elements evolve over time.

The fundamental relationships include:

  • part–whole;

  • is–a;

  • is–a–type–of;

  • cardinality (one-to-one, one-to-many, many-to-many),

  • etc.,

while object-specific relationships include:

  • ownership;

  • social relationship;

  • becomes;

  • leads-to,

  • etc.

Objects and Attributes

We can examine concretely the ways in which objects have properties, relationships and behaviours, and how these are captured and turned into data through observations and measurements, via the apple and sandwich example of What Is Data?.

There, we made observations of an apple instance, labeled the type of observation we made, and provided a value describing the observation. We can further use these labels when observing other apple instances, and associate new values for these new apple instances.

Regarding the fundamental and object-specific relationships, we might be able to see that:

  • an apple is a type of fruit;

  • a sandwich is part of a meal;

  • this apple is owned by Jen;

  • this sandwich becomes fuel,

  • etc.

It is worth noting that while this all seems tediously obvious to adult humans, it is not so from the perspective of a toddler, or an artificial intelligence. Explicitly, “understanding” requires a basic grasp of:

  • categories;

  • instances;

  • types of attributes;

  • values of attributes, and

  • which of these are important or relevant to a specific situation or in general terms.

From Attributes to Datasets

Were we to run around in an apple orchard, measuring and jotting down the height, width and colour of 83 different apples completely haphazardly on a piece of paper, the resulting data would be of limited value; although it would technically have been recorded, it would be lacking in structure.

We would not be able to tell which values were heights and which were widths, and which colours or which widths were associated with which heights, and vice-versa. Structuring the data using lists, tables, or even tree structures allows analysts to record and preserve a number of important relationships:

  • those between object types and instances, property/attribute types (sometimes also called fields, features or dimensions), and values;

  • those between one attribute value and another value (i.e., both of these values are connected to this object instance);

  • those between attribute types, in the case of hierarchical data, and

  • those between the objects themselves (e.g., this car is owned by this person).

Tables, also called flat files, are likely the most familiar strategy for structuring data in order to preserve and indicate relationships. In the digital age, however, we are developing increasingly sophisticated strategies to store the structure of relationships in the data, and finding new ways to work with these increasingly complex relationship structures.

Formally, a data model is an abstract (logical) description of both the dataset structure and the system, constructed in terms that can be implemented in data management software. In a sense, data models lie halfway between conceptual models and database implementations. The data proper relates to instances; the model, to object types.

Ontologies provide an alternative representation of the system: simply put, they are structured, machine-readable collections of facts about a domain.129 In a sense, an ontology is an attempt to get closer to the level of detail of a full conceptual model, while keeping the whole machine-readable (see Figure 7.11 for an example).


Figure 7.11: Representation of Langerhans cells in the Cell Ontology [138].

Every time we move from a conceptual model to a specific type of model (a data model, a knowledge model), we lose some information. One way to preserve as much context as possible in these new models is to also provide rich metadata – data about the data! Metadata is crucial when it comes to successfully working with and across datasets. Ontologies can also play a role here, but that is a topic for another day.

Typically data is stored in a database. A major motivator for some of the new developments in types of databases and other data storing strategies is the increasing availability of unstructured and (so-called) ‘BLOB’ data.

  • Structured data is labeled, organized, and discrete, with a pre-defined and constrained form. With that definition, for instance, data that is collected via an e-form that only uses drop-down menus is structured.

  • Unstructured data, by comparison, is not organized, and does not appear in a specific pre-defined data structure – the classical example is text in a document. The text may have to conform to specific syntactic and semantic rules to be understandable, but in terms of storage (where spelling mistakes and meaning are irrelevant), it is highly unstructured since any data entry is likely to be completely different from another one in terms of length, etc.

  • The acronym “BLOB” stands for Binary Large Object data, such as images, audio files, or general multi-media files. Some of these files can be structured-like (all pictures taken from a single camera, say), but they are usually quite unstructured, especially in multi-media modes.

Not every type of database is well-suited to all data types. Let us look at four currently popular database options in terms of fundamental data and knowledge modeling and structuring strategies:

  • key-value pairs (e.g. JSON);

  • triples (e.g. the Resource Description Framework – RDF);

  • graph databases, and

  • relational databases.

Key-Value Stores

In these, all data is simply stored as a giant list of keys and values, where the ‘key’ is a name or a label (possibly of an object) and the ‘value’ is a value associated with this key; triple stores operate on the same principle, but data is stored according to ‘subject – predicate – object’.

The following examples illustrate these concepts:

  1. The apple type – apple colour key-value store might contain

    • Granny Smith -- green, and

    • Red Delicious -- red.

  2. The person – shoe size key-value store might contain

    • Jen Schellinck -- women's size 7, and

    • Colin Henein -- men's size 10.

  3. Other key-value stores: word – definition, report name – report (document file), url – webpage.

  4. Triple stores just add a verb to the mix: person – is – age might contain

    • Elowyn -- is -- 18;

    • Llewellyn -- is -- 8, and

    • Gwynneth -- is -- 4;

    while object – is-colour – colour might contain

    • apple -- is-colour -- red, and

    • apple -- is-colour -- green.

Both strategies result in a great deal of flexibility when it comes to the ‘design’ of the data storage, and not much needs to be known about the data structure prior to implementation. Additionally, missing values do not take up any space in such stores.

In terms of their implementation, the devil is in the details; note that their extreme flexibility can also be a flaw [139], and it can be difficult to query them and find the data of interest.
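A minimal sketch of the key-value idea, using nothing more than a named list in R (the names play the role of keys); the jsonlite package, assumed to be installed, shows how naturally such a store serializes to JSON.

# a tiny key-value store: names are the keys, elements are the values
apple.colours <- list("Granny Smith" = "green", "Red Delicious" = "red")

apple.colours[["Granny Smith"]]               # look up a value by its key

# key-value data serializes naturally to JSON
jsonlite::toJSON(apple.colours, auto_unbox = TRUE)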

Graph Databases

In graph databases, the emphasis is placed on the relationships between different types of objects, rather than between an object and the properties of that object:

  • the objects are represented by nodes;

  • the relationships between these objects are represented by edges, and

  • objects can have a relationship with other objects of the same type (such as person is-a-sibling-of person).

They are fast and intuitive when working with relation-based data, and might in fact be the only reasonable option in that case, as traditional databases may slow to a crawl. But they are probably too specialized for non-relation-based data, and they are not yet widely supported.
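The node-and-edge structure itself is easy to experiment with in R via the igraph package (assumed to be installed); a true graph database such as Neo4j adds persistent storage, indexing, and a query language on top of this kind of structure. A minimal sketch, reusing the apple and sandwich relationships:

library(igraph)

# objects as nodes, relationships as labelled edges
relations <- data.frame(
  from = c("Jen",   "apple",        "sandwich"),
  to   = c("apple", "fruit",        "meal"),
  type = c("owns",  "is-a-type-of", "is-part-of")
)

g <- graph_from_data_frame(relations, directed = TRUE)
E(g)$type                            # the relationship labels
neighbors(g, "Jen", mode = "out")    # objects to which Jen is related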

Relational Databases

In relational databases, data is stored in a series of tables. Broadly speaking, each table represents a type of object and some properties related to this type of object; special columns in tables connect object instances across tables (the entity-relationship model diagram (ERD) of Figure 7.4 is an example of a relational database model).

For instance, a person lives in a house, which has a particular address. Sometimes that property of the house will be stored in the table that stores information about individuals; in other cases, it will make more sense to store information about the house in its own table.

The form of a relational database is driven by the cardinality of its relationships (one-to-one, one-to-many, or many-to-many). These concepts are illustrated in the cheat sheet found in Figure 7.12.


Figure 7.12: Entity-relationship model diagram (so-called ‘crow’s foot’) relationship symbols cheat sheet [140].

Relational databases are widely supported and well understood, and they work well for many types of systems and use cases. Note, however, that it is difficult to modify them once they have been implemented and that, despite their name, they do not really handle relationships all that well.

Spreadsheets

We have said very little about keeping data in a single giant table (spreadsheet, flatfile), or multiple spreadsheets (we purposely kept it out of the original list of modeling and structuring strategies).

On the positive side, spreadsheets are very efficient when working with:

  • static data (e.g., it is only collected once), or

  • data about one particular type of object (e.g., scientific studies).

Most implementations of analytical algorithms require the data to be found in one location (such as an R data frame). Since the data will eventually need to be exported to a flatfile anyway, why not remove the middle step and work with spreadsheets in the first place?

The problem is that it is hard to manage data integrity with spreadsheets over the long term when data is collected (and processed) continuously. Furthermore, flatfiles are not ideal when working with systems involving many different types of objects and their relationships, and they are not optimized for querying operations.

For small datasets or quick-and-dirty work, flatfiles are often a reasonable option, but analysts should look for alternatives when working on large scale projects.

All in all, we have provided very little in the way of concrete information on the topic of databases and data stores. Be aware that, time and time again, projects have sunk when this aspect of the process has not been taken seriously. Simply put, serious analyses cannot be conducted properly without the right data infrastructure.

Implementing a Model

In order to implement the data/knowledge model, data engineers and database specialists need access to data storage and management software. Gaining this access might be challenging for individuals or small teams as the required software traditionally runs on servers.

A server allows multiple users to access the database simultaneously, from different client programs. The other side of the coin is that servers make it difficult to ‘play’ with the database.

User-friendly embedded database software (as opposed to client-server database engines) such as SQLite can help overcome some of these obstacles. Data management software lets human agents interact easily with their data – in a nutshell, it provides a human–data interface, through which

  • data can be added to a data collection;

  • subsets can be extracted from a data collection based on certain filters/criteria, and

  • data can be deleted from (or edited in) a data collection.

But tempora mutantur, nos et mutamur in illis130 – whereas we used to speak of:

  • databases and database management systems;

  • data warehouses (data management system designed to enable analytics);

  • data marts (used to retrieve client-facing data, usually oriented to a specific business line or team);

  • Structured Query Language (SQL, a commonly-used programming language that helps manage (and perform operations on) relational databases),

we now speak of (see [141]):

  • data lakes (centralized repository in which to store structured/unstructured data alike);

  • data pools (a small collection of shared data that aspires to be a data lake, someday);

  • data swamps (unstructured, ungoverned, and out of control data lake in which data is hard to find/use and is consumed out of context, due to a lack of process, standards and governance);

  • database graveyards (where databases go to die?),

and data might be stored in non-traditional data structures, such as the key-value, triple, and graph stores discussed above.

Popular NoSQL database software includes ArangoDB, MongoDB, Redis, Amazon DynamoDB, OrientDB, Azure CosmosDB, Aerospike, etc.

Once a logical data model is complete, we need only:

  1. instantiate it in the chosen software;

  2. load the data, and

  3. query the data.

Traditional relational databases use SQL; other types of databases either use other query languages (AQL, semantic engines, etc.) or rely on bespoke (tailored) computer programs (e.g. written in R, Python, etc.).
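As a minimal sketch of these three steps (assuming the DBI and RSQLite packages are installed), we can create a throwaway in-memory SQLite database, load a data frame into it, and query it with SQL:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # 1. instantiate (in-memory database)
dbWriteTable(con, "cars", mtcars)                 # 2. load a data frame as a table
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n, AVG(mpg) AS mean_mpg
                 FROM cars GROUP BY cyl")         # 3. query the data with SQL
dbDisconnect(con)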

Once a data collection has been created, it must be managed, so that the data remains accurate, precise, consistent, and complete. Databases decay, after all; if a data lake turns into a data swamp, it will be difficult to squeeze usefulness out of it!

Data and Information Architectures

There is no single correct structure for a given collection of data (or dataset).

Rather, consideration must be given to:

  • the type of relationships that exist in the data/system (and are thought to be important);

  • the types of analysis that will be carried out, and

  • the data engineering requirements relating to the time and effort required to extract and work with the data.

The chosen structure, which stores and organizes the data, is called the data architecture. Designing a specific architecture for a data collection is a necessary part of the data analysis process. The data architecture is typically embedded in the larger data pipeline infrastructure described in Automated Data Pipelines.

For example, automated data pipelines in the service delivery context are usually implemented with nine components (five stages and four transitions, as in Figure 7.13):

  1. data collection

  2. data storage

  3. data preparation

  4. data analysis

  5. data presentation

Note that model validation could be added as a sixth stage, to combat model “drift”.

By analogy with the human body, the data storage component, which houses the data and its architecture, is the “heart” of the pipeline (the engine that makes the pipeline go), whereas the data analysis component is its “brain.”131


Figure 7.13: An implemented automated pipeline; note the transitions between the 5 stages.
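The five stages can be thought of as composable steps; the schematic R sketch below uses placeholder bodies (a real pipeline would call collection APIs, a database, cleaning routines, analytical models, and a reporting layer).

collect <- function() data.frame(x = rnorm(100))                            # data collection
store   <- function(d) { saveRDS(d, file.path(tempdir(), "raw.rds")); d }   # data storage
prepare <- function(d) transform(d, x.std = as.numeric(scale(x)))           # data preparation
analyse <- function(d) list(n = nrow(d), mean.x = mean(d$x))                # data analysis
present <- function(res) cat("n =", res$n, "; mean of x =", round(res$mean.x, 3), "\n")  # data presentation

collect() |> store() |> prepare() |> analyse() |> present()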

Most analysts are familiar with the mathematical and statistical models implemented in the data analysis component. Data models, by contrast, tend to get constructed separately from the analytical models, at the data storage stage. This separation can be problematic if the analytical model is not compatible with the data model: if an analyst needs a flatfile (with variables represented as columns) to feed into an algorithm implemented in R, say, but the data comes from forms with various fields stored in a relational database, the discrepancy could create difficulties on the data preparation side of the process.

Building both the analytical model and the data model off of a common conceptual model might help the data science team avoid such quandaries.

In essence, the task is to structure and organize both data and knowledge so that it can be:

  • stored in a useful manner;

  • added to easily;

  • usefully and efficiently extracted from that store (the “extract-transform-load” (ETL) paradigm), and

  • operated over by humans and computers alike (programs, bots, A.I.) with minimal external modification.

7.5.3 Basic Data Analysis Techniques

Business Intelligence (BI) has evolved over the years:

  1. we started to recognize that data could be used to gain a competitive advantage at the end of the 19th century;

  2. the 1950s saw the first business database for decision support;

  3. in the 1980s and 1990s, computers and data became increasingly available (data warehouses, data mining);

  4. in the 2000s, the trend was to take business analytics out of the hands of data miners (and other specialists) and into the hands of domain experts,

  5. now, big data and specialized techniques have arrived on the scene, as have data visualization, dashboards, and software-as-service.

Historically, BI has been one of the streams contributing to modern-day data science, via:

  • system of interest: the commercial realm, specifically, the market of interest;

  • sources of data: transaction data, financial data, sales data, organizational data;

  • goals: provide awareness of competitors, consumers and internal activity and use this to support decision making,

  • culture and preferred techniques: data marts, key performance indicators, consumer behaviour, slicing and dicing, business ‘facts’.

But no matter the realm in which we work, the ultimate goal remains the same: obtaining actionable insight into the system of interest. This can be achieved in a number of ways. Traditionally, analysts hope to do so by seeking:

  • patterns – predictable, repeating regularities;

  • structure – the organization of elements in a system, and

  • generalization – the creation of general or abstract concepts from specific instances (see Figure 7.14).


Figure 7.14: AFM image of 1,5,9-trioxo-13-azatriangulene (left) and its chemical structure model (right) [142].

The underlying analytical hope is to find patterns or structure in the data from which actionable insights arise.

While finding patterns and structure can be interesting in its own right (in fact, this is the ultimate reward for many scientists), in the data science context it is how these discoveries are used that trumps all.

Variable Types

In the example of a conceptual model shown in Figure 7.5, we have identified different types of variables. In an experimental setting, we typically encounter:

  • control/extraneous variables – we do our best to keep these controlled and unchanging while other variables are changed;

  • independent variables – we control their values as we suspect they influence the dependent variables,

  • dependent variables – we do not control their values; they are generated in some way during the experiment, and presumed dependent on the other factors.

For instance, we could be interested in plant height (dependent) as a function of the mean number of sunlight hours (independent), while accounting for the region of the country in which each test site is located (control).
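In tabular form, such an experiment might be recorded with one row per test site, with the control, independent, and dependent variables as columns (the values below are hypothetical):

# hypothetical plant growth data: one row per test site
plants <- data.frame(
  region       = factor(c("East","East","East","West","West","West")),  # control
  sunlight.hrs = c(4.5, 6.1, 7.0, 5.0, 6.7, 8.5),                       # independent
  height.cm    = c(21.3, 26.0, 30.1, 22.8, 28.2, 33.4)                  # dependent (measured)
)

# height against sunlight, within each (controlled) region
by(plants, plants$region, function(d) cor(d$sunlight.hrs, d$height.cm))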

Data Types

These variables need not be of the same type. In a typical dataset, we may encounter:

  • numerical data – integers or numerics, such as \(1\), \(-7\), \(34.654\), \(0.000004\), etc.;

  • text data – strings of text, which may be restricted to a certain number of characters, such as “Welcome to the park”, “AAAAA”, “345”, “45.678”, etc.;

  • categorical data – variables with a fixed number of possible values, which may be numeric or represented by strings, but for which there is no specific or inherent ordering, such as (‘red’,‘blue’,‘green’), (‘1’,‘2’,‘3’), etc.,

  • ordinal data – categorical data with an inherent ordering; unlike integer data, the spacing between values is not well-defined (very cold, cold, tepid, warm, super hot).

We shall use the following artificial dataset to illustrate some of the concepts.

set.seed(0)       # for reproducibility
n.sample = 165    # number of observations to simulate

# categorical variables and their (unnormalized) sampling weights
colour=factor(c("red","blue","green"))
p.colour=c(40,15,5)
year=factor(c(2012,2013))
p.year=c(60,40)
quarter=factor(c("Q1","Q2","Q3","Q4"))
p.quarter=c(20,25,30,35)

# candidate means/standard deviations for the numerical signal, with weights
signal.mean=c(14,-2,123)
p.signal.mean=c(5,3,1)
signal.sd=c(2,8,15)
p.signal.sd=c(2,3,4)

# sample the categorical levels and the signal parameters for each observation
s.colour <- sample(length(colour), n.sample, prob=p.colour, replace=TRUE)
s.year <- sample(length(year), n.sample, prob=p.year, replace=TRUE)
s.quarter <- sample(length(quarter), n.sample, prob=p.quarter, replace=TRUE)
s.mean <- sample(length(signal.mean), n.sample, prob=p.signal.mean, replace=TRUE)
s.sd <- sample(length(signal.sd), n.sample, prob=p.signal.mean, replace=TRUE)  # note: reuses the p.signal.mean weights

# generate the signal and assemble the data frame
signal <- rnorm(n.sample,signal.mean[s.mean], signal.sd[s.sd])
new_data <- data.frame(colour[s.colour],year[s.year],quarter[s.quarter],signal)
colnames(new_data) <- c("colour","year","quarter","signal")

new_data |>
  dplyr::slice_head(n = 10) |>
  knitr::kable(
    caption = "The first ten rows of `new_data`"
  )
Table 7.1: The first ten rows of new_data
colour year quarter signal
blue 2013 Q2 22.9981796
red 2012 Q1 12.4557784
red 2012 Q4 9.9353103
red 2012 Q3 15.0472412
blue 2013 Q2 6.1420338
red 2012 Q4 13.4976708
blue 2013 Q3 2.5600524
green 2013 Q3 23.6368155
red 2013 Q4 0.8701391
red 2012 Q3 -3.4207423

We can transform categorical data into numeric data by generating frequency counts of the different values/levels of the categorical variable; regular analysis techniques could then be used on the now numeric variable.132

table(new_data$colour)
Var1 Freq
blue 41
green 10
red 114

Categorical data plays a special role in data analysis:

  • in data science, categorical variables come with a pre-defined set of values;

  • in experimental science, a factor is an independent variable with its levels being defined (it may also be viewed as a category of treatment),

  • in business analytics, these are called dimensions (with members).

However they are labeled, these variables can be used to subset or roll up/summarize the data.

Hierarchical / Nested / Multilevel Data

When a categorical variable has multiple levels of abstraction, each level can be viewed as a categorical variable in its own right, with pre-defined relationships to the more detailed levels.

This is commonly the case with time and space variables – we can ‘zoom’ in or out, as needed, which allows us to discuss the granularity of the data, i.e., the ‘maximum zoom factor’ of the data.

For instance, observations could be recorded hourly, and then further processed (mean value, total, etc.) at the daily level, the monthly level, the quarterly level, the yearly level, etc., as seen below.

Let us start with the number of observations by year and quarter:

library(tidyverse)
new_data |> 
  group_by(year, quarter) |> 
  summarise(n = n())
year quarter n
2012 Q1 21
2012 Q2 17
2012 Q3 30
2012 Q4 37
2013 Q1 14
2013 Q2 11
2013 Q3 20
2013 Q4 15

We can also roll it up to the number of observations by year:

new_data |> 
  group_by(year) |> 
  summarise(n = n()) 
year n
2012 105
2013 60

Data Summarizing

The summary statistics of variables can help analysts gain basic univariate insights into the dataset (and hopefully, into the system with which it is associated).

These data summaries do not typically provide the full picture and connections/links between different variables are often missed altogether. Still, they often give analysts a reasonable sense for the data, at least for a first pass.

Common summary statistics include the following (a quick base R sketch appears after the list):

  • min – smallest value taken by a variable;

  • max – largest value taken by a variable;

  • median – “middle” value taken by a variable;

  • mean – average value taken by a variable;

  • mode – most frequent value taken by a variable;

  • # of obs – number of observations for a variable;

  • missing values – # of missing observations for a variable;

  • # of invalid entries – number of invalid entries for a variable;

  • unique values – unique values taken by a variable;

  • quartiles, deciles, centiles;

  • range, variance, standard deviation;

  • skew, kurtosis,

  • total, proportion, etc.
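Most of these summaries are one-liners in base R; applied to the simulated signal variable of new_data, for instance:

min(new_data$signal); max(new_data$signal); median(new_data$signal)
mean(new_data$signal); var(new_data$signal); sd(new_data$signal)
quantile(new_data$signal, probs = c(0.25, 0.50, 0.75))      # quartiles
quantile(new_data$signal, probs = seq(0.1, 0.9, by = 0.1))  # deciles
diff(range(new_data$signal))                                # range, as a single number
sum(is.na(new_data$signal))                                 # number of missing values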

We can also perform operations over subsets of the data – typically over its columns, in effect compressing or ‘rolling up’ multiple data values into a single representative value, as below, say.

We start by creating a mode function (base R’s mode() returns an object’s storage mode, not the statistical mode):

mode.R <- function(x) {
   unique.x <- unique(x)
   unique.x[which.max(tabulate(match(x, unique.x)))]
}

The data can be summarized using:

new_data |>  
 summarise(n = n(), signal.mean=mean(signal), signal.sd=sd(signal),
           colour.mode=mode.R(colour)) 
n signal.mean signal.sd colour.mode
165 20.70894 38.39866 red

Typical roll-up functions include the ‘mean’, ‘sum’, ‘count’, and ‘variance’, but these do not always give sensible outcomes: if the variable measures a proportion, say, the sum of that variable over all observations is a meaningless quantity, on its own.

We can apply the same roll-up function to many different columns, thus providing a mapping (list) of columns to values (as long as the computations all make sense – this might mean that all variables need to be of the same type in some cases).

We can map the mode to some dataset variables:

new_data |>  
 summarise(year.mode=mode.R(year), quarter.mode=mode.R(quarter), 
           colour.mode=mode.R(colour)) 
year.mode quarter.mode colour.mode
2012 Q4 red

Datasets can also be summarized via contingency and pivot tables. A contingency table is used to examine the relationship between two categorical variables – specifically the frequency of one variable relative to a second variable (this is also known as cross-tabulation).

Here is a contingency table, by colour and year:

table(new_data$colour,new_data$year) 
2012 2013
blue 21 20
green 6 4
red 78 36

A contingency table, by colour and quarter:

table(new_data$colour,new_data$quarter) 
Q1 Q2 Q3 Q4
blue 5 8 16 12
green 2 0 5 3
red 28 20 29 37

A contingency table, by year and quarter:

table(new_data$year,new_data$quarter) 
Q1 Q2 Q3 Q4
2012 21 17 30 37
2013 14 11 20 15

A pivot table, on the other hand, is a table generated in a software application by applying operations (e.g. ‘sum’, ‘count’, ‘mean’) to variables, possibly based on another (categorical) variable, as below:

Here is a pivot table, signal characteristics by colour:

new_data |>  group_by(colour) |>
 summarise(n = n(), signal.mean=mean(signal), signal.sd=sd(signal)) 
colour n signal.mean signal.sd
blue 41 25.58772 40.64504
green 10 30.79947 49.71225
red 114 18.06916 36.51887

Contingency tables are a special instance of pivot tables (where the roll-up function is ‘count’).
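For instance, the colour-by-year contingency table above can be reproduced as a pivot table whose roll-up function is a count (pivot_wider() comes from tidyr, loaded with the tidyverse):

new_data |> 
  group_by(colour, year) |> 
  summarise(n = n()) |> 
  pivot_wider(names_from = year, values_from = n, values_fill = 0)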

Analysis Through Visualization

Consider the broad definition of analysis as:

  • identifying patterns or structure, and

  • adding meaning to these patterns or structure by interpreting them in the context of the system.

There are two general options to achieve this:

  1. use analytical methods of varying degrees of sophistication, and/or

  2. visualize the data and use the brain’s analytic (perceptual) power to reach meaningful conclusions about these patterns.


Figure 7.15: Analysis and pattern-reveal through visualization [personal file].

At this point, we will only list some simple visualization methods that are often (but not always) used to reveal patterns (a short sketch follows the list):

  • scatter plots are probably best suited for two numeric variables;

  • line charts, for a numeric variable against an ordinal variable;

  • bar charts, for one categorical and one numeric variable, or for multiple/nested categorical variables,

  • boxplots, histograms, bubble charts, small multiples, etc.
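For instance, a quick look at the simulated new_data with ggplot2 (loaded with the tidyverse), splitting the signal by quarter and colour:

ggplot(new_data, aes(x = quarter, y = signal, fill = colour)) +
  geom_boxplot() +
  labs(title = "Signal by quarter and colour")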

An in-depth discussion of data visualization is given in Data Visualization; best practices and a more complete catalogue are provided in [143].

7.5.4 Common Statistical Procedures in R

The underlying goal of statistical analysis is to reach an understanding of the data. In this section, we show how some of the most common basic statistical concepts that can help analysts reach that goal are implemented in R; a more thorough treatment of probability and statistics notions can be found in Math & Stats Overview.

Once the data is properly organized and visual exploration has begun in earnest, the typical next step is to describe the distribution of each variable numerically, followed by an exploration of the relationships among selected variables.

The objective is to answer questions such as:

  • What kind of mileage are cars getting these days? Specifically, what’s the distribution of miles per gallon (mean, standard deviation, median, range, and so on) in a survey of automobile makes and models?

  • After a new drug trial, what is the outcome (no improvement, some improvement, marked improvement) for drug versus placebo groups? Does the sex of the participants have an impact on the outcome?

  • What is the correlation between income and life expectancy? Is it significantly different from zero?

  • Is a person more likely to be sentenced to imprisonment for a given crime in some regions of Canada than in others? Are the differences between regions statistically significant?

Basic Statistics

When it comes to calculating descriptive statistics, R can basically do it all.

We start with functions that are included in the base installation, and then look at extensions available through user-contributed packages.

For illustrative purposes, we will use several of the variables from the Motor Trend Car Road Tests (mtcars) dataset provided in the base installation: we will focus on miles per gallon (mpg), horsepower (hp), and weight (wt):

myvars <- c("mpg", "hp", "wt")
head(mtcars[myvars])
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460

Let us first take a look at descriptive statistics for all 32 models.

In the base installation, we can use the summary() function to obtain descriptive statistics.

summary(mtcars[myvars])
mpg hp wt
Min. :10.40 Min. : 52.0 Min. :1.513
1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581
Median :19.20 Median :123.0 Median :3.325
Mean :20.09 Mean :146.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610
Max. :33.90 Max. :335.0 Max. :5.424

The summary() function provides the minimum, maximum, quartiles, and mean for numerical variables, and the respective frequencies for factors and logical vectors.

In base R, the functions apply() or sapply() can be used to provide any descriptive statistics. The format in use is:

> `sapply(x, FUN, options)`

where \(x\) is the data frame (or matrix) and FUN is an arbitrary function. If options are present, they’re passed to FUN.

Typical functions that can be used include:

  • mean()

  • sd()

  • var()

  • min()

  • max()

  • median()

  • length()

  • range()

  • quantile()

  • fivenum()

The next example provides several descriptive statistics using sapply(), including the skew and the kurtosis.

mystats <- function(x, na.omit=FALSE){
                    if (na.omit)
                        x <- x[!is.na(x)]
                    m <- mean(x)
                    n <- length(x)
                    s <- sd(x)
                    skew <- sum((x-m)^3/s^3)/n
                    kurt <- sum((x-m)^4/s^4)/n - 3
                    return(c(n=n, mean=m, stdev=s, 
                             skew=skew, kurtosis=kurt))
                  }
sapply(mtcars[myvars], mystats)
mpg hp wt
n 32.000000 32.0000000 32.0000000
mean 20.090625 146.6875000 3.2172500
stdev 6.026948 68.5628685 0.9784574
skew 0.610655 0.7260237 0.4231465
kurtosis -0.372766 -0.1355511 -0.0227108
plot(mtcars[myvars])

For cars in this sample, the mean mpg is 20.1, with a standard deviation of 6.0. The distribution is skewed to the right (\(+0.61\)) and is somewhat flatter than a normal distribution (\(-0.37\)). This is most evident if you graph the data.

hist(mtcars$mpg)

hist(mtcars$hp)

hist(mtcars$wt)

To omit missing values for the computations, we would use the option na.omit=TRUE.

First, we create a version of mtcars with some missing values:

my.mtcars <- mtcars
my.mtcars[2,1] <- NA
my.mtcars[17,1] <- NA
knitr::kable(sapply(my.mtcars[myvars], mystats, na.omit=TRUE))
mpg hp wt
n 30.0000000 32.0000000 32.0000000
mean 20.2400000 146.6875000 3.2172500
stdev 6.1461847 68.5628685 0.9784574
skew 0.5660728 0.7260237 0.4231465
kurtosis -0.4870340 -0.1355511 -0.0227108

Notice the difference in the mpg summary.

The same table can be obtained using the dplyr package functions instead (skewness() and kurtosis() are available in the e1071 package).

mpg = dplyr::summarise(mtcars, n=n(), mean=mean(mpg), 
                stdev=sd(mpg), skew=e1071::skewness(mpg), kurt=e1071::kurtosis(mpg))
hp = dplyr::summarise(mtcars, n=n(), mean=mean(hp), 
                stdev=sd(hp), skew=e1071::skewness(hp), kurt=e1071::kurtosis(hp))
wt = dplyr::summarise(mtcars, n=n(), mean=mean(wt), 
                stdev=sd(wt), skew=e1071::skewness(wt), kurt=e1071::kurtosis(wt))

pivot = t(rbind(mpg,hp,wt))
colnames(pivot) <- c("mpg","hp","wt")
pivot
mpg hp wt
n 32.000000 32.0000000 32.0000000
mean 20.090625 146.6875000 3.2172500
stdev 6.026948 68.5628685 0.9784574
skew 0.610655 0.7260237 0.4231465
kurt -0.372766 -0.1355511 -0.0227108

Hmisc and pastecs

Several packages offer functions for descriptive statistics, including Hmisc and pastecs.

Hmisc’s describe() function returns the number of variables and observations, the number of missing and unique values, the mean, quantiles, and the five highest and lowest values.

Hmisc::describe(mtcars[myvars])
mtcars[myvars] 

 3  Variables      32  Observations
--------------------------------------------------------------------------------
mpg 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       25    0.999    20.09    6.796    12.00    14.34 
     .25      .50      .75      .90      .95 
   15.43    19.20    22.80    30.09    31.30 

lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
--------------------------------------------------------------------------------
hp 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       22    0.997    146.7    77.04    63.65    66.00 
     .25      .50      .75      .90      .95 
   96.50   123.00   180.00   243.50   253.55 

lowest :  52  62  65  66  91, highest: 215 230 245 264 335
--------------------------------------------------------------------------------
wt 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      32        0       29    0.999    3.217    1.089    1.736    1.956 
     .25      .50      .75      .90      .95 
   2.581    3.325    3.610    4.048    5.293 

lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424
--------------------------------------------------------------------------------

The pastecs package includes the function stat.desc() that provides a wide range of descriptive statistics:

> `stat.desc(x, basic=TRUE, desc=TRUE, norm=FALSE, p=0.95)`

where \(x\) is a data frame or a time series. If basic=TRUE (the default), the number of values, null values, missing values, minimum, maximum, range, and sum are provided.

If desc=TRUE (also the default), the median, mean, standard error of the mean, 95% confidence interval for the mean, variance, standard deviation, and coefficient of variation are also provided.

Finally, if norm=TRUE (not the default), normal distribution statistics are returned, including skewness and kurtosis (with statistical significance) and the Shapiro–Wilk test of normality.

The p option specifies the confidence level used to compute the confidence interval for the mean (0.95 by default).

pastecs::stat.desc(mtcars[myvars])
mpg hp wt
nbr.val 32.0000000 32.0000000 32.0000000
nbr.null 0.0000000 0.0000000 0.0000000
nbr.na 0.0000000 0.0000000 0.0000000
min 10.4000000 52.0000000 1.5130000
max 33.9000000 335.0000000 5.4240000
range 23.5000000 283.0000000 3.9110000
sum 642.9000000 4694.0000000 102.9520000
median 19.2000000 123.0000000 3.3250000
mean 20.0906250 146.6875000 3.2172500
SE.mean 1.0654240 12.1203173 0.1729685
CI.mean.0.95 2.1729465 24.7195501 0.3527715
var 36.3241028 4700.8669355 0.9573790
std.dev 6.0269481 68.5628685 0.9784574
coef.var 0.2999881 0.4674077 0.3041285

We will take this opportunity to caution users against relying too heavily on one (or multiple) specific packages.

Correlations

Correlation coefficients are used to describe relationships among quantitative variables. The sign \(\pm\) indicates the direction of the relationship (positive or inverse), and the magnitude indicates the strength of the relationship (ranging from 0 for no linear relationship to 1 for a perfect linear relationship).

In this section, we look at a variety of correlation coefficients, as well as tests of significance. We will use the state.x77 dataset available in the base R installation. It provides data on the population, income, illiteracy rate, life expectancy, murder rate, and high school graduation rate for the 50 US states in 1977. There are also temperature and land-area measures, but we will not be using them. In addition to the base installation, we will be using the psych and ggm packages.

R can produce a variety of correlation coefficients, including Pearson, Spearman, Kendall, partial, polychoric, and polyserial:

  • the Pearson product-moment coefficient assesses the degree of linear relationship between two quantitative variables;

  • Spearman’s rank-order coefficient assesses the degree of relationship between two rank-ordered variables,

  • Kendall’s tau coefficient is a nonparametric measure of rank correlation.

The cor() function produces all three correlation coefficients, whereas the cov() function provides covariances. There are many options, but a simplified format for producing correlations is

> `cor(x, use= , method= )`

where \(x\) is a matrix or a data frame, and use specifies the handling of missing data; its options are

  • all.obs (assumes no missing data);

  • everything (any correlation involving a case with missing values will be set to missing);

  • complete.obs (listwise deletion), and

  • pairwise.complete.obs (pairwise deletion).

The method specifies the type of correlation; its options are pearson, spearman, and kendall.

The default options are use="everything" and method="pearson".

For the built-in dataset state.x77, which contains socio-demographic information about the 50 U.S. states from 1977, we find the following correlations:

states<- state.x77[,1:6] 
cor(states)
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.0000000 0.2082276 0.1076224 -0.0680520 0.3436428 -0.0984897
Income 0.2082276 1.0000000 -0.4370752 0.3402553 -0.2300776 0.6199323
Illiteracy 0.1076224 -0.4370752 1.0000000 -0.5884779 0.7029752 -0.6571886
Life Exp -0.0680520 0.3402553 -0.5884779 1.0000000 -0.7808458 0.5822162
Murder 0.3436428 -0.2300776 0.7029752 -0.7808458 1.0000000 -0.4879710
HS Grad -0.0984897 0.6199323 -0.6571886 0.5822162 -0.4879710 1.0000000

This produces the Pearson product-moment correlation coefficients. We can see, for example, that a strong positive correlation exists between income and HS Grad rate and that a strong negative correlation exists between Illiteracy and Life Exp.
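To test whether a single correlation differs significantly from zero, base R’s cor.test() can be used (the psych package’s corr.test() does the same for an entire matrix of correlations); for instance, for Income and HS Grad:

cor.test(states[, "Income"], states[, "HS Grad"], method = "pearson")

# Spearman's rank-order version of the same test
cor.test(states[, "Income"], states[, "HS Grad"], method = "spearman")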

A partial correlation is a correlation between two quantitative variables, controlling for one or more other quantitative variables; the pcor() function in the ggm package provides partial correlation coefficients (again, this package is not installed by default, so it must be installed before first use).

The format is

> `pcor(u, S)`

where \(u\) is a vector of integers, with the

  • first two entries representing the indices of the variables to be correlated, and

  • remaining numbers being the indices of the conditioning variables (that is, the variables being partialled out),

and where \(S\) is the covariance matrix among the variables.

colnames(states)
ggm::pcor(c(1,5,2,3,6), cov(states))
ggm::pcor(c(1,5,2,3), cov(states))
ggm::pcor(c(1,5,2), cov(states))
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"    
[6] "HS Grad"   
[1] 0.3462724
[1] 0.3621683
[1] 0.4113621

In this case, 0.346 is the correlation between population (variable 1) and murder rate (variable 5), controlling for the influence of income, illiteracy rate, and high school graduation rate (variables 2, 3, and 6 respectively).

The use of partial correlations is common in the social sciences.
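Significance tests for these coefficients are also readily available: the base cor.test() function tests whether a single correlation differs from zero, and corr.test() in the psych package does so for an entire matrix. A minimal sketch, using two of the states columns (output omitted):

cor.test(states[,"Illiteracy"], states[,"Murder"])   # H0: the (Pearson) correlation is zero
psych::corr.test(states)                             # correlations, sample sizes, and p-values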

Simple Linear Regression

In many ways, regression analysis is at the heart of statistics. It is a broad term for a set of methodologies used to predict a response variable (also called a dependent, criterion, or outcome variable) from one or more predictor variables (also called independent or explanatory variables).

In general, regression analysis can be used to:

  • identify the explanatory variables that are related to a response variable;

  • describe the form of the relationships involved, and

  • provide an equation for predicting the response variable from the explanatory variables.

For example, an exercise physiologist might use regression analysis to develop an equation for predicting the expected number of calories a person will burn while exercising on a treadmill.

In this example, the response variable is the number of calories burned (calculated from the amount of oxygen consumed), say, and the predictor variables might include:

  • duration of exercise (minutes);

  • percentage of time spent at their target heart rate;

  • average speed (mph);

  • age (years);

  • gender, and

  • body mass index (BMI).

From a practical point of view, regression analysis would help answer questions such as:

  • How many calories can a 30-year-old man with a BMI of 28.7 expect to burn if he walks for 45 minutes at an average speed of 4 miles per hour and stays within his target heart rate 80% of the time?

  • What’s the minimum number of variables needed in order to accurately predict the number of calories a person will burn when walking?

R has powerful and comprehensive features for fitting regression models – the abundance of options can be confusing.

The basic function for fitting a linear model is lm(). The format is

> `myfit <- lm(formula, data)`

where formula describes the model to be fit and data is the data frame containing the data to be used in fitting the model.

The resulting object (myfit, in this case) is a list that contains extensive information about the fitted model.

The formula is typically written as \[Y \sim X_{1}+X_{2}+\cdots+X_{k}\] where the \(\sim\) separates the response variable on the left from the predictor variables on the right, and the predictor variables are separated by \(+\) signs.
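The formula syntax is quite flexible; a few common variants, shown here as a sketch with generic variable names:

# y ~ x1 + x2       main effects only
# y ~ x1*x2         main effects plus their interaction (x1 + x2 + x1:x2)
# y ~ .             all remaining columns of the data frame as predictors
# y ~ x + I(x^2)    polynomial term (I() protects arithmetic inside a formula)
# y ~ x - 1         regression through the origin (no intercept)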

In addition to lm(), there are several functions that are useful when generating regression models.

  • summary(): displays detailed results for the fitted model;

  • coefficients(): lists the model parameters (intercept and slopes) for the fitted model;

  • confint(): provides confidence intervals for the model parameters (95%, by default);

  • residuals(): lists the residual values of a fitted model;

  • anova(): generates an ANOVA table for a fitted model, or an ANOVA table comparing two or more fitted models;

  • plot(): generates diagnostic plots for evaluating the fit of a model;

  • fitted(): extracts the fitted values for the dataset, and

  • predict(): uses a fitted model to predict response values for a new dataset.

Each of these functions is applied to the object returned by lm() in order to generate additional information based on the fitted model.


Example: the women dataset in the base installation provides the heights and weights for a set of 15 women aged 30 to 39. Assume that we are interested in predicting the weight of an individual from her height.133

The linear regression on the data is obtained as follows:

fit <- lm(weight ~ height, data=women)
summary(fit)

Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

From the output, you see that the prediction equation is \[\widehat{\text{Weight}} = -87.52 + 3.45\times \text{ Height}.\]

Because a height of 0 is impossible, there is no sense in trying to give a physical interpretation to the intercept – it merely becomes an adjustment constant (in other words, \(0\) is not in the domain of the model).

From the Pr(>|t|) column, we see that the regression coefficient (3.45) is significantly different from zero (\(p < 0.001\)), which indicates that there is an expected increase of 3.45 pounds of weight for every 1-inch increase in height. The multiple R-squared coefficient (0.991) indicates that the model accounts for 99.1% of the variance in weights.

The individual weights (in pounds) are:

women$weight
 [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

and their fitted values (and residuals) are

fitted(fit)
residuals(fit)
       1        2        3        4        5        6        7        8 
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 
       9       10       11       12       13       14       15 
140.1833 143.6333 147.0833 150.5333 153.9833 157.4333 160.8833 
          1           2           3           4           5           6 
 2.41666667  0.96666667  0.51666667  0.06666667 -0.38333333 -0.83333333 
          7           8           9          10          11          12 
-1.28333333 -1.73333333 -1.18333333 -1.63333333 -1.08333333 -0.53333333 
         13          14          15 
 0.01666667  1.56666667  3.11666667 
plot(women$height, women$weight,
       xlab="Height (in inches)",
       ylab="Weight (in pounds)")
abline(fit)   # superimpose the fitted regression line
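The other helper functions listed above can be applied to the same fitted object; for instance (a short sketch, with hypothetical new heights):

confint(fit)                                   # 95% confidence intervals for the coefficients
new.heights <- data.frame(height=c(62.5, 68))  # hypothetical new observations
predict(fit, newdata=new.heights)              # predicted weights at those heights
predict(fit, newdata=new.heights, interval="prediction")  # with prediction intervals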

Bootstrapping

Bootstrapping is a powerful and elegant approach to estimating the sampling distribution of specific statistics. It can be implemented in many situations where asymptotic results are difficult to find or otherwise unsatisfactory.

Bootstrapping proceeds using three steps:

  1. resample the dataset (with replacement) many times over (typically on the order of 10,000);

  2. calculate the desired statistic from each resampled dataset,

  3. use the distribution of the resampled statistics to estimate the standard error of the statistic (normal approximation method) or construct a confidence interval using quantiles of that distribution (percentile method).

There are several ways to bootstrap in R. As an example, say that we want to estimate the standard error and 95% confidence interval for the coefficient of variation (CV), defined as \(\sigma/\mu\), for a random variable \(X\).

We will illustrate the procedure with generated values of \(X\sim \mathcal{N}(1,1)\):

set.seed(0) # for replicability
x = rnorm(1000, mean=1)
hist(x)

(cv=sd(x)/mean(x))
[1] 1.014057

The user must provide code to calculate the statistic of interest as a function.

cvfun = function(x) { 
    return(sd(x)/mean(x))
}

The replicate() function is the base R tool for repeating function calls. Within that function, we nest a call to cvfun() and a call to sample the data with replacement using the sample() function.

res = replicate(50000, cvfun(sample(x, replace=TRUE)))
hist(res)

We can also compute quantiles, as below:

quantile(res, c(.025, .975))
     2.5%     97.5% 
0.9432266 1.0917185 

This seems reasonable, as we would expect the CVs to be centered around 1.

The percentile interval is easy to calculate from the observed bootstrapped statistics. If the distribution of the bootstrap samples is approximately normally distributed, a \(t\) interval could be created by calculating the standard deviation of the bootstrap samples and finding the appropriate multiplier for the confidence interval. Plotting the bootstrap sample estimates is helpful to determine the form of the bootstrap distribution.
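For the normal approximation method mentioned above, the bootstrap standard error is simply the standard deviation of the resampled statistics; a minimal sketch, re-using the res and cv objects from the previous chunks:

se.boot <- sd(res)                   # bootstrap estimate of the standard error of the CV
cv + c(-1, 1)*qnorm(0.975)*se.boot   # approximate 95% confidence interval for the CV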

The framework can also be extended to include non-linear models, correlated variables, probability estimation, and/or multivariate models; any book on statistical analysis contains at least one chapter or two on the topic (see [32], [144], for instance).

We will not pursue the topic further except to say that regression analysis is one of the arrows that every data scientist should have in their quiver.

7.5.5 Quantitative Methods

We provided a list of quantitative methods in Data Collection, Storage, Processing, and Modeling; we finish this section by expanding on a few of them.

Classification and Supervised Learning Tasks

Classification is one of the cornerstones of machine learning. Instead of trying to predict the numerical value of a response variable (as in regression), a classifier uses historical data [This training data usually consists of a randomly selected subset of the labeled (response) data.] to identify general patterns that could lead to observations belonging to one of several pre-defined categories.

For instance, if a car insurance company only has resources to investigate up to 20% of all filed claims, it could be useful for them to predict:

  • whether a claim is likely to be fraudulent;

  • whether a customer is likely to commit fraud in the near future;

  • whether an application for a policy is likely to result in a fraudulent claim,

  • the amount by which a claim will be reduced if it is fraudulent, etc.

Analysts and machine learning practitioners use a variety of different techniques to carry this process out (see Figure 7.16 for an illustration, and Machine Learning 101 and [2], [5], [6], in general, for more details), but the general steps always remain the same:

  1. use training data to teach the classifier;

  2. test/validate the classifier using hold-out data,

  3. if it passes the test, use the classifier to classify novel instances.


Figure 7.16: The trousers of classification [personal file].
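As a concrete (if toy) illustration of these three steps, here is a minimal sketch using the built-in iris data and a decision tree from the rpart package (which ships with R); the 70/30 split proportion and the random seed are arbitrary choices:

library(rpart)
set.seed(1)                                          # for replicability
train.idx <- sample(nrow(iris), 0.7*nrow(iris))      # 1. select the training data
train <- iris[train.idx,]
test  <- iris[-train.idx,]
tree  <- rpart(Species ~ ., data=train)              #    teach the classifier
preds <- predict(tree, newdata=test, type="class")   # 2. validate on the hold-out data
table(predicted=preds, actual=test$Species)          #    confusion matrix
# 3. if the performance is acceptable, call predict() on novel instances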

Some classifiers (such as deep learning neural nets) are ‘black boxes’: they might be very good at classification, but they are not explainable.

In some instances, that is an acceptable side effect of the process; in others, it might not be: if an individual is refused refugee status, say, they might rightly want to know why.

Unsupervised Learning Techniques

The hope of artificial intelligence is that intelligent behaviours will eventually be able to be automated. For the time being, however, that is still very much a work in progress.

But one of the challenges in that process is that not every intelligent behaviour arises from a supervised process.

Classification, for instance, is the prototypical supervised task: can we learn from historical/training examples? It seems like a decent approach to learning: evidence should drive the process.

But there are limitations to such an approach: it is difficult to make a conceptual leap solely on the basis of training data [if our experience in learning is anything to go by…], if only because the training data might not be representative of the system, or because the learner’s target task is too narrow.

In unsupervised learning, we learn without examples, based solely on what is found in the data. There is no specific question to answer (in the classification sense), other than “what can we learn from the data?”

Typical unsupervised learning tasks include:

  • clustering (novel categories);

  • association rules mining,

  • recommender systems, etc.

For instance, an online bookstore might want to make recommendations to customers concerning additional items to browse (and hopefully purchase) based on their buying patterns in prior transactions, the similarity between books, and the similarity between customer segments:

  • But what are those patterns?

  • How do we measure similarity?

  • What are the customer segments?

  • Can any of that information be used to create promotional bundles?

The lack of a specific target makes unsupervised learning much more difficult than supervised learning, as do the challenges of validating the results. This contributes to the proliferation of clustering algorithms and cluster quality metrics.
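To give a flavour of the clustering task, here is a minimal k-means sketch on the states data used earlier (the choice of 3 clusters and the seed are arbitrary; in practice, cluster quality metrics would guide those choices):

set.seed(0)                                       # k-means uses random starting points
km <- kmeans(scale(states), centers=3, nstart=25) # standardize the variables first
km$cluster                                        # cluster membership for each state
km$centers                                        # cluster centres, in standardized units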

More general information and details on clustering can be found in Machine Learning 101 and in [4], [5], [145].

Other Machine Learning Tasks

These tasks represent but a minuscule part of the machine learning ecosystem. Other common tasks include [137]:

  • profiling and behaviour description;

  • link prediction;

  • data reduction,

  • influence/causal modeling, etc.

to say nothing of more sophisticated learning frameworks (semi-supervised, reinforcement [146], deep learning [147], etc.).

Time Series Analysis and Process Monitoring

Processes are often subject to variability:

  • variability due to the cumulative effect of many small, essentially unavoidable causes (a process that operates with only such common causes is said to be in (statistical) control);
  • variability due to special causes, such as improperly adjusted machines, poorly trained operators, defective materials, etc. (this variability is typically much larger, and such processes are said to be out of (statistical) control).

The aim of statistical process monitoring (SPM) is to identify the occurrence of special causes. This is often done via time series analysis.

Consider \(n\) observations \(\{x_1,\ldots,x_n\}\) arising from some collection of processes. In practice, the index \(i\) is often a time index or a location index, i.e., the \(x_i\) are observed in sequence or in regions.134

The processes that generate the observations could change from one time/location to the next due to:

  • external factors (war, pandemic, regime change, election results, etc.), or

  • internal factors (policy changes, modification of manufacturing process, etc.).

In such cases, the mean and standard deviation alone might not provide a useful summary of the situation.

To get a sense of what is going on with the data (and the associated system), it could prove preferable to plot the data in the order that it has been collected (or according to geographical regions, or both).

The horizontal coordinate would then represent:

  • the time of collection \(t\) (order, day, week, quarter, year, etc.), or

  • the location \(i\) (country, province, city, branch, etc.).

The vertical coordinate represents the observations of interest \(x_t\) or \(x_i\) (see Figure 7.17 for an example).


Figure 7.17: Real S&P stock price index (red), earnings (blue), and dividends (green), together with interest rates (black), from 1871 to 2009 [R.J. Shiller].

In process monitoring terms, we may be able to identify potential special causes by looking for trend breaks, cycle discontinuities, or level shifts in the time series.
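A rough way to visualize such shifts in R is an individuals-type control chart, flagging points that fall outside the mean ± 3 standard deviation band (a minimal sketch on simulated data with a planted level shift; dedicated packages such as qcc offer proper control charts):

set.seed(0)
x <- c(rnorm(30, mean=10), rnorm(10, mean=13))   # process with a level shift after t = 30
plot(x, type="b", xlab="time index", ylab="observation")
abline(h=mean(x), lty=1)                         # centre line
abline(h=mean(x) + c(-3, 3)*sd(x), lty=2)        # rough control limits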

For instance, consider the three time series of Figure 7.18.


Figure 7.18: Sales (in $10,000’s) for 3 different products – years (left), quarters (middle, but labeled in years), weeks (right) [personal file].

Is any action required?

  • in the first example (left), there are occasional drops in sales from one year to the next, but the upward trend is clear – we see the importance of considering the full time series; if only the last two points are presented to stockholders, say, they might conclude that action is needed, whereas the whole series paints a more positive outlook;

  • in the second case (middle), there is a cyclic effect with increases from Q1 to Q2 and from Q2 to Q3, but decreases from Q3 to Q4 and from Q4 to Q1. Overall, we also see an upward trend – the presence of regular patterns is a positive development,

  • finally, in the last example (right), something clearly happened after the tenth week, causing a trend level shift. Whether it is due to internal or external factors depends on the context, which we do not have at our disposal, but some action certainly seems to be needed.

We might also be interested in using historical data to forecast the future behaviour of the variable. This is similar to the familiar analysis goals of:

  • finding patterns in the data, and

  • creating a (mathematical) model that captures the essence of these patterns.

Time series patterns can be quite complex and must often be broken down into multiple component models (trend, seasonal, irregular, etc.).

Typically, this decomposition is achieved with methods such as moving averages, exponential smoothing, or seasonal-trend decomposition; it is not a simple topic, in general, but thankfully, there are software libraries that can help.
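In R, for instance, the built-in decompose() and stl() functions split a time series into trend, seasonal, and remainder components; a minimal sketch on the built-in AirPassengers series (logged to tame the growing seasonal amplitude):

fit.stl <- stl(log(AirPassengers), s.window="periodic")  # seasonal-trend decomposition using LOESS
plot(fit.stl)                                            # data, seasonal, trend, and remainder panels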

Anomaly Detection

The special points from process monitoring are anomalous in the sense that something unexpected happens there, something that changes the nature of the data pre- and post-break.

In a more general context, anomalous observations are those that are atypical or unlikely.

From an analytical perspective, anomaly detection can be approached using supervised, unsupervised, or conventional statistical methods.

The discipline is rich and vibrant (and the search for anomalies can end up being an arms race against the “bad guys”), but it is definitely one for which analysts should heed contextual understanding – blind analysis leads to blind alleys! A more thorough treatment is provided in [148].
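As a trivial illustration of the conventional statistical route, Tukey’s boxplot rule flags observations that lie more than 1.5 interquartile ranges beyond the quartiles (a minimal sketch with two planted anomalies; real applications require far more care, and context):

set.seed(0)
x <- c(rnorm(100), 6, -5)            # mostly well-behaved data, plus two planted anomalies
boxplot.stats(x)$out                 # observations flagged by the 1.5 IQR rule
which(x %in% boxplot.stats(x)$out)   # their positions in the dataset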


There is a lot more to say on the topic of data analysis – we will delve into various topics in detail in subsequent modules.

References

[2]
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2008.
[4]
C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
[5]
C. C. Aggarwal, Data Mining: The Textbook. Cham: Springer, 2015.
[6]
C. C. Aggarwal, Ed., Data Classification: Algorithms and Applications. CRC Press, 2015.
[32]
R. V. Hogg and E. A. Tanis, Probability and Statistical Inference, 7th ed. Pearson/Prentice Hall, 2006.
[135]
[136]
D. Woods, “Bitly’s Hilary Mason on ‘what is a data scientist?’,” Forbes, Mar. 2012.
[137]
F. Provost and T. Fawcett, Data Science for Business. O’Reilly, 2015.
[138]
[139]
boot4life, “What JSON structure to use for key-value pairs,” StackOverflow, Jun. 2016.
[140]
[141]
N. Feldman, Data Lake or Data Swamp?, 2015.
[142]
P. Hapala et al., “Mapping the electrostatic force field of single molecules from high-resolution scanning probe images,” Nature Communications, vol. 7, no. 11560, 2016.
[143]
P. Boily, S. Davies, and J. Schellinck, Practical Data Visualization. Data Action Lab/Quadrangle, 2022.
[144]
[145]
Wikipedia, “Cluster analysis algorithms.”
[146]
R. Sutton and G. Barto, Reinforcement Learning: an Introduction. MIT Press, 2018.
[147]
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[148]
Y. Cissokho, S. Fadel, R. Millson, R. Pourhasan, and P. Boily, “Anomaly Detection and Outlier Analysis,” Data Science Report Series, 2020.