7.2 Conceptual Frameworks for Data Work
In simple terms, we use data to represent the world. But this is not the only strategy at our disposal: we might also (and in combination) describe the world using language, or represent it by building physical models.
The common thread is the more basic concept of representation – the idea that one object can stand in for another, and be used in its stead in order to indirectly engage with the object being represented. Humans are representational animals par excellence; our use of representations becomes almost undetectable to us, at times.
On some level, we do understand that “the map is not the territory”, but we do not have to make much of an effort to use the map to navigate the territory. The transition from the representation to the represented is typically quite seamless. This is arguably one of humanity’s major strengths, but in the world of data science it can also act as an Achilles’ heel, preventing analysts from working successfully with clients and project partners, and from appropriately transferring analytical results to the real-world contexts that could benefit from them.
The best protection against these potential threats is the existence of a well thought out and explicitly described conceptual framework, by which we mean, in its broadest sense:
a specification of which parts of the world are being represented;
how they are represented;
the nature of the relationship between the represented and the representing, and
appropriate and rigorous strategies for applying the results of the analysis that is carried out in this representational framework.
It would be possible to construct such a specification from scratch, in a piecemeal fashion, for each new project, but it is worth noting that there are some overarching modeling frameworks that are broadly applicable to many different phenomena, which can then be moulded to fit these more specific instances.
7.2.1 Three Modeling Strategies
We suggest that there are three main (not mutually exclusive) modeling strategies that can be used to guide the specification of a phenomenon or domain:
mathematical modeling;
computer modeling, and
systems modeling.
We start with a description of the last of these as it requires, in its simplest form, no special knowledge of techniques/concepts from mathematics or computer science.
Systems Modeling
General Systems Theory was initially put forward by the biologist L. von Bertalanffy, who felt that it should be possible to describe many disparate natural phenomena using a common conceptual framework – one capable of representing them all as systems of interacting objects.
Although Bertalanffy himself presented abstracted, mathematical, descriptions of his general systems concepts, his broad strategy is relatively easily translated into a purely conceptual framework.
Within this framework, when presented with a novel domain or situation, we ask ourselves the following questions:
which objects seem most relevant or involved in the system behaviours in which we are most interested?
what are the properties of these objects?
what are the behaviours (or actions) of these objects?
what are the relationships between these objects?
how do the relationships between objects influence their properties and behaviours?
As we find the answers to these questions about the system of interest, we start to develop a sense that we understand the system and its relevant behaviours.
By making this knowledge explicit, e.g. via diagrams and descriptions, and by sharing it amongst those with whom we are working, we can further develop a consistent, shared understanding of the system with which we are engaged. If this activity is carried out prior to data collection, it can ensure that the right data is collected.
If this activity is carried out after data collection, it can ensure that the process of interpreting what the data represents and how the latter should be used going forward is on solid footing.
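The five systems questions can be made concrete in code. Below is a minimal sketch using a hypothetical predator-prey system; all class names, species, and rates are illustrative, not taken from the text:

```python
# A toy "system of interacting objects": objects (Population), properties
# (size, growth_rate), behaviours (grow), and relationships (predation,
# which links one population's behaviour to another's properties).
# All names and numbers are illustrative.

class Population:
    def __init__(self, name, size, growth_rate):
        self.name = name                 # property: identity
        self.size = size                 # property: current size
        self.growth_rate = growth_rate   # property: intrinsic growth

    def grow(self):
        """Behaviour: the population changes according to its own properties."""
        self.size = max(0.0, self.size * (1 + self.growth_rate))

def predation(predator, prey, rate=0.001):
    """Relationship: the two populations' sizes influence each other."""
    eaten = rate * predator.size * prey.size
    prey.size = max(0.0, prey.size - eaten)
    predator.size += 0.1 * eaten

hares = Population("hares", 1000.0, 0.05)
lynxes = Population("lynxes", 50.0, -0.02)

for _ in range(10):      # the system evolves over time
    hares.grow()
    lynxes.grow()
    predation(lynxes, hares)

print(round(hares.size, 1), round(lynxes.size, 1))
```

Even a toy sketch like this forces the modeler to answer the framework's questions explicitly: which objects matter, what their properties and behaviours are, and how their relationships couple those properties together.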
Mathematical and Computer Modeling
The other modeling approaches arguably come with their own general frameworks for interpreting and representing real-world phenomena and situations, separate from, but still compatible with, this systems perspective.
These disciplines have developed their own mathematical/digital (logical) worlds that are distinct from the tangible, physical world studied by chemists, biologists, and so on. These frameworks can then be used to describe real-world phenomena by drawing parallels between the properties of objects in these different worlds and reasoning via these parallels.
Why these constructed worlds and the conceptual frameworks they provide are so effective at representing and describing the actual world, and thus allowing us to understand and manipulate it, is more of a philosophical question than a pragmatic one.
We will only note that they are highly effective at doing so, which provides the impetus and motivation to learn more about how these worlds operate, and how, in turn, they can provide data scientists with a means to engage with domains and systems through a powerful, rigorous and shared conceptual framework.
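As an illustration of reasoning via such parallels: a bacterial culture that doubles at a fixed rate can be mirrored by the mathematical object \(N(t) = N_0 \cdot 2^{t/d}\), and questions about the culture can then be answered entirely inside the mathematical world. The numbers below are illustrative, not from the text:

```python
# Mathematical world: N(t) = N0 * 2**(t / d), an abstract function.
# Real world (by analogy): a bacterial culture with doubling time d hours.
# Illustrative numbers only.
import math

N0 = 500     # initial population
d = 3.0      # doubling time, in hours

def N(t):
    """Population predicted by the mathematical model at time t (hours)."""
    return N0 * 2 ** (t / d)

# A question about the world, answered inside the model:
# when does the culture reach 8000 cells?
t_star = d * math.log2(8000 / N0)

print(N(6.0))    # after two doubling times: 2000.0
print(t_star)    # 12.0 hours (8000/500 = 16 = 2**4 doublings)
```

The pragmatic payoff is exactly the one described above: once the parallel is drawn, manipulating the mathematical object stands in for experimenting on the culture itself.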
7.2.2 Information Gathering
The importance of achieving contextual understanding of a dataset cannot be over-emphasized. In the abstract we have suggested that this context can be gained by using conceptual frameworks. But more concretely, how does this understanding come about?
It can be reached through:
interviews with subject matter experts (SMEs), and
data exploration (even just trying to obtain or gain access to the data can prove a major pain).
In general, clients or stakeholders are not a uniform entity – it is even conceivable that client data specialists and SMEs will resent the involvement of analysts (external and/or internal).
Thankfully, this stage of the process provides analysts and consultants the opportunity to show that everyone is pulling in the same direction, by
asking meaningful questions;
taking an interest in the SMEs’/clients’ experiences, and
acknowledging everyone’s ability to contribute.
A little tact goes a long way when it comes to information gathering.
Thinking in Systems Terms
We have already noted that a system is made up of objects with properties that potentially change over time. Within the system we perceive actions and evolving properties, leading us to think in terms of processes.
To put it another way, in order to understand how various aspects of the world interact with one another, we need to carve out chunks corresponding to those aspects and define their boundaries. Working with others requires this type of shared understanding of what is being studied. Objects themselves have various properties.
Natural processes generate (or destroy) objects, and may change the properties of these objects over time. We observe, quantify, and record particular values of these properties at particular points in time.
This process generates data points in our attempt to capture the underlying reality to some acceptable degree of accuracy. It remains crucial for data analysts and data scientists to remember that even the best system model only ever provides an approximation of the situation under analysis; with some luck, experience, and foresight, these approximations might turn out to be valid.
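The observation step described above can be sketched as sampling a process at discrete times with a noisy instrument. The cooling process and all parameter values below are hypothetical:

```python
# An underlying natural process (Newtonian cooling of a cup of coffee)
# and the data points we record by sampling it at discrete times with
# measurement error. Illustrative values only.
import math
import random

random.seed(42)

def true_temperature(t, ambient=20.0, initial=90.0, k=0.1):
    """The process itself: what the property 'temperature' really is at time t."""
    return ambient + (initial - ambient) * math.exp(-k * t)

def measure(t, noise_sd=0.5):
    """Observation: a recorded value = true value + measurement error."""
    return true_temperature(t) + random.gauss(0.0, noise_sd)

# The dataset: (time, recorded value) pairs -- an approximation of the
# process, not the process itself.
data = [(t, measure(t)) for t in range(0, 30, 5)]
for t, obs in data:
    print(t, round(obs, 2))
```

The gap between `true_temperature` and the recorded values is precisely the approximation the text warns about: the dataset stands in for the process, but is never identical to it.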
Identifying Gaps in Knowledge
A gap in knowledge is identified when we realize that what we thought we knew about a system proves incomplete (or blatantly false).
This can arise as the result of a certain naïveté vis-à-vis the situation being modeled, but it can also be emblematic of the nature of the project under consideration: with too many moving parts and grandiose objectives, there cannot help but be knowledge gaps.120
Knowledge gaps might occur repeatedly, at any moment in the process – even during the communication of results (!).
When faced with such a gap, the best approach is to be flexible: go back, ask questions, and modify the system representation as often as is necessary. For obvious reasons, it is preferable to catch these gaps early on in the process.
Consider the following situation: you are away on business and you forgot to hand in a very important (and urgently required) architectural drawing to your supervisor before leaving. Your office will send an intern to pick it up in your living space. How would you explain to them, by phone, how to find the document?
If the intern has previously been in your living space, if their living space is comparable to yours, or if your spouse is at home, the process may be sped up considerably, but with somebody for whom the space is new (or someone with a visual impairment, say), it is easy to see how things could get complicated.
But time is of the essence – you and the intern need to get the job done correctly as quickly as possible. What is your strategy?
Conceptual models are built using methodical investigation tools. Data analysts and data scientists should beware implicit conceptual models – they go hand-in-hand with knowledge gaps.
In our opinion, it is preferable to err on the side of “too much conceptual modeling” than the alternative (although, at some point we have to remember that every modeling exercise is wrong121 and that there is nothing wrong with building better models in an iterative manner, over the bones of previously-discarded simpler models).
Roughly speaking, a conceptual model is a model that is not implemented as a scale-model or computer code, but one which exists only conceptually, often in the form of a diagram or verbal description of a system – boxes and arrows, mind maps, lists, definitions (see Figures 7.2 and 7.3).
Conceptual models do not necessarily attempt to capture specific behaviours, but they emphasize the possible states of the system: the focus is on object types, not on specific instances, with abstraction as the ultimate objective.
Conceptual modeling is not an exact science – it is more about making internal conceptual models explicit and tangible, and providing data analysis teams with the opportunity to examine and explore their ideas and assumptions. Attempts to formalize the concept include (see Figure 7.4):
Unified Modeling Language (UML);
Entity Relationship Models (ER), generally connected to relational databases.
In practice, we must first select a system for the task at hand, then generate a conceptual model that encompasses:
relevant and key objects (abstract or concrete);
properties of these objects, and their values;
relationships between objects (part-whole, is-a, object-specific, one-to-many), and
relationships between properties across instances of an object type.
Consider, as a simplistic example, a conceptual model describing a supposed relationship between a presumed cause (hours of study) and a presumed effect (test score).
Relating the Data to the System
From a pragmatic perspective, stakeholders and analysts alike need to know if the data which has been collected and analyzed will be useful to understand the system.
This question can best be answered if we understand:
how the data is collected;
the approximate nature of both data and system, and
what the data represents (observations and features).
Is the combination of system and data sufficient to understand the aspects of the world under consideration? Once again, this is difficult to answer in practice.
Contextual knowledge can help, but if the data, the system, and the world are out of alignment, any data insight drawn from mathematical, ontological, programmatical, or data models of the situation might ultimately prove useless.
7.2.3 Cognitive Biases
Adding to the challenge of building good conceptual models and using these to interpret the data is the fact that we are all vulnerable to a vast array of cognitive biases, which influence both how we construct our models and how we look for patterns in the data.
These biases are difficult to detect in the spur of the moment, but being aware of them, making a conscious effort to identify them, and setting up a clear and pre-defined set of thresholds and strategies for analysis will help reduce their negative impact. Here is a sample of such biases.
- Anchoring Bias
causes us to rely too heavily on the first piece of information we are given about a topic; in a salary negotiation, for instance, whoever makes the first offer establishes a range of reasonable possibilities in both parties’ minds.
- Availability Heuristic
describes our tendency to use information that comes to mind quickly and easily when making decisions about the future; someone might argue that climate change is a hoax because the weather in their neck of the woods has not (yet!) changed.
- Bandwagon Effect
refers to our habit of adopting certain behaviours or beliefs because many others do the same; if all analyses conducted until now have shown no association between factors \(X\) and \(Y\), we might forego testing for the association in a new dataset.
- Choice-supporting Bias
causes us to view our actions in a positive light, even if they are flawed; we are more likely to sweep anomalous or odd results under the carpet when they arise from our own analyses.
- Clustering Illusion
refers to our tendency to see patterns in random events; if a die has rolled five 3’s in a row, we might conclude that the next throw is more (or less) likely to come up a 3 (the gambler’s fallacy).
- Confirmation Bias
describes our tendency to notice, focus on, and give greater credence to evidence that fits with our existing beliefs; gaffes made by politicians you oppose reinforce your dislike.
- Conservatism Bias
occurs when we favour prior evidence over new information; it might be difficult to accept that there is an association between factors \(X\) and \(Y\) if none had been found in the past.
- Ostrich Effect
describes how people often avoid negative information, including feedback that could help them monitor their goal progress; a professor might choose to not consult their teaching evaluations, for whatever reason.
- Outcome Bias
refers to our tendency to judge a decision on the outcome, rather than on why it was made; the fact that analysts gave Clinton an 80% chance of winning the 2016 U.S. Presidential Election does not mean that the forecasts were wrong.
- Overconfidence Effect
causes us to take greater risks in our daily lives; experts are particularly prone to this, as they are more convinced that they are right.
- Pro-innovation Bias
occurs when proponents of a technology overvalue its usefulness and undervalue its limitations; in the end, Big Data is not going to solve all of our problems.
- Recency Bias
occurs when we favour new information over prior evidence; investors tend to view today’s market as the “forever” market and make poor decisions as a result.
- Salience Bias
describes our tendency to focus on items or information that are more noteworthy while ignoring those that do not grab our attention; you might be more worried about dying in a plane crash than in a car crash, even though the latter occurs more frequently than the former.
- Survivorship Bias
is a cognitive shortcut that occurs when a visible successful subgroup is mistaken as an entire group, due to the failure subgroup not being visible; when trying to get the full data picture, it helps to know what observations did not make it into the dataset.
- Zero-Risk Bias
relates to our preference for absolute certainty; we tend to opt for situations where we can completely eliminate risk, seeking solace in the figure of 0%, over alternatives that may actually offer greater risk reduction.
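One of the biases above, the clustering illusion (and its companion, the gambler’s fallacy), can be checked directly by simulation: after a run of 3’s, the next throw of a fair die is still a 3 about one time in six. A minimal sketch:

```python
# Empirical check of the gambler's fallacy: conditioning on a streak of
# 3's does not change the probability that the next fair-die throw is a 3.
import random

random.seed(1)
rolls = [random.randint(1, 6) for _ in range(200_000)]

# Collect the throw immediately following every run of three 3's.
followers = [rolls[i + 3]
             for i in range(len(rolls) - 3)
             if rolls[i] == rolls[i + 1] == rolls[i + 2] == 3]

p = sum(1 for r in followers if r == 3) / len(followers)
print(round(p, 3))   # close to 1/6, i.e. about 0.167
```

Setting up a pre-defined check of this kind, rather than eyeballing streaks after the fact, is exactly the sort of strategy the paragraph above recommends for keeping biases in check.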
Other biases impact our ability to make informed decisions:
base rate fallacy, bounded rationality, category size bias, commitment bias, Dunning-Kruger effect, framing effect, hot-hand fallacy, IKEA effect, illusion of explanatory depth, illusion of validity, illusory correlations, look elsewhere effect, optimism effect, planning fallacy, representative heuristic, response bias, selective perception, stereotyping, etc.