7.1 Introduction

Let’s start by talking about data.

7.1.1 What Is Data?

It is surprisingly difficult to give a clear-cut definition of data – we cannot even seem to agree on whether it should be used in the singular or the plural:

“the data is …” vs. “the data are …”

From a strictly linguistic point of view, a datum (borrowed from Latin) is “a piece of information;” data, then, should mean “pieces of information.” We can also think of it as a collection of “pieces of information”, and we would then use data to represent the whole (being potentially greater than the sum of its parts) or simply the idealized concept.

When it comes to actual data analysis, however, is the distinction really that important? Is it even clear what data is, from the definition above, and where it comes from? Is the following data?

\[4,529\quad \text{red}\quad 25.782\quad Y\]

To paraphrase U.S. Justice Potter Stewart, while it may be hard to define what data is, “we know it when we see it.” This position may strike some of you as unsatisfying; to overcome this (sensible) objection, we will think of data simply as a collection of facts about objects and their attributes.

For instance, consider the apple and the sandwich below:

[Images: an apple; a sandwich]

Figure 7.1: An apple and a sandwich

Let us say that they have the following attributes:

  • Object: apple

    • Shape: spherical

    • Colour: red

    • Function: food

    • Location: fridge

    • Owner: Jen

  • Object: sandwich

    • Shape: rectangle

    • Colour: brown

    • Function: food

    • Location: office

    • Owner: Pat

As long as we remember that a person or an object is not simply the sum of its attributes, this rough definition should not be too problematic. Note, however, that there remains some ambiguity when it comes to measuring (and recording) the attributes.

We dare say that no one has ever beheld an apple quite like the one shown above: for starters, it is a 2-dimensional representation of a 3-dimensional object. Additionally, while the overall shape of the sandwich is vaguely rectangular (as seen from above, say), it is not an exact rectangle. While no one would seriously dispute the shape attribute of the sandwich being recorded as “rectangle”, a measurement error has occurred.

For most analytical purposes, this error is unlikely to be significant, but it cannot be dismissed out of hand for every task.

More problematic might be the fact that the apple’s shape attribute is given in terms of a volume, whereas the sandwich’s is recorded as an area; the measurement types are incompatible. Similar remarks can be made about all the attributes – the function of an apple may be “food” from Jen’s perspective, but from the point of view of an apple tree, that is emphatically not the case; the sandwich is definitely not uniformly “brown,” and so on.

Furthermore, there are a number of potential attributes that are not even mentioned: size, weight, time, etc. Measurement errors and incomplete lists are always part of the picture, but most people would recognize that the collection of attributes does provide a reasonable description of the objects. This is the pragmatic definition of data that we will use throughout.

7.1.2 From Objects and Attributes to Datasets

Raw data may exist in any format; we will reserve the term dataset to represent a collection of data that could conceivably be fed into algorithms for analytical purposes.

Often, these appear in a table format, with rows and columns; attributes are the fields (or columns) in such a dataset; objects are instances (or rows).

Objects are then described by their feature vector – the collection of attributes associated with value(s) of interest. The feature vector for a given observation is also known as the observation’s signature. For instance, the dataset of physical objects could contain the following items:

ID  shape      colour  function   location  owner
1   spherical  red     food       fridge    Jen
2   rectangle  brown   food       office    Pat
3   round      white   tell time  lounge    school
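
The rows, columns, and feature vectors described above can be sketched in code. The following is a minimal illustration in Python (a language choice of ours, not something this section prescribes); the names `FIELDS`, `dataset`, `feature_vector`, and `column` are hypothetical, and only the attribute values are taken from the table.

```python
# Hypothetical sketch: names are illustrative, values come from the table.
FIELDS = ("shape", "colour", "function", "location", "owner")

# Each object (row / instance) is a record mapping attribute names to values.
dataset = [
    {"ID": 1, "shape": "spherical", "colour": "red", "function": "food",
     "location": "fridge", "owner": "Jen"},
    {"ID": 2, "shape": "rectangle", "colour": "brown", "function": "food",
     "location": "office", "owner": "Pat"},
    {"ID": 3, "shape": "round", "colour": "white", "function": "tell time",
     "location": "lounge", "owner": "school"},
]


def feature_vector(row, fields=FIELDS):
    """Return the attribute values of one object, in a fixed field order."""
    return tuple(row[field] for field in fields)


def column(rows, field):
    """Return the values of one attribute (column) across all objects."""
    return [row[field] for row in rows]


# The signature of each object, keyed by its ID.
signatures = {row["ID"]: feature_vector(row) for row in dataset}

print(signatures[3])
print(column(dataset, "function"))
```

Reading across a record gives an object's signature; reading down a field gives all recorded values of one attribute. Keeping the field order fixed is what makes the signatures of different objects comparable.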

We will revisit these notions in Structuring and Organizing Data.

7.1.3 Data in the News

We collected a sample of headlines and article titles showcasing the growing role of data science (DS), machine learning (ML), and artificial/augmented intelligence (AI) in different domains of society.

While these demonstrate some of the functionality/capabilities of DS/ML/AI technologies, it is important to remain aware that new technologies are always accompanied by emerging (and not always positive) social consequences.

  • “Robots are better than doctors at diagnosing some cancers, major study finds” [76]

  • “Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet” [77]

  • “Google AI claims 99% accuracy in metastatic breast cancer detection” [78]

  • “Data scientists find connections between birth month and health” [79]

  • “Scientists using GPS tracking on endangered Dhole wild dogs” [80]

  • “These AI-invented paint color names are so bad they’re good” [81]

  • “We tried teaching an AI to write Christmas movie plots. Hilarity ensued. Eventually.” [82]

  • “Math model determines who wrote Beatles’ "In My Life": Lennon or McCartney?” [83]

  • “Scientists use Instagram data to forecast top models at New York Fashion Week” [84]

  • “How big data will solve your email problem” [85]

  • “Artificial intelligence better than physicists at designing quantum science experiments” [86]

  • “This researcher studied 400,000 knitters and discovered what turns a hobby into a business” [87]

  • “Wait, have we really wiped out 60% of animals?” [88]

  • “Amazon scraps secret AI recruiting tool that showed bias against women” [89]

  • “Facebook documents seized by MPs investigating privacy breach” [90]

  • “Firm led by Google veterans uses A.I. to ‘nudge’ workers toward happiness” [91]

  • “At Netflix, who wins when it’s Hollywood vs. the algorithm?” [92]

  • “AlphaGo vanquishes world’s top Go player, marking A.I.’s superiority over human mind” [93]

  • “An AI-written novella almost won a literary prize” [94]

  • “Elon Musk: Artificial intelligence may spark World War III” [95]

  • “A.I. hype has peaked so what’s next?” [96]

 

Opinions on the topic are varied – to some, DS/ML/AI provide examples of brilliant successes, while to others it is the dangerous failures that are at the forefront.

What do you think?

7.1.4 The Analog/Digital Data Dichotomy

Humans have been collecting data for a long time. In the award-winning Against the Grain: A Deep History of the Earliest States, J.C. Scott argues that data collection was a major enabler of the modern nation-state (he also argues that this was not necessarily beneficial to humanity at large, but this is another matter altogether) [97].

For most of the history of data collection, humans were living in what might best be called the analogue world – a world where our understanding was grounded in a continuous experience of physical reality.

Nonetheless, even in the absence of computers, our data collection activities were, arguably, the first steps taken towards a different strategy for understanding and interacting with the world. Data, by its very nature, leads us to conceptualize the world in a way that is, in some sense, more discrete than continuous.

By translating our experiences and observations into numbers and categories, we re-conceptualize the world into one with sharper and more definable boundaries than our raw experience might otherwise suggest. Fast-forward to the modern world and the culmination of this conceptual discretization strategy is clear to see in our adoption of the digital computer, which represents everything as a series of 1s and 0s.
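
To make the "1s and 0s" concrete, here is a small hedged sketch in Python (our choice of language) that shows the bit patterns behind two of the values from the row in Section 7.1.1, "red" and 25.782. The helper `to_bits` is hypothetical, and the encodings used (UTF-8 for text, a 64-bit IEEE 754 float for the number) are assumptions about one common representation, not the only possible one.

```python
import struct
from decimal import Decimal


def to_bits(raw: bytes) -> str:
    """Render a byte string as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in raw)


# A category label stored as UTF-8 text: three bytes, one per character.
text_bits = to_bits("red".encode("utf-8"))

# A measurement stored as a big-endian 64-bit floating-point number.
float_bits = to_bits(struct.pack(">d", 25.782))

# The stored float is the nearest representable value, not exactly 25.782.
stored_value = Decimal(25.782)
is_exact = stored_value == Decimal("25.782")

print(text_bits)
print(float_bits)
print(stored_value, is_exact)
```

The last check illustrates the discretization discussed above: the continuous quantity 25.782 is replaced by the closest value that a finite string of bits can express.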

Somewhat surprisingly, this very minimalist representational strategy has been wildly successful at representing our physical world, arguably beyond our most ambitious dreams, and we find ourselves now at a point where what we might call the digital world is taking on a reality as pervasive and important as the physical one.

Clearly, this digital world is built on top of the physical world, but very importantly, the two do not operate under the same set of rules:

  • in the physical world, the default is to forget; in the digital world, the default is to remember;

  • in the physical world, the default is private; in the digital world, the default is public;

  • in the physical world, copying is hard; in the digital world, copying is easy.

As a result of these different rules of operation, the digital is making things that were once hidden, visible; once veiled, transparent. Considering data science in light of this new digital world, we might suggest that data scientists are, in essence, scientists of the digital, in much the same way that regular scientists are scientists of the physical: data scientists seek to discover the fundamental principles of data and understand the ways in which these fundamental principles manifest themselves in different digital phenomena.

Ultimately, however, data and the digital world are tied to the physical world. Consequently, what is done with data has repercussions in the physical world, and it is crucial for analysts and consultants to have a solid grasp of the fundamentals and context of data work before leaping into the tools and techniques that drive it forward.

References

[76]
L. Donnelly, “Robots are better than doctors at diagnosing some cancers, major study finds,” The Telegraph, May 2018.
[77]
N. Bien, P. Rajpurkar, et al., “Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet,” PLOS Medicine, vol. 15, no. 11, pp. 1–19, 2018, doi: 10.1371/journal.pmed.1002699.
[78]
[79]
Columbia University Irving Medical Center, “Data scientists find connections between birth month and health,” Newswire.com, Jun. 2015.
[80]
[81]
S. Reichman, “These AI-invented paint color names are so bad, they’re good,” Curbed, May 2017.
[82]
[83]
[84]
Indiana University, “Scientists use Instagram data to forecast top models at New York Fashion Week,” Science Daily, Sep. 2015.
[85]
J. Hiner, “How big data will solve your email problem,” ZDNet, Oct. 2013.
[86]
[87]
[88]
E. Yong, “Wait, have we really wiped out 60% of animals?” The Atlantic, Oct. 2018.
[89]
[90]
[91]
[92]
S. Ramachandran and J. Flint, “At Netflix, who wins when it’s Hollywood vs. the algorithm?” Wall Street Journal, Nov. 2018.
[93]
[94]
D. Lewis, “An AI-written novella almost won a literary prize,” Smithsonian Magazine, Mar. 2016.
[95]
[96]
T. Rikert, “A.I. hype has peaked so what’s next?” TechCrunch, Sep. 2017.
[97]
J. C. Scott, Against the grain: A deep history of the earliest states. New Haven: Yale University Press, 2017.