About These Course Notes
The first thing to know about this book is that it isn’t really a “book”. It would make more sense to think of it as a reference manual and a source of examples and applications.
We borrow some of its contents from authors who are simply better at explaining things than we are; we also sometimes modify their examples and code to better suit our pedagogical needs.1 Major influences include [1], [2], [3], [4], [5], and [6] – be sure to give these masterful works the attention they deserve!
The second thing to know about this book is that it isn’t really “a” book. It would make more sense to think of it as a bunch of books in a trench coat, masquerading as a single one. No one is expected to traverse the book in one sitting, or even to tackle more than a few of its assigned modules/sections/subsections/exercises at any given time; rather, it is intended to be read in parallel with guided lectures.
The third thing to know about this book is that the practical examples use R and/or Python (for no particular reason other than that some programming language had to be used to illustrate the concepts). In the text, R code appears in blue boxes, whereas Python code appears in green boxes.
Now you may look at some piece of code and think to yourself: “This isn’t how I would do it” or “Such-and-such a task would be easier to accomplish if we used module/package ABC or programming language XYZ”. Perhaps. But finding the optimal tool is not the point of this book. In the first place, new data science tools appear regularly, and it would be a fool’s errand to try to continuously modify the book to keep up with them.2 In the second place, we are serious about the “Understanding” part of Data Understanding, Data Analysis, and Data Science, and that is why we favour a tool-agnostic approach.
The fourth thing to know about this book is that it is not the place to go for a detailed step-by-step guide on “how to solve it”. In person, our answer to a vast array of data science-related questions is, rather anticlimactically: “it depends”. Of course it depends: on the data, on the objectives, on the cost associated with making a mistake, on the stakeholder’s appetite for uncertainty, and, perhaps more surprisingly, on the analytical and preparation choices that are made along the way.
To some, this might smack of post-modernism: “you are saying that there is no truth, and that data analysis is pointless!” To which our response is: “analysts have agency (lots of it, as it turns out), and their choices DO influence the results… so run multiple analyses to determine the variability of the outcomes”. That is simply the nature of the discipline.
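As a rough illustration of what “run multiple analyses” can look like in practice, here is a minimal Python sketch. Everything in it is an assumption made up for this example: the toy `observations`, the two preparation choices (keep every value, or drop values above an arbitrary threshold of 50), and the two summaries (mean and median). It simply runs every combination of choices on the same data and reports how far apart the resulting estimates are.

```python
from statistics import mean, median

# Hypothetical toy data: nine ordinary observations and one extreme value.
observations = [12.0, 14.0, 13.0, 15.0, 11.0, 13.0, 14.0, 12.0, 13.0, 95.0]


def keep_everything(values):
    """Preparation choice 1: use the data exactly as recorded."""
    return list(values)


def drop_above_threshold(values, threshold=50.0):
    """Preparation choice 2: discard values above an (arbitrary) threshold."""
    return [v for v in values if v <= threshold]


# The analyst's choices: how to prepare the data, and how to summarize it.
preparations = {
    "keep all": keep_everything,
    "drop values > 50": drop_above_threshold,
}
summaries = {"mean": mean, "median": median}

# Run every combination of choices on the same data.
results = {}
for prep_name, prepare in preparations.items():
    prepared = prepare(observations)
    for summary_name, summarize in summaries.items():
        results[(prep_name, summary_name)] = summarize(prepared)

for (prep_name, summary_name), estimate in results.items():
    print(f"{prep_name:>18} | {summary_name:<6} | {estimate:.2f}")

# The gap between the largest and smallest estimate is one crude measure
# of how much the analyst's choices matter for this data set.
spread = max(results.values()) - min(results.values())
print(f"spread across analyses: {spread:.2f}")
```

With this particular toy data, three of the four analyses agree, while the mean computed on the unfiltered data is pulled upward by the single extreme value. Which answer is “right” depends, as ever, on the objectives and on what that extreme value represents.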
The last thing you should probably know about this book is that we have made a concerted effort to focus mainly on the story of (learning) data analysis and data science; sometimes, that comes at the expense of rigorous exposition.
“The early stages of education have to include a lot of lies-to-children, because early explanations have to be simple. However, we live in a complex world, and lies-to-children must eventually be replaced by more complex stories if they are not to become delayed-action genuine lies.” [7]
Some of the concepts and notions that we present are incomplete by design, but remain (we hope) true-to-their-spirit, or at least true “enough” for a first pass.3 Our position is that learning is an iterative process and that important take-aways from an early stage might need to be modified to account for new developments at a later date. But all things in good time.
Flexibility is a friend in your learning adventure; perfectionism, well… not always so.