1.1 Programming Fundamentals

Have you ever taken the time to ask yourself what computer code and computer programs are? Is there a difference between these two concepts? (see [8] for an interesting discussion on the topic).

In a nutshell, a computer program is an algorithm, written in a computer language, providing instructions to a computer for carrying out a series of operations.

An example of a computer program is provided in Figure 1.1.

Figure 1.1: An example of a computer program written in the computer language C. What do you suppose this program does? [[Programiz](https://www.programiz.com/c-programming/examples/swapping)]

Computer programs can be compiled or interpreted as a series of hardware operations, carried out by the electrical components of a computer.

1.1.1 Compiled vs. Interpreted Languages

Compilers translate full source code programs, written in a high-level language (i.e., using natural language elements, “understandable” by people, as in Figure 1.1), into machine language (i.e., binary code, only “understandable” by computers): they are basically grammatical (syntactic) checkers – if the source code is error-free, it is converted into machine code, which is eventually packaged into an executable file. Compiled code runs quickly, and is thus favoured for the deployment phase. Commonly-used compiled languages include C/C++/C#, COBOL, Fortran, Pascal, and Julia.

Interpreters execute the source code directly: as long as an individual statement is error-free (in the context of the available workspace), it can be executed every time it is called, without regard for the overall syntax of the file. Interpreted code generally runs more slowly, and interpreted languages are favoured during the development phase. Commonly-used interpreted languages include R, Python, JavaScript, and Ruby.
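The difference can be felt directly in an interpreted language: each statement is checked and executed on its own, so earlier statements run even when a later one is broken, whereas a compiler would reject the whole file. A minimal Python sketch (the statements and the deliberate error are ours, purely for illustration):

```python
# an interpreter executes statements one at a time, in a shared workspace
ns = {}  # the "workspace" (the variables available so far)
statements = [
    "x = 21",
    "print(x * 2)",       # runs fine, even though the next statement is broken
    "print(undefined_y)"  # only fails when this statement is reached
]
for statement in statements:
    try:
        exec(statement, ns)
    except NameError as err:
        print("error:", err)
```

The first two statements execute and print 42; only the third raises an error, and only when its turn comes, which is exactly the behaviour described above.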

1.1.2 Some Fundamental Concepts

We have been using the terms “computer language” and “algorithm” as though they were everyday words. Let us take the time to ensure that their meanings are clear.

Formal Language

In a formal language, words are created by combining letters from a pre-defined alphabet, according to the rules provided by a formal grammar.

Everything that is formed according to the rules is an acceptable word; anything else is not.

Example: Consider the formal language defined with

  • alphabet: {a, b, C, D,!}

  • grammatical rules:

    1. letters may only be placed immediately to the left or to the right of another letter

    2. a letter instance must always be accompanied by another instance of the same letter at some location either to its left and/or to its right (or both)

    3. an upper case letter must always be accompanied by a lower case letter immediately to its left or to its right

Thus, aa is a word in this formal language (rules 1 and 2 are clearly satisfied; rule 3 is satisfied vacuously), as is bCaCab, but !aC!, DDaa, and Patrick are not (why?).
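Membership in such a language can be checked mechanically; here is a Python sketch of a validator for this toy language (the function name and implementation choices are ours):

```python
def is_word(w):
    """Check whether w is a word of the toy language over {a, b, C, D, !}."""
    alphabet = set("abCD!")
    # rule 1: a word is just letters of the alphabet, placed side by side
    if not w or any(ch not in alphabet for ch in w):
        return False
    # rule 2: every letter instance is accompanied by another instance
    #         of the same letter somewhere in the word
    if any(w.count(ch) < 2 for ch in set(w)):
        return False
    # rule 3: every upper-case letter has a lower-case letter
    #         immediately to its left or to its right
    for i, ch in enumerate(w):
        if ch.isupper():
            neighbours = w[max(i - 1, 0):i] + w[i + 1:i + 2]
            if not any(n.islower() for n in neighbours):
                return False
    return True

print([is_word(w) for w in ["aa", "bCaCab", "!aC!", "DDaa", "Patrick"]])
# → [True, True, False, False, False]
```

The output agrees with the discussion above: !aC! fails rule 2 (a single a), DDaa fails rule 3 (the first D has no lower-case neighbour), and Patrick uses letters outside the alphabet.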

Formal languages can sometimes seem ridiculous – of course letters may only be placed to the left or to the right of other letters… where else would they go? Well, rule 1 officially (and formally) eliminates letters piling up on top of one another, for starters, but also spaces between words (for that language, the space _ is not in the alphabet of letters).

Human languages, of the sort deemed natural (in contrast with artificial or constructed languages) are formal, in theory. In practice, their grammars tend to be flexible (more so with English than French, say) – syntax evolves with cultures (in time and in space), and semantics (meaning) can be retained even when the grammar is mangled … but only up to a point.

Computer Language

Computer languages are languages constructed to provide instructions to a computer, in such a way that they can be compiled into low-level instructions that the computer processor can execute.

Computer languages are also called programming languages, for reasons that will soon become obvious.

They are formal languages because if the grammatical rules are not followed to the letter, the program cannot be executed – computers cannot guess or infer what the programmer really meant when the syntax is out of sorts.

The structure of the formal definition of a computer language contains the following sections:

  1. Lexical Rules

  2. Syntax Rules

    • Grammar Productions

    • Operator Associativities and Precedences

  3. Typing Rules

    • Declarations

    • Type Consistency Requirements (Function Definitions, Expressions, Statements)

  4. Operational Characteristics

    • Data (Scalars, String Constants, Arrays)

    • Expressions (Order of Evaluation, Type Conversion, Array Indexing)

    • Assignment Statements (Order of Evaluation, Type Conversion)

    • Functions (Evaluation of Actuals, Parameter Passing, Return From a Function)

  5. Program Execution

As an illustration, the lexical rules of C are shown below:

Figure 1.2: Lexical rules of the programming language C [[Debray](https://www2.cs.arizona.edu/~debray/Teaching/CSc453/DOCS/cminusminusspec.html)].


Computer programs are algorithms, which is to say, sequences of instructions with (at least) one well-defined stopping point (an instruction that tells the program when to stop running).

Algorithms are not always mathematical or computer-based. In some sense, we could think of recipes as algorithms as well: the baking/cooking steps are presented in sequence, and there is some last step that must be completed before the end product can be eaten.

For instance, here is an algorithm to make muffins:

  1. Pour 1/2 cup of flour into a bowl

  2. Break one egg into the bowl

  3. Pour 3 tablespoons of oil into the bowl

  4. Pour 1 teaspoon of baking powder into the bowl

  5. Pour 1/4 cup of sugar into the bowl

  6. Mix with spoon until smooth

  7. Pour the mixture into muffin tins

  8. Bake for 15 minutes at 350 degrees Fahrenheit

  9. Let cool before eating

What is the stopping point? What is the outcome?

1.1.3 Code Components

Various sets of instructions, conventions, and structures are so fundamental to computer programming aims that they can be found in nearly all computer languages.

These fundamental code elements include:
  • Variables

  • Data Structures

  • Operators

  • Statements and Expressions

  • Blocks (and Scope)

  • Functions

  • Logical (Control) Flow

  • Libraries/Packages/Modules

  • Inputs/Outputs

  • Interpreters/Compilers

How these components mesh with one another depends on the syntax of the programming language under consideration (or its dialect).

In Figure 1.3, we see how this could be done in base R, for instance.

Figure 1.3: Computer code elements in action, for the scripting language R.

This particular chunk of code uses the

  • igraph library (specifically, its pre-compiled functions sample_gnp(), ecount(), V(), and plot())

and builds the

  • user-defined function my_graph_function() using a code block,

which takes in as

  • inputs

  • variables my_number_nodes, my_colour, and my_density.

This function creates a

  • graph data structure my_graph

and colours the graph’s vertices using my_colour as long as

  • some conditional logic statement relating to the number of edges in the graphs and my_number_nodes is satisfied.

The function generates a visualization of the graph as an

  • output,

which is displayed when the function call is issued.

The code is seen in action below: it creates and displays a 30-node, green-coloured, non-directed, loop-free graph with probability 0.3 of there being an edge between two arbitrary nodes (see Network Data Analysis for an introduction to these concepts).


library(igraph)

my_graph_function <- function(my_number_nodes, my_colour, my_density) {
  my_graph <- sample_gnp(my_number_nodes, my_density, directed = FALSE, loops = FALSE)
  if(ecount(my_graph) >= my_number_nodes) {
    V(my_graph)$color <- my_colour
  }
  plot(my_graph, layout = layout.fruchterman.reingold, vertex.color = V(my_graph)$color)
}

my_graph_function(30, "green", 0.3)

All of this might look mysterious and opaque at first; it is important to remember that mastering a computer language requires time and practice.

Some languages are dialects or variants of other languages (or at least, the languages are mutually intelligible, like Swedish and Danish, say), and so being proficient in one can make it easier to become proficient in another. Not all programming languages arise from the same paradigm, however: imperative languages (object-oriented programming) live in a different “linguistic family” than declarative languages (functional programming).

1.1.4 Designing with Pseudo-Code

Before we can start thinking about writing code (in whatever programming language whose syntax we have mastered), we need to think about what it means to design an algorithm (or a computer program).

From a mathematical perspective, an algorithm is a (stochastic) function. We need to specify:

  • the algorithm’s inputs;

  • its outputs, and

  • the procedure to transform the inputs into the outputs.

In the muffin recipe above, the ingredients are the inputs, the muffins themselves are the outputs, and the recipe instructions describe the transformation.
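This input–procedure–output structure can be rendered in code; here is a toy Python sketch of the recipe as a function (all names and the string representation are ours, purely for illustration):

```python
def make_muffins(ingredients):
    """Inputs: a list of ingredient names. Output: a batch of 'muffins'."""
    batter = " + ".join(ingredients)   # steps 1-6: combine and mix
    muffins = f"baked({batter})"       # steps 7-8: pour into tins and bake
    return muffins                     # the return is the stopping point

print(make_muffins(["flour", "egg", "oil", "baking powder", "sugar"]))
# → baked(flour + egg + oil + baking powder + sugar)
```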

It is good programming practice to avoid typing up programs on the fly – code needs to be planned: we need to know what the program will do and how it will go about doing it before we commit it to a file, independently of the language in which it will be implemented.

Pseudo-code is a term used to describe a rough sketch of the algorithm, which indicates its expected inputs, outputs, and steps, while leaving the specifics of its functionality in “black boxes”.

Pseudo-code is usually designed with the main elements of code (e.g. variables, functions, logical flow, etc.), in a language-agnostic (i.e., human readable) manner.

Example: we might be interested in building an algorithm that would cluster the observations in a dataset, using a maximum number of “local” observations (see Machine Learning 101 and Spotlight on Clustering for an in-depth discussion of these concepts).

What does the following chunk of pseudo-code (which is part of the bigger clustering picture) do?

find_neighbours(array_of_points, max_n_neighbour_distance)
    for each point[i] in array_of_points
        for each remaining point[j] in array_of_points
            distance_between_ij = distance(point[i], point[j])
            if distance_between_ij <= max_n_neighbour_distance
                then neighbours[i] = add_to_neighbours(point[i], point[j])

This is what is happening:

  • the algorithm find_neighbours() takes as inputs a dataset array_of_points and a quantity max_n_neighbour_distance;

  • for each observation point[i] in the dataset (i indexes the observations), it considers all other observations point[j] and computes their distances to the initial observation point[i] (one by one);

  • when these distances are smaller than the input threshold max_n_neighbour_distance, it considers that the corresponding observation point[j] is a neighbour of observation point[i], and adds the former to the neighbours of observation point[i].

Evidently, this chunk of pseudo-code defines the neighbourhood of each observation in the dataset.

Note the black box functions distance() and add_to_neighbours(): their specifics are not provided (their eventual implementation may change depending on the computer language selected to write the program), but what they represent is clear. That is the power of pseudo-code.
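Once a language is chosen, the black boxes can be opened. Here is one possible Python rendering of the pseudo-code, with distance() filled in as the Euclidean distance (a choice on our part; any other metric would do):

```python
import math

def distance(p, q):
    # "black box" opened: Euclidean distance (one choice among many)
    return math.dist(p, q)

def find_neighbours(array_of_points, max_n_neighbour_distance):
    neighbours = {i: [] for i in range(len(array_of_points))}
    for i in range(len(array_of_points)):
        for j in range(i + 1, len(array_of_points)):  # each remaining point
            d_ij = distance(array_of_points[i], array_of_points[j])
            if d_ij <= max_n_neighbour_distance:
                neighbours[i].append(j)  # "add_to_neighbours", both ways,
                neighbours[j].append(i)  # since the distance is symmetric
    return neighbours

print(find_neighbours([(0, 0), (1, 0), (5, 0)], 2))
# → {0: [1], 1: [0], 2: []}
```

In the small example above, the first two points lie within distance 2 of each other and so are mutual neighbours; the third point has no neighbours.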

Of course, in practice, we also do not sit down and write pseudo-code on the fly… that too must be planned (see Figure 1.4).

Figure 1.4: The first stage of pseudo-coding, in all its chicken scratch glory (file courtesy of J. Schellinck).

Getting a feel for the right level of pseudo-code detail takes practice: should we drill down into what add_to_neighbours() does? Do we need to describe what <= does? How much utility should be sacrificed in favour of understanding?

To some extent, answers to these questions depend on the level of abstraction of the programming language used to implement the algorithm:

  • high-level languages (such as R and Python) contain tons of built-in functions, which allow for programming at higher levels of abstraction, whereas

  • many details and functions must be programmed “by hand” in low-level languages (such as assembly and machine languages), which require lower levels of abstraction.
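As a small illustration of what “level of abstraction” means in practice, here is the same computation twice in Python: once with a built-in, once “by hand”:

```python
numbers = [3, 1, 4, 1, 5]

# higher level of abstraction: a built-in does all the work
total_builtin = sum(numbers)

# lower level of abstraction: the same computation, step by step
total_manual = 0
for n in numbers:
    total_manual += n

print(total_builtin, total_manual)  # → 14 14
```

In a truly low-level language, even the loop and the addition would decompose further, into register moves and jump instructions.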

The strategy to write useful pseudo-code is deceptively simple:

  1. define the available inputs;

  2. define the desired outputs, and

  3. identify (and write down) a set of programmatic instructions (procedure) to transform the inputs into the outputs.

This is easier said than done, obviously, and it looks an awful lot like the definition of an algorithm we gave previously, but remember that parts of the pseudo-code can be “black boxed”, which is to say, that functionality can be described at a high level.

  1. Write pseudo-code that will sort a list of numbers. Identify the inputs and the outputs, and solve the problem “procedurally” on a definite example before generalizing to a general list. You may need to “black box” the manipulation of individual numbers and group of numbers within the list.

  2. Write pseudo-code that will enumerate all strings of up to n characters taken from the set A-Z, with no repeated character. Identify the inputs and the outputs, and solve the problem “procedurally” on a definite example before generalizing. Use “black boxes” as needed.

1.1.5 From Pseudo-Code to Code that Runs

Once we are satisfied that the pseudo-code provides a decent path to solving the problem at hand, we can start thinking about how to implement it as real code (“code that runs”):

  1. we start by determining the appropriate syntax for the computer language that will be used and we re-write the pseudo-code as syntactically correct code in this language;

  2. we replace all “black box” functions with real code, and

  3. we determine how to connect the real code (the software) to the computer, so that it can be compiled/interpreted, and run by the computer (receiving inputs and generating outputs).

It might take multiple tries before this is done successfully. That is to be expected. It takes time, even for the most gifted programmer, to become an expert in a new language. The urge to feel defeated if (when?) the first few attempts fail is completely natural; as always, practice is the answer.

The process of taking the high-level code (which is really a text file) and getting it to run on a computer without a hitch requires a certain amount of infrastructure to be in place:

  • libraries

  • input/output + file system

  • compilers/interpreters

In this book, we take care of many of these issues by setting up the R/Python examples internally and running them locally (using our infrastructure); this works well for illustrating the concepts and working with pedagogical datasets, but the infrastructure conundrum must be tackled and solved before it becomes possible to produce useful and actionable data analysis results (see Data Engineering, Data Management, Reporting, and Deployment for more details).

In general, there is no single authoritative reference manual describing how to use a particular computer language and/or how to make code run on particular hardware configurations, in no small part because coding and computer references become obsolete in the blink of an eye.

Successful coders must be embedded in a community of coders. Luckily, this is getting easier every day – most questions anybody could ever have about specific aspects of coding have already been answered somewhere online. Stack Exchange and similar sites can be quite useful in that regard.

As a last remark on the topic, keep in mind that in the real coding world, there is no such thing as cheating: the objective is to make the things you want to see happen, happen. Getting help along the way is emphatically not prohibited (mind you, it is good practice to cite or acknowledge such help).

1.1.6 Debugging

    99 little bugs in the code,
    99 bugs in the code,
    fix one bug, compile it again,
    141 little bugs in the code.
    141 little bugs in the code…..
        (Repeat until bugs = 0)

So mistakes WILL happen. What do we do about that?

In the development phase, coding is about getting all the moving pieces to fit together, yes, but it is also about fixing the bugs; a bug is an “error in the source code that causes a program to produce unexpected results or crash altogether” [Techslang]. Fixing these bugs (debugging) is mainly about revealing what is in memory at different points in the control flow of the code, to determine if it is actually doing what we think it ought to be doing.

As the quote at the start of the section implies, debugging is a bit of an art form, requiring the programmer to become a detective and a zen master [The Tao of Programming]. It teaches perseverance and humility, and it really helps us perfect our understanding of the language, of the code itself, and of the task at hand.

Debugging tools can help with all of this; at our level, debugging often requires running the code line-by-line until we can identify the chunk of code that is the culprit.
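At its simplest, “revealing what is in memory” can be done with print statements at key points in the control flow. A toy Python illustration (the bug is deliberate, and all names are ours):

```python
def mean(values):
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)   # bug: the denominator is off by one

vals = [2, 4, 6]
# revealing what is in memory at different points in the control flow:
print("len(vals) =", len(vals))      # 3, so the denominator above is 2...
print("mean(vals) =", mean(vals))    # → 6.0, but the true mean is 4.0
```

Printing the length exposes the faulty denominator; dedicated debuggers automate exactly this kind of inspection, pausing the program and displaying the workspace at chosen lines.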

1.1.7 R/Python

But there is only so much we can say about programming in general; at some point, we need to select a computer language and get going in earnest.

At a foundational level, most programming languages are roughly equivalent (Turing-complete or Turing-equivalent), in the sense that anything that can be done with one can also (more or less) be done with another. But that does not mean that they are all equally useful.

Some are better suited to certain tasks, whether because they are less memory-intensive, or more elegant, or more intuitive, etc.

Even in the data analysis world, there are competing paradigms. In this book, we will introduce and use two of the most popular languages (although by no means the only popular ones): R and Python.

In the examples we provide, R code appears in blue boxes:

... some R code ...

Whereas Python code appears in green boxes:

... some Python code ...

Object-Oriented Languages vs. Procedural Languages

R and Python are object-oriented languages, as opposed to procedural languages. What does that mean?

The focus of procedural programming is to break down a programming task into a collection of variables, data structures, and subroutines, whereas in object-oriented programming it is to break down a programming task into objects that expose behavior (methods) and data (members or attributes) using interfaces. The most important distinction is that while procedural programming uses procedures to operate on data structures, object-oriented programming bundles the two together, so an “object”, which is an instance of a class, operates on its “own” data structure. [9]

This will make more sense if we first understand the concepts of:

  • data types

  • data structures

  • functions

Languages have a set of built-in basic variable types, such as:

  • integer: 5

  • character: ‘m’

  • list: (5, 3, 9)

Other variables types can be built up out of these basic types, such as

  • strings, which are lists of characters: (‘t’, ‘a’, ‘b’, ‘l’, ‘e’)

A user might also want to define their own set of related variables – a data structure:

  • struct myNames = {string firstName, string middleName, string lastName}

  • jenNames might be a variable of type myNames, with firstName = Jen, middleName = Adele, lastName = Schellinck.

In addition, a programmer might want to be able to carry out a set of predefined instructions, or functions, on that data structure:

  • jenNames.print_middleName or

  • jenNames.string_length_lastName, say (what these functions do should be clear from their names).

Loosely speaking, an object is a user-defined data structure, together with a set of functions that are specific to that structure.
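The myNames example above, rendered as a (hypothetical) Python class, shows the data structure and its functions (methods) bundled into a single object:

```python
class MyNames:
    # the user-defined data structure: three string members
    def __init__(self, firstName, middleName, lastName):
        self.firstName = firstName
        self.middleName = middleName
        self.lastName = lastName

    # the functions specific to that structure
    def print_middleName(self):
        print(self.middleName)

    def string_length_lastName(self):
        return len(self.lastName)

jenNames = MyNames("Jen", "Adele", "Schellinck")
jenNames.print_middleName()               # → Adele
print(jenNames.string_length_lastName())  # → 10
```

Note how the object jenNames operates on its “own” data, which is the distinction drawn in the quote above.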

The data frame object in R is structured similarly to a spreadsheet:

  • it has rows and columns, with associated row and column names, and

  • we can carry out predefined operations (mean, count, etc.) on specific values, on selected rows or columns, or on the data frame as a whole.

Learners that are familiar with databases and/or languages that are more vector-focused (e.g. Java) might find the data frame implementation in R frustrating; those who are familiar with matrices and other mathematical concepts used in data analysis, less so.