1.5 Basics of Python

Python is another object-oriented language (OOL). It was created in the early 1990s but did not become widely popular until the 2000s. It lends itself to writing structured, easy-to-read computer code.

It is intended to be easier to understand and learn than other OOLs. One of its strengths is its massive base of open-source modules, which allow programmers to implement very sophisticated functionality simply by making a few function calls (not unlike R’s packages).

More information is available from the Python Software Foundation, on Stack Exchange (and similar sites), and in reference manuals, such as Jake VanderPlas’ A Whirlwind Tour of Python or the Python 3 documentation.

1.5.1 Integrated Development Environments for Python

For data science purposes, Anaconda and Jupyter are popular Python integrated development environments (IDEs); Rodeo, Spyder, PyCharm, Ninja (and others) also provide RStudio-like functionality for Python. Installation instructions are available on the respective websites.

We will not explain how to install and set up Python on your machine at this stage, the way we did for R and RStudio, although we will revisit this in Data Engineering and Data Management and in Reporting and Deployment.

1.5.2 Introduction to Python

The content of this section (and the next one) is intended to help data analysts get a better sense of how Python could be used for data analysis. These sections are not designed to teach the ins and outs of Python programming; instead, they illustrate typical tasks through examples.


Let us start with the basics.

Using Python as a Scientific Calculator

Mathematical expressions can easily be evaluated numerically in Python. For scientific calculations, one should import the math module (package/library) which contains many mathematical functions.

It is important to note that Python also provides facilities for integer arithmetic which will be covered later. In this section, only floating-point calculations are used.

Modules can be imported using the import statement.

import math

We can call pre-compiled functions in a module by prepending the module name (with a period) to the function name: module.function_name() is the Python equivalent of package::function_name() in R.

For instance, there is a cos function in the math module: it is called using math.cos().

We can evaluate \(\cos(\sqrt{\pi})\) with:

math.cos( math.sqrt( math.pi ) )

\(\arctan (2^5/3)\) with

math.atan( 2**5 / 3 )

and \(\ln(1+e^4)\) with

math.log( 1 + math.exp(4) )

Using Variables to Hold Intermediate Results

It could be helpful to break complex calculations into smaller steps. Variables can be used to store intermediate results. We will see later how variables are used in algorithmic settings.

For instance, we could break down the evaluation of \(\exp(\sin(\sqrt{2}+2))\) into three parts:

  • \(x=\sqrt{2}\)

  • \(y=\sin(x+2)\)

  • \(z = \exp(y)\)

x = math.sqrt(2)
y = math.sin(x+2)
z = math.exp(y)

In order to display the values taken by the variables, we must call on them explicitly, as below:

x, y, z
(1.4142135623730951, -0.26925647329402774, 0.7639472984402832)

The variables are saved even when they are not displayed, however.

Numbers as Formatted Strings

Quite often, we may want to control the way numbers are displayed (this can come in handy when reporting results). For example, we may wish to display no more than 4 decimal places for all real numbers, or we may want to pad numbers with zeros so that they all have a given width.

The following block illustrates a number of ways to obtain formatted strings of the number 12.3456789. For more details on the format specification mini-language, please consult the documentation.

Note that a string must be enclosed within double quotes or single quotes. We will discuss general string operations shortly.

x = 12.3456789

We can format the number as a string of width 10, with 2 decimal places:

'     12.35'

or as a string with 4 decimal places:


or as a zero-padded string of width 5, with no decimal:
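The three formats described above could be produced in several equivalent ways. The sketch below uses the built-in format() function with the format specification mini-language; the names s1, s2, and s3 are ours, and the original notes may well have used str.format() or f-strings instead.

```python
x = 12.3456789

s1 = format(x, '10.2f')   # width 10, 2 decimal places
s2 = format(x, '.4f')     # 4 decimal places
s3 = format(x, '05.0f')   # zero-padded, width 5, no decimals

print(repr(s1), repr(s2), repr(s3))
```

The same specifications can be used inside an f-string, as in f'{x:10.2f}'.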

Fixed Decimals

Floating-point arithmetic is inherently inexact, because most decimal fractions cannot be represented exactly in binary. For example, we might be bewildered to find out what the following sum amounts to:

2.2 + 1.1

The result, 3.3000000000000003, is definitely not the sum we would expect, namely 3.3.

The decimal module allows us to express decimal numbers exactly (see the documentation for more information).

Let’s look at a few examples of working with decimal and Decimal().

We start by defining x and y as the fixed decimal values 1.1 and 2.2, respectively. Note that the numbers must be entered as strings.

import decimal

x = decimal.Decimal("1.1") 
y = decimal.Decimal("2.2") 

These computations behave as we would expect:

print( x+y )
print( y/x )
print( x**decimal.Decimal("3") ) 

If we do not enter the numbers as strings, they are first stored as floating-point numbers, and Decimal() then faithfully reproduces their inexact binary representations, leading to unexpected results.

x = decimal.Decimal(1.1) 
y = decimal.Decimal(2.2)

print( x+y ) 

Rounding works as one would expect when variables are correctly declared as fixed decimals:

z = decimal.Decimal("3.1416")
round( z, 3 )

Once fixed decimals are used, we must use mathematical functions provided by the decimal module in order to stay within that module (unfortunately, trigonometric functions are not available).

For instance, if

a = decimal.Decimal("0.16")

then its square root, natural logarithm, and base-10 logarithm are given by:

print( a.sqrt() )
print( a.ln() )  
print( a.log10())

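Similar values could be obtained using the math module functions, but the arguments are then converted to floating-point numbers and the results are floats rather than fixed decimals: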

import math 
print( math.sqrt(a) )
print( math.log(a) )  
print( math.log10(a) )
  1. Evaluate \(\lfloor 10001/4 \rfloor\) and \(\arcsin (\pi/4)\).

  2. Obtain the value of \(s\) in the following: \(a=\pi(1+\ln 5)\), \(b=\frac{1}{3+\sqrt{4}}\) and \(s=a+b\).

  3. Obtain a formatted string of \(\sin(\pi^2)\) of width 8, with 5 decimal places.

  4. Turn the value of \(\sqrt{3}\) into a fixed decimal with 8 decimal places.

Lists and Tuples

Lists and tuples are important objects in Python programming. Even though we will be mostly using numpy arrays and certain pandas objects instead of lists later on, it is useful to learn the basics of lists as some of the concepts are transferable.

List Creation

A list holds a sequence of objects, which do not all have to be of the same type.

One way to create a list is to enclose the elements, separated by commas, with square brackets.

Let us illustrate this concept with a simple list containing three objects.

x = [3,'a',5.1]  

We can extract the elements using indices (note that the first element corresponds to index 0, the second to index 1, etc.):
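As a minimal sketch of zero-based indexing (the names first, second, and third are ours, introduced only for illustration):

```python
x = [3, 'a', 5.1]

first = x[0]    # index 0: first element
second = x[1]   # index 1: second element
third = x[2]    # index 2: third element

print(first, second, third)
```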


The type of each of the elements can be found below

print( type(x[0]) ) 
print( type(x[1]) ) 
print( type(x[2]) ) 
<class 'int'>
<class 'str'>
<class 'float'>

We can also “multiply” a list in order to repeat its elements and obtain a longer list:

['Ho', 'Ho', 'Ho', 'Ho', 'Ho', 'Ho', 'Ho', 'Ho', 'Ho', 'Ho']

or create a list of integers ranging from \(0\) to \(n-1\), or from \(a\) to \(b-1\):

n = 5

[0, 1, 2, 3, 4]
[3, 4, 5, 6]
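The outputs above are consistent with list repetition and with the built-in range() function. The following sketch shows one way they could be produced; the endpoints 3 and 7 are inferred from the displayed output, and the variable names are ours.

```python
ho = ['Ho'] * 10              # repeat a one-element list 10 times

n = 5
first_n = list(range(n))      # integers from 0 to n-1
a_to_b = list(range(3, 7))    # integers from a=3 to b-1=6

print(ho)
print(first_n)
print(a_to_b)
```

Note that range() produces a lazy sequence; wrapping it in list() materializes the values.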

Tuples are list-like objects, but with the following differences:

  • they are defined with parentheses instead of square brackets (sometimes, the parentheses can be omitted);

  • they are immutable (once created, they cannot be modified).

For instance, if

t = (1,'a',4.5)

then we can obtain the length of t and print its 2nd element using

print( len(t) ) 
print( t[1] ) 

but we cannot change the value of the third element of t or append a new value to t: both commands in the next block of code are illegal:


although the same commands applied to the list x would be legal:

[3, 'a', 1, 5]
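The commands alluded to above are not displayed; a plausible reconstruction, inferred from the list output [3, 'a', 1, 5], is sketched below. The errors are caught with try/except (a construct not covered in this section) so that the block runs to completion.

```python
t = (1, 'a', 4.5)
x = [3, 'a', 5.1]

try:
    t[2] = 1            # tuples do not support item assignment
except TypeError as err:
    print('TypeError:', err)

try:
    t.append(5)         # tuples have no append() method
except AttributeError as err:
    print('AttributeError:', err)

# The same operations are legal on a list
x[2] = 1
x.append(5)
print(x)
```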

If we know the length of a tuple t, we can also use unpacking to extract the individual components, as the following examples illustrate.

t = (1, 'two', 3.0)

fst, snd, trd = t
print( fst, snd, trd )
1 two 3.0

We could use the _ (placeholder) to extract solely the second component, say.

_, s, _ = t 

What do you think is happening below?

days = [(0,"Sun"), (1, "Mon"), (2, "Tue"), (3, "Wed"), (4, "Thu"), (5, "Fri"), (6, "Sat")]

for n, d in days:
    print(d+" is represented by " + str(n))
Sun is represented by 0
Mon is represented by 1
Tue is represented by 2
Wed is represented by 3
Thu is represented by 4
Fri is represented by 5
Sat is represented by 6

List Comprehension

List comprehension is a powerful way to create lists, based on set notation. Before we get into the technical details, let us look at some examples.

We start by importing solely the function sqrt() from the math module (doing so means that we will not require the prefix math. in order to invoke sqrt()); we also declare an index list x:

from math import sqrt 
x = [1, 4, 9, 16]
[1, 4, 9, 16]

We can now build new lists from x, such as the list of the squares of the elements of x:

y = [a**2 for a in x] 
[1, 16, 81, 256]

the list of the square roots of the elements of x greater than 4:

z = [sqrt(b) for b in x if (b > 4)] 
[3.0, 4.0]

or the list of integers from 0 to 9 (equivalent to list(range(10))):

u = [ c for c in range(10) ] 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The most basic form of list comprehension is [f(x) for x in l], where l is a list (or an iterable) and f(x) is an expression in x.

It creates a list obtained by applying f to each element of l.

An optional conditional (we will discuss those shortly) can also be present, giving the general form [f(x) for x in l if g(x)], for some boolean expression g (taking on the values True or False) where generation of the list elements only applies to elements that satisfy the boolean expression.
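To make the general form concrete, the sketch below builds the same list twice: once with a comprehension and once with an explicit loop. The names f, g, and l mirror the notation above; their definitions are arbitrary choices made for illustration.

```python
def f(x):
    return x ** 2        # expression applied to each retained element

def g(x):
    return x % 2 == 0    # boolean condition: keep even values only

l = range(10)

# Comprehension form
squares_of_evens = [f(x) for x in l if g(x)]

# Equivalent explicit loop
result = []
for x in l:
    if g(x):
        result.append(f(x))

print(squares_of_evens)
print(result == squares_of_evens)
```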

Multiple lists or iterables can be specified in list comprehension. For example, the following creates a list of all possible tuples (x,y,z) such that x is True or False, y is from 4 to 6, and z is a string equal to either ‘math’ or ‘stat’.

[(x,y,z) for x in [True, False] for y in range(4,7) for z in ['math','stat']]
[(True, 4, 'math'), (True, 4, 'stat'), (True, 5, 'math'), (True, 5, 'stat'), (True, 6, 'math'), (True, 6, 'stat'), (False, 4, 'math'), (False, 4, 'stat'), (False, 5, 'math'), (False, 5, 'stat'), (False, 6, 'math'), (False, 6, 'stat')]

We can mimic list comprehension with the help of loops (to be discussed shortly), but this approach is more verbose and typically slower. Whenever possible, it is preferable to use the former to generate lists.

List Operations

We illustrate various other operations that can be performed on lists in the blocks below; remember that list elements are zero-indexed (that is, the first element in the list has index 0):

  • sublisting

  • changing values

  • sorting values

  • appending values

  • concatenating lists

  • deleting elements

Consider a given list x:

x = [3,1,7,2,5]
[3, 1, 7, 2, 5]

We can find the length of the list (remember, ordinals start with 0, cardinals with 1):

print( len(x) ) 

or print the sublist from the second element to the fourth element, say:

print( x[1:4] ) 
[1, 7, 2]

We could also modify the second element of the list (index 1), say:

x[1] = 4 
[3, 4, 7, 2, 5]

Note that x is now permanently changed … or at least, until it is modified again; if we want to modify the last entry but we are not sure about the length of the list, for instance, we could use:

x[-1] = 6 
[3, 4, 7, 2, 6]

If we are looking to change the third-to-last element as well, we could use:

x[-3] = 1 
[3, 4, 1, 2, 6]

Finally, we could sort the resulting list:

x.sort()
[1, 2, 3, 4, 6]

A lot of Python methods are applied using the syntax object.method(), in contrast to the typical R syntax that would use method(object); so it is x.sort() instead of sort(x).

Let us create another list, this time with booleans:

y = [3, True, False] 
[3, True, False]

We can append a value, say 5, at the end of this list, as below:

y.append(5)
[3, True, False, 5]

It is also possible to concatenate lists, using the (somewhat confusing) addition notation:

z = x + y
[1, 2, 3, 4, 6, 3, True, False, 5]

and delete the last element of this new list:

del z[-1] # Delete the last element from z
[1, 2, 3, 4, 6, 3, True, False]

or delete a range of elements, say from the 3rd to the 6th, from the resulting list:

del z[2:6] # watch out for the indices
[1, 2, True, False]
  1. Create a list of integers from −10 to 5.

  2. Use list comprehension to create a list of pairs (x,y) such that x+y > 8, where x can be any nonnegative integer at most 10 and y can be any positive integer at most 7.

  3. Use list comprehension to create a list of pairs (x,y) such that y is the square of x and x ranges from 1 to 10.

  4. Write one line of code that returns a list obtained from

x = ['one', 2, 3, 'four', 5, 6, 'seven', 8, 9, 10, 'eleven', 12, 13, 'fourteen']

by moving all the elements of type str to the end of the list. (Hint: Use list comprehension and concatenation. To check if a is of type str, use type(a) is str. To check if a is not of type str, use type(a) is not str.)

Flow Control

We will take a brief look at two ways to alter the flow of control in Python: conditional statements and loops.

Conditional Statements

Python supports if-elif-else statements in various forms.

In the following example, we let x be some random integer between 1 and 12 (using function randint() from module random) and see how the results are affected.

import random
x = random.randint(1,12)

Let us agree to print the string ‘Hello’ if x is less than 5, like so:

if x < 5:
    print('Hello')

Perhaps we want to print ‘Out of range’ if x is less than 5 or greater than 9, and ‘Within range’ otherwise?

if x < 5 or x > 9:
    print('Out of range')
else:
    print('Within range')
Within range

Finally, we might want to print ‘Small’ if x is positive and less than 5; otherwise, print ‘Five’ if x is 5; otherwise, print ‘Six’ if x is 6; otherwise, print ‘+’:

if 0 < x and x < 5:
    print('Small')
elif x == 5:
    print('Five')
elif x == 6:
    print('Six')
else:
    print('+')

Run this sequence of blocks a number of times to see the various outcomes.

Important: Note that the code block that follows an if, else, or elif statement must be properly indented. The custom is to use four spaces for indentation. The following example illustrates the effects of different indentations.

x = 4

if x < 5:
    print('x is less than 5')
else:
    print('This string will not be printed, because the else statement never triggers')
    print('Neither will this, for the same reason')
print('This will be printed no matter what x is, as it falls outside the if-else statement block')
x is less than 5
This will be printed no matter what x is, as it falls outside the if-else statement block

Loops

Loops are useful for repeatedly executing a statement or a block. We first consider the for loop.

Let us start with a simple example: for each value in the list [1,3,8], we print its square.

for i in [1,3,8]:
    print(i**2)

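We could also compute sums with loops, such as 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9:

total = 0                # avoid the name sum, which would shadow the built-in function
for x in range(1,10):
    total += x           # add the value of x to total
print(total)

Or print the first n even nonnegative integers:

n = 5
for k in range(0,n):     # use k so that the loop does not overwrite n
    print(2*k)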

If a for loop is used to create a list, it is probably best to rewrite it using list comprehension. The following time comparison (using %%timeit) illustrates the contrast when building a list of \(100 \times 1000\) items.

Using a loop:

l = []
for i in range(100):
    for j in range(1000):
        l.append((i,j))

Using list comprehension:

l = [ (i,j) for i in range(100) for j in range(1000)]
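The %%timeit magic is only available in IPython and Jupyter. Outside a notebook, a comparable measurement could be made with the standard timeit module, as sketched below. The helper names with_loop and with_comprehension are ours, and the timings will vary from machine to machine.

```python
import timeit

def with_loop():
    pairs = []
    for i in range(100):
        for j in range(1000):
            pairs.append((i, j))
    return pairs

def with_comprehension():
    return [(i, j) for i in range(100) for j in range(1000)]

# Total time (in seconds) for 10 runs of each approach
t_loop = timeit.timeit(with_loop, number=10)
t_comp = timeit.timeit(with_comprehension, number=10)

print(f'loop: {t_loop:.3f}s, comprehension: {t_comp:.3f}s')
```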

While loops are useful for iterating until a certain condition is met. For instance, if we want to print the first 10 even positive integers, separated by a space, we could use the following block:

i = 0
while i < 10:  # Repeat the following block until i reaches 10 or greater
    i += 1     # iterated index
    print(2*i, end=' ')
2 4 6 8 10 12 14 16 18 20 

Or we could print the 26 lowercase letters of the English alphabet on one line, with no separation:

i = 0
while i < 26:
    print(chr(ord('a')+i), end='') 
    i += 1

Note that ord returns the ordinal for a character; chr does the reverse.

  1. Write an if statement that prints odd if x is odd and prints even if x is even where x is a random integer between -100 and 100, inclusive.
import random
x = random.randint(-100,100)

(Hint: x % n returns the remainder of x divided by n).

  2. Use a single while loop to print all pairs (x,y) such that x+y=100 and x ranges from 0 to 50.


Functions

A function is a grouped sequence of code that can be called, such as cos() and print(). A function can have 0 or more arguments: cos() takes one argument, whereas print() accepts any number of objects to display, along with up to four optional keyword arguments (see documentation for details).

Named Functions

Functions facilitate code re-use. Python functions are defined via the def statement. In the example below, we define a function that returns a pair consisting of the sum and the product of its arguments.

def sumprod(x, y):
    return x+y, x*y  

The parentheses around the tuple are optional in this context. The output for \(x=3\) and \(y=4\) can be obtained as below (once the function is defined):

sumprod(3, 4)
(7, 12)

Functions can also have default argument values. In the following example, if the second argument is not supplied, it takes on the value 5.

def myIntegerList(start, end=5):
    return list(range(start, end+1))

Compare the results of the two calls below:

[2, 3, 4, 5]
[7, 8, 9]
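The two calls themselves are not displayed; the outputs above are consistent with the calls sketched below (a reconstruction, assuming that the second argument was supplied as 9 in the second call).

```python
def myIntegerList(start, end=5):
    return list(range(start, end + 1))

print(myIntegerList(2))         # end defaults to 5
print(myIntegerList(7, 9))      # end supplied positionally
print(myIntegerList(7, end=9))  # same call, using the keyword
```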
Anonymous (Lambda) Functions

Another way to define a function is with a lambda statement. This approach is mostly used to define one-line functions.

Anonymous functions are defined using the one-line notation:

lambda variables: output

For instance,

add = lambda u, v: u + v
multiply = lambda u, v: u*v

We can apply a bivariate function func to arguments x and y, in a general context, using:

def applyFunc(func, x, y):
    return func(x,y)

and apply in specific contexts (rule, inputs) as follows:

print(applyFunc(multiply, 3,4))
print(applyFunc(add, 7,20))

But we do not need to define the function prior to the call. This would also work:

print(applyFunc(lambda u, v: u*v, 3,4))
print(applyFunc(lambda u, v: u + v, 7,20))
  1. Write a function myFunc() that returns the square of x if x is of type int and returns None otherwise (hint: type(x) is int is the syntax for testing if x is of type int).
def myFunc(x):
    res = None
    ## Your code here
    return res

Verify that the function behaves as expected:

assert( myFunc(5) == 25 )
assert( myFunc('five') is None )
  2. Write a function mySoS() that accepts a list of floats as the only argument and returns the sum of squares of the numbers (assume that the argument is indeed a list of floats – no need to test if the condition is met).
def mySoS(ns):
    res = 0
    ## Your code here
    return res

Verify that the function behaves as expected:

assert( mySoS([1.0,2.0,3.0]) == 14.0 )
assert( mySoS([-2.5,1.3,13.4]) == 187.5 )
  3. What is the result of the following code?
def mystery(func, n):
    return [ func(i) for i in range(n) ]

print(mystery( lambda x: (2*x+1)**2, 5 ))

Rewrite the function using an anonymous function (a single line of code).


Strings

String (text) manipulation is an important part of data cleaning. Often, the raw data contains string fields that do not quite follow an expected format. For example, proper nouns could be incorrectly capitalized. Dates could have been entered under different conventions. Fortunately, Python offers many tools that make string manipulation rather painless. In this section, we look at some of the commonly performed operations on strings.

Strings can be defined using single or double quotes; note that Python supports unicode strings.

a = 'First string'

b = "Second string"

c = '北京'

print(type(a), type(b), type(c))
<class 'str'> <class 'str'> <class 'str'>

We can use the multiplication syntax to define a string made up of identical copies of another string as illustrated below:

r1 = a*10
r2 = c*3

First stringFirst stringFirst stringFirst stringFirst stringFirst stringFirst stringFirst stringFirst stringFirst string

Strings can be concatenated using the addition syntax:

d = a + c
e = r2 + a + b

First string北京
北京北京北京First stringSecond string

The character in position i (the index) of the string a can be accessed via a[i]. Remember that the first character’s index is 0.

Negative indices can also be used: a[-4] returns the fourth character from the end, say. For instance, we can print the first, seventh, last, and fourth-last characters of a using:

print(a[0], a[6], a[-1], a[-4])
F s g r

We can obtain a substring of a string a using the syntax a[i:j] where i specifies the starting index and j-1 the ending index. Note that a[:j] is equivalent to a[0:j], and a[i:] is the substring starting at index i and reaching until the end of a.
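A minimal sketch of the slicing rules just described, applied to the string a defined earlier (the chosen indices are arbitrary):

```python
a = 'First string'

print(a[0:5])    # characters at indices 0 to 4
print(a[:5])     # same as a[0:5]
print(a[6:])     # from index 6 to the end
print(a[6:9])    # characters at indices 6 to 8
```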


For a string x, x.split() splits the string into a list of words separated by a space (by default). Note that a contiguous sequence of space characters including newline (\n), carriage return (\r), and tab (\t) is considered as one space.

We can also specify what separating characters to use for the splitting, instead of spaces. For example, x.split(',') splits x on commas and x.split('--') splits it on --.

Consider the examples below:

print('This is  a  \n\n   long   sentence with  \r \t weird spaces separating the words.'.split())
['This', 'is', 'a', 'long', 'sentence', 'with', 'weird', 'spaces', 'separating', 'the', 'words.']
print('One,two, three ,four'.split(',')) # Note that ` three ` is one of the words after separation.
['One', 'two', ' three ', 'four']
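print('Five--six--ninety-four'.split('--')) # Note that the single hyphen in `ninety-four` is not a separator.
['Five', 'six', 'ninety-four']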

In some cases, it is helpful to remove leading and trailing space characters (whitespace stripping).

s = '  time   '
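Whitespace stripping could be carried out with the string methods strip(), lstrip(), and rstrip(), as in the sketch below; repr() is used only to make the remaining spaces visible.

```python
s = '  time   '

print(repr(s.strip()))    # remove leading and trailing whitespace
print(repr(s.lstrip()))   # remove leading whitespace only
print(repr(s.rstrip()))   # remove trailing whitespace only
```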

It is common to combine strip() with split(','):

cs = 'One   , two,  three  '
print([s.strip() for s in cs.split(',')])
['One', 'two', 'three']

In fact, the strip() method can accept a string consisting of all characters to be stripped from another string, in any combination. For instance, we can strip any leading and trailing characters contained in ['&','#','-','.','!'] from any string as follows:

tostrip = '&#-.!'
t = '###.Hel#lo!?!&-'
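The stripping call itself is not displayed; presumably it passes tostrip as the argument of strip(), as in the sketch below (the name stripped is ours). Stripping stops at the first character, from either end, that is not in tostrip, so interior characters are left untouched.

```python
tostrip = '&#-.!'
t = '###.Hel#lo!?!&-'

stripped = t.strip(tostrip)
print(stripped)
```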


The methods upper(), lower(), and title() are useful for altering the case of characters in a string. The following examples showcase their functionality.

x = "gArbagE collECtion"

garbage collection
Garbage Collection
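The two outputs shown above correspond to lower() and title(); the calls, and the output of upper(), are not displayed. A minimal sketch of all three methods:

```python
x = "gArbagE collECtion"

print(x.upper())   # all characters in upper case
print(x.lower())   # all characters in lower case
print(x.title())   # first letter of each word capitalized
```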

The following example illustrates a function that takes a phrase and turns it into an acronym by concatenating the first letters of the words and capitalizing all the letters. Does the code make sense?

def acronymize(phrase):
    a = ''                   # start with the empty string
    for w in phrase.split(): # iterate through words in the phrase
        a += w[0]            # pick the first letter of the words and concatenate
    return a.upper()         # capitalize and return
acronymize("Be right back"), acronymize("Your mileage might vary")
('BRB', 'YMMV')

It can also be useful to convert a string representing a number to a number type, and vice versa. The following examples illustrate how these tasks can be achieved.

number = 12.345

s = str(number)
print( s, type(s))

f = float(s)
print(f, type(f))

i = int('345')
print(i, type(i))
12.345 <class 'str'>
12.345 <class 'float'>
345 <class 'int'>

We can also check if a string t is a substring of another string s via t in s (pattern matching).

t1 = "is"
t2 = "has"

s = "This is my car."

print(t1 in s)
print(t2 in s)

If we want to obtain the index at which a substring begins, we can use the find() method. If the substring is not found, -1 is returned.
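As a short illustration of find(), reusing the string s and the substrings from the previous example:

```python
s = "This is my car."

print(s.find("is"))    # index at which the first occurrence begins
print(s.find("has"))   # the substring is absent
```

Note that the first occurrence of "is" lies inside the word "This".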


We shall revisit Python strings when we discuss Natural Language Processing.

  1. Complete the definition of the function myRep() with arguments x, y, and n (where x and y can be assumed to be strings and n can be assumed to be a nonnegative integer) that returns the string x+y repeated n times.
def myRep(x, y, n):
    res = ''
    # Your code here
    return res

Verify that the function behaves as expected:

assert(myRep('a','b',3) == 'ababab')
assert(myRep('Python','C',0) == '')
  2. Complete the definition of the function posOfi() with argument s and returns a list of indices at which s contains the letter ‘i’ (hint: use the enumerate function).
def posOfi(s):
    # Your code here
    return None

Verify that the function behaves as expected:

print(posOfi("Harry Potter"))
  3. Complete the following function which takes a string consisting of a paragraph of sentences ending with a period and returns a list of all the sentences, with leading and trailing spaces stripped. You may assume that every period ends a proper sentence and there are no sentences not ending in a period.
def sentences(p):
    # Your code here
    return None

Verify that the function behaves as expected:

p = 'The essence of Python.  One can sense. But not learn. '
  4. What effect do the methods upper(), lower(), and title() have on non-alphabetical characters?

  5. Complete the following function which takes a list of full names as argument and returns a list of names that are not properly capitalized. For example, for the argument ['John Doe', 'JANE Kelly', 'nicole dunn', 'David Huang'], the function returns ['JANE Kelly', 'nicole dunn'].

def badNames(names):
    # Your code here
    return None
  6. Complete the following function which takes a list l of strings as argument and returns a list consisting of the strings in l not containing the symbol -. For example, given the argument ['Hi', 'Good-bye', 'Ciao', 'Twenty-one'], the function should return ['Hi', 'Ciao'].
def filterList(l):
    # Your code here
    return None


Dictionaries

A dictionary is a data structure for key-value pairs (k:v). To define a dictionary, simply list the key-value pairs enclosed within braces ({,}), as shown in the following examples.

The simplest dictionary is the one that is empty:

d = {}  # This creates an empty dictionary

<class 'dict'>

A more interesting dictionary could be the one below:

days = { 'Sun': 1, 'Mon': 2, 'Tue':3, 'Wed':4, 'Thu':5, 'Fri':6, 'Sat':7 }

<class 'dict'>

We can access the value for key k in dictionary d via d[k]. Note that an exception will be raised if d does not contain the key k.

We can check if a key k is in a dictionary d via k in d.


print('Aug' in days)
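The sketch below illustrates key access, the membership test, and the exception raised for a missing key, using the days dictionary defined above. The try/except construct is not covered in this section; it is used here only so that the block runs to completion.

```python
days = {'Sun': 1, 'Mon': 2, 'Tue': 3, 'Wed': 4, 'Thu': 5, 'Fri': 6, 'Sat': 7}

print(days['Tue'])       # value associated with the key 'Tue'
print('Tue' in days)     # membership test on keys

try:
    days['Aug']          # 'Aug' is not a key
except KeyError as err:
    print('KeyError:', err)
```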

We can add a new key-value pair k:v to a dictionary d via d[k] = v.

d[1] = (1, 2)
d[2]= 3.45
d['three']= 'string'

{1: (1, 2), 2: 3.45, 'three': 'string'}

Conversely, we can delete key k and its associated value from dictionary d via del d[k].

del d[2]

{1: (1, 2), 'three': 'string'}

We can also iterate over the keys in a dictionary using a for loop.

for key in d:
    print(type(key), type(d[key]))
<class 'int'> <class 'tuple'>
<class 'str'> <class 'str'>

The following code gives the same output

for key, value in d.items():
    print(type(key), type(value))
<class 'int'> <class 'tuple'>
<class 'str'> <class 'str'>
  1. Complete the following function which takes a list of pairs as argument and returns a dictionary with the first components as keys and the second components as the corresponding values. For example, given the argument [(1,'a'),(2,'b')], the function returns {1: 'a', 2: 'b'}.
def pairListToDict(pairs):
    # Your code here
    return None
  2. Complete the following function which takes a dictionary as argument and removes all the key-value pairs that do not have values of type str. For example, calling the function with the dictionary {'one': 1, 'two': 'Two', 'three': 3} will change the dictionary to {'two': 'Two'}.
def filter(d):
    # Your code here

1.5.3 NumPy and Arrays

NumPy is a Python module that supports numerical computation on multi-dimensional arrays. It comes with many useful mathematical functions.

It is the backbone of the scientific computing library SciPy and of the data analysis and manipulation library pandas. Even though it is possible to do basic statistical analysis using a comprehensive statistics package without direct manipulation of NumPy arrays, knowledge of NumPy is essential for performing custom operations.

In this section, we get a taste of NumPy arrays of dimension at most two. What is covered only scratches the surface of this powerful library. A handy cheat sheet can be found here.

It is customary to use the alias np when importing the module.

import numpy as np


Unlike lists, NumPy arrays cannot contain elements of different types. There are various ways to create such arrays.

We can create a 1D array from a list:

x = np.array([1,2,3,4]) 


shape is the attribute that holds the array’s dimensions. We can create a 2D array from a list of lists:

y = np.array([[1,2,3],[4,5,6]]) 

(2, 3)

If some of the elements are not of the “right” type, they are converted automatically:

c = np.array(['n','u','m',15]) 

['n' 'u' 'm' '15']

We can also define a NumPy array out of a range using the arange() function; the call

np.arange(1,5)
array([1, 2, 3, 4])

yields the same result as np.array([1,2,3,4]), but it is more efficient, from a computational perspective.

We can also obtain special arrays, composed of zeros, or composed of ones, with the functions zeros() and ones(). Here is a 3x4 2D array of 0s:

z = np.zeros([3,4]) # A 3-by-4 array of 0's
(3, 4)

and a 2x3x4 3D array of 1s:

f = np.ones([2,3,4]) # A 2x3x4 3D array of 1's

Note the difference between the shape and ndim attributes: the former gives the actual dimensions (number of rows, columns, etc.), the latter, the number of dimensions (axes).
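A brief sketch of the distinction, using the arrays z and f defined above:

```python
import numpy as np

z = np.zeros([3, 4])
f = np.ones([2, 3, 4])

print(z.shape, z.ndim)   # dimensions, then number of axes
print(f.shape, f.ndim)
```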

We can also define NumPy arrays containing random values; for instance, here is a 1D array of 10 random values sampled from the standard normal distribution, using the function random.normal():

r = np.random.normal(size=10)
[-1.10501533 -0.69929125 -0.00882625  1.12738611  0.60354054  1.50509863
  1.07440466 -0.86260135  1.12680367 -0.01988042]


Adding and subtracting NumPy arrays of the same dimensions works as we would expect. Using x and y as above, and w as below, we get:

w = np.array([-1,-2,-3,-4])
[0 0 0 0]
[2 4 6 8]
[[ 2  4  6]
 [ 8 10 12]]
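The expressions that produce the three outputs above are not displayed; they are consistent with x + w, x - w, and y + y, as in the following sketch (a reconstruction inferred from the printed values).

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([[1, 2, 3], [4, 5, 6]])
w = np.array([-1, -2, -3, -4])

print(x + w)   # component-wise sum
print(x - w)   # component-wise difference
print(y + y)   # works for 2D arrays of the same shape as well
```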

Multiplication by a scalar also works as expected:

[2 4 6 8]

However, note that multiplication and division via * and / (resp.) are applied component-wise:

[ -1  -4  -9 -16]

as is exponentiation:

[[  1   8  27]
 [ 64 125 216]]
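The three outputs above are consistent with the expressions 2*x, x*w, and y**3; the sketch below is a reconstruction inferred from the printed values, not necessarily the original code.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([[1, 2, 3], [4, 5, 6]])
w = np.array([-1, -2, -3, -4])

print(2 * x)     # multiplication by a scalar
print(x * w)     # component-wise product (not a dot product)
print(y ** 3)    # component-wise exponentiation
```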

Broadcasting allows addition and subtraction to be performed between arrays that do not have the same shape. There are rules governing when such operations are valid and what the effects are. Here, we provide two simple examples:

x + 3.5
array([4.5, 5.5, 6.5, 7.5])
y - 1
array([[0, 1, 2],
       [3, 4, 5]])

Can you determine what broadcasting does from these examples?

Math Functions

NumPy contains some useful methods mapping arrays to a scalar.

For instance, sum adds up the elements in the array.


(the same result could have been obtained with np.sum(x)).

The usual statistical descriptions are also available as methods:

1.118033988749895 1.25 2.5
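The calls behind the sum and the three statistics above are not displayed. With x = [1, 2, 3, 4], the printed values match the standard deviation, variance, and mean, in that order; the sketch below is a reconstruction on that basis. Note that std() and var() compute the population versions by default (ddof=0).

```python
import numpy as np

x = np.array([1, 2, 3, 4])

print(x.sum())                       # same as np.sum(x)
print(x.std(), x.var(), x.mean())    # standard deviation, variance, mean
```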

NumPy also has a collection of mathematical functions that can be applied component-wise, such as abs() and exp():

[1.10501533 0.69929125 0.00882625 1.12738611 0.60354054 1.50509863
 1.07440466 0.86260135 1.12680367 0.01988042]
[[  2.71828183   7.3890561   20.08553692]
 [ 54.59815003 148.4131591  403.42879349]]
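The second output above matches np.exp(y) for the array y defined earlier. Since the random array r cannot be reproduced here, the sketch below applies np.abs() to a small hypothetical array v instead.

```python
import numpy as np

v = np.array([-1.5, 0.25, -3.0])       # hypothetical values, for illustration only
y = np.array([[1, 2, 3], [4, 5, 6]])

print(np.abs(v))   # component-wise absolute value
print(np.exp(y))   # component-wise exponential
```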

NumPy functions are more efficient when it comes to array computations; they should be used whenever possible.

Logic Operations

Operations over arrays of boolean values can also be performed efficiently in NumPy.

Let us create a boolean array bx of the same shape as x, with bx[i] = True if and only if x[i] >= 2.5, and a boolean array by of the same shape as y, with by[i,j] = True if and only if y[i,j] >= 3.5.

bx = x >= 2.5  
by = y >= 3.5

[False False  True  True]
[[False False False]
 [ True  True  True]]

Comparison of two NumPy arrays of the same shape results in a boolean array, yet again of the same shape. Note that comparison is performed component-wise:

x2 = np.array([2,1,3,0])

print(x == x2)
[False False  True False]

Comparisons use the symbols ==, !=, <, <=, >, and >=:

print(x > x2)
[False  True False  True]

We can perform boolean operations (AND, OR, NEG) on boolean arrays:

b = np.array([True, False, True, True])

AND is computed using &:

b & bx
array([False, False,  True,  True])

OR with |:

b | bx
array([ True, False,  True,  True])

NEG with ~:

~b
array([False,  True, False, False])

We can also sum over the values of a boolean array (in this case, True is interpreted as 1 and False as 0):
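For example, summing the boolean array bx defined earlier counts the entries of x that are at least 2.5 (a minimal sketch):

```python
import numpy as np

x = np.array([1, 2, 3, 4])
bx = x >= 2.5

print(bx.sum())     # number of True entries
print(np.sum(bx))   # equivalent function call
```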

  1. Complete the following code so that sq is a 1D numpy array of the squares of the first 100 positive integers. Use list comprehension.
sq = np.array([...])
  2. Obtain a NumPy array from the array sq in the previous exercise by applying the function \(\sqrt{x}+1\) to each entry x in sq (hint: use broadcasting and np.sqrt()).

  3. Complete the following definition of myFunc() which takes a positive integer argument n and a positive real number d and generates an array of n random values drawn from the standard normal distribution and returns the number of values whose absolute values are less than or equal to d.

You may assume that n is a positive integer and d is a nonnegative float when myFunc() is called (hint: use numpy.random.randn() for generating the random array).

def myFunc(n, d):
    # Your code here
    return 0

Verify that the function behaves as expected:

assert(myFunc(10000,1) == 6848)
assert(myFunc(100000,2) == 95490)