3.1 Basic Notions

Probability theory is the mathematical discipline relating to the numerical description of the likelihood of an event.

3.1.1 Sample Spaces and Events

Throughout, we will deal with random experiments (e.g., measurements of speed/ weight, number and duration of phone calls, etc.).

For any “experiment,” the sample space is defined as the set of all its possible outcomes, often denoted by the symbol \({\cal S}\). A sample space can be discrete or continuous. An event is a collection of outcomes from the sample space \({\cal S}\). Events will be denoted by \(A\), \(B\), \(E_1\), \(E_2\), etc.

  • Toss a fair coin – the corresponding (discrete) sample space is \({\cal S} =\{\mbox{Head, Tail}\}\).

  • Roll a die – the corresponding (discrete) sample space is \({\cal S}=\left\{ 1,2,3,4,5,6 \right\}\), with various events represented by

    • rolling an even number: \(\left\{ 2,4,6 \right\}\);

    • rolling a prime number: \(\left\{ 2,3,5 \right\}\).

  • Suppose we measure the weight (in grams) of a chemical sample – the (continuous) sample space can be represented by \(\mathcal{S}=(0,\infty)\), the positive half line, and various events by subsets of \(\mathcal{S}\), such as

    • sample is less than 1.5 grams: \((0,1.5)\);

    • sample exceeds 5 grams: \((5,\infty)\).

For any events \(A,B\subseteq {\cal S}\):

  • the union \(A\cup B\) of \(A\) and \(B\) are all outcomes in \({\cal S}\) contained in either \(A\) or \(B\);

  • the intersection \(A\cap B\) of \(A\) and \(B\) are all outcomes in \({\cal S}\) contained in both \(A\) and \(B\);

  • the complement \(A^c\) of \(A\) (sometimes denoted \(\overline{A}\) or \(-A\)) is the set of all outcomes in \({\cal S}\) that are not in \(A\).

If \(A\) and \(B\) have no outcomes in common, they are mutually exclusive; which is denoted by $AB=$ (the empty set). In particular, \(A\) and \(A^{c}\) are always mutually exclusive.20

  • Roll a die and let \(A=\left\{ 2,3,5 \right\}\) (a prime number) and \(B=\left\{ 3,6 \right\}\) (multiples of \(3\)). Then \(A\cup B=\left\{ 2,3,5,6 \right\}\), \(A\cap B=\left\{3 \right\}\) and \(A^{c}=\left\{ 1,4,6 \right\}\).

  • \(100\) plastic samples are analyzed for scratch and shock resistance.

    shock resistance
    high low
    scratch high \({70}\) \(4\)
    resistance low \(1\) \(25\)

    If \(A\) is the event that a sample has high shock resistance and \(B\) is the event that a sample has high scratch residence, then \(A\cap B\) consists of \(70\) samples.

3.1.2 Counting Techniques

A two-stage procedure can be modeled as having \(k\) bags, with \(m_1\) items in the first bag, …, \(m_k\) items in \(k\)-th bag.

The first stage consists of picking a bag, and the second stage consists of drawing an item out of that bag. This is equivalent to picking one of the \(m_1+\cdots+m_k\) total items.

If all the bags have the same number of items \[m_1=\cdots=m_k=n,\] then there are \(kn\) items in total, and this is the total number of ways the two-stage procedure can occur.

  • How many ways are there to first roll a die and then draw a card from a (shuffled) \(52-\)card pack?

    Answer: there are \(6\) ways the first step can turn out, and for each of these (the stages are independent, in fact) there are \(52\) ways to draw the card. Thus there are \(6\times52=312\) ways this can turn out.

  • How many ways are there to draw two tickets numbered \(1\) to \(100\) from a bag, the first with the right hand and the second with the left hand?

    Answer: There are \(100\) ways to pick the first number; for each of these there are \(99\) ways to pick the second number. Thus \(100\times 99=9900\) ways.

Multi-Stage Procedures

A \(k\)-stage process is a process for which:

  • there are \(n_1\) possibilities at stage 1;

  • regardless of the 1st outcome there are \(n_2\) possibilities at stage 2,

  • regardless of the previous outcomes, there are \(n_k\) choices at stage \(k\).

There are then \[n_1 \times n_2\cdots\times n_k\] total ways the process can turn out.

3.1.3 Ordered Samples

Suppose we have a bag of \(n\) billiard balls numbered \(1,\ldots,n\). We can draw an ** ordered sample** of size \(r\) by picking balls from the bag:

  • with replacement, or

  • without replacement.

With how many different collection of \(r\) balls can we end up in each of those cases (each is an \(r\)-stage procedure)?

Key Notion: all the object (balls) can be differentiated (using numbers, colours, etc.)

Sampling With Replacement (Order Important)

If we replace each ball into the bag after it is picked, then every draw is the same (there are \(n\) ways it can turn out).

According to our earlier result, there are \[\underbrace{\ n\times n \times \cdots \times n\ }_{\text{$r$ stages}}= n^r\] ways to select an ordered sample of size \(r\) with replacement from a set with \(n\) objects \(\left\{ 1,2,\ldots,n \right\}\).

Sampling Without Replacement (Order Important)

If we do not replace each ball into the bag after it is drawn, then the choices for the second draw depend on the result of the first draw, and there are only \(n-1\) possible outcomes.

Whatever the first two draws were, there are \(n-2\) ways to draw the third ball, and so on.

Thus there are \[\underbrace{n\times (n-1)\times\cdots\times (n-r+1)}_{\text{$r$ stages}} = {_nP_r}\ \ \text{ (common symbol)}\] ways to select an ordered sample of size \(r\le n\) without replacement from a set of \(n\) objects \(\left\{ 1,2,\ldots,n \right\}\).

Factorial Notation

For a positive integer \(n\), write \[n!=n(n-1)(n-2)\cdots1.\] There are two possibilities:

  • when \(r=n\), \(_nP_r=n!\), and the ordered selection (without replacement) is called a permutation;

  • when \(r<n\), we can write \[\begin{aligned} _nP_r&= \frac{n(n-1)\cdots(n-r+1)}{} \frac {(n-r)\cdots 1} {(n-r)\cdots 1} \\&= \frac{n!}{(n-r)!}=n\times \cdots \times (n-r+1).\end{aligned}\]

By convention, we set \(0!=1\), so that \[_nP_r=\frac{n!}{(n-r)!},\quad \text{for all $r\le n$.}\]

  • In how many different ways can 6 balls be drawn in order without replacement from a bag of balls numbered 1 to 49?

    Answer: We compute \[_{49}P_6=49\times 48\times 47\times 46\times 45\times 44 = 10,068,347,520.\] This is the number of ways the actual drawing of the balls can occur for Lotto 6/49 in real-time (balls drawn one by one).

  • How many 6-digits PIN codes can you create from the set of digits \(\{0,1,\ldots,9\}\)?

    Answer: If the digits may be repeated, we see that \[10\times 10\times 10\times 10\times 10\times 10=10^6=1,000,000.\] If the digits may not be repeated, we have instead \[_{10}P_6=10\times 9\times 8\times 7\times 6\times 5=151,200.\]

3.1.4 Unordered Samples

Suppose now that we cannot distinguish between different ordered samples; when we look up the Lotto 6/49 results in the newspaper, for instance, we have no way of knowing the order in which the balls were drawn: \[1-2-3-4-5-6\] could mean that the first drawn ball was ball # \(1\), the second drawn ball was ball # \(2\), etc., but it could also mean that the first ball drawn was ball # \(4\), the second one, ball # \(3\), etc., or any other combinations of the first 6 balls.

Denote the (as yet unknown) number of unordered samples of size \(r\) from a set of size \(n\) by \(_nC_r\). We can derive the expression for \(_nC_r\) by noting that the following two processes are equivalent:

  • take an ordered sample of size \(r\) (there are \(_nP_r\) ways to do this);

  • take an unordered sample of size \(r\) (there are \(_nC_r\) ways to do this) and then rearrange (permute) the objects in the sample (there are \(r!\) ways to do this).

Thus \[\begin{aligned} _nP_r = {_nC_r}\times r!&\implies& _nC_r = \frac{ _nP_r} {r!} = \frac{n!}{(n-r)!\ r!} = \binom n r\,.\end{aligned}\] This last notation is called a binomial coefficient (read as “\(n\)-choose-\(r\)”) and is commonly used in textbooks.

Example: in how many ways can the “Lotto 6/49 draw” be reported in the newspaper (if they are always reported in increasing order)?

Answer: this number is the same as the number of unordered samples of size 6 (different re-orderings of same 6 numbers are indistinguishable), so \[\begin{aligned} _{49}C_6&=\binom {49}6 =\frac{49\times 48\times 47\times 46\times 45\times 44} {6\times 5\times4\times3\times2\times 1} \\ &= \frac{ 10,068,347,520 }{720} = 13,983,816\,.\end{aligned}\] There exists a variety of binomial coefficient identities, such as \[\begin{aligned} \binom{n}{k}&=\binom{n}{n-k},\quad \text{for all $0\leq k\leq n$}, \\ \sum_{k=0}^n\binom{n}{k}&=2^n, \quad \text{for all $0\leq n$}, \\ \binom{n+1}{k+1}&=\binom{n}{k}+\binom{n}{k+1},\quad \text{for all $0\leq k\leq n-1$} \\ \sum_{j=k}^n\binom{j}{k}&=\binom{n+1}{k+1}, \quad \text{for all $0\leq n$, etc.}.\end{aligned}\]

3.1.5 Probability of an Event

For situations where we have a random experiment which has exactly \(N\) possible mutually exclusive, equally likely outcomes, we can assign a probability to an event \(A\) by counting the number of outcomes that correspond to \(A\) – its relative frequency.

If that count is \(a\), then \[P(A) = \frac{a}{N}.\] The probability of each individual outcome is thus \(1/N\).


  • Toss a fair coin – the sample space is \({\cal S} = \{\mbox{Head, Tail}\}\), i.e., \(N=2\). The probability of observing a Head on a toss is thus \(\frac{1}{2}.\)

  • Throw a fair six sided die. There are \(N=6\) possible outcomes. The sample space is \[{\cal S} = \{1,2,3,4,5,6\}.\] If \(A\) corresponds to observing a multiple of \(3\), then \(A=\{3,6\}\) and \(a=2\), so that \[\mbox{Prob(number is a multiple of $3$)} =P(A) =\frac{2}{6} =\frac{1}{3}.\]

  • The probabilities of seeing an even/odd number are: \[\begin{aligned} \text{Prob$\left\{ \text{even} \right\}$}&=P\left(\left\{ 2,4,6 \right\}\right)=\frac{3}{6}=\frac{1}{2}; \\ \text{Prob$\left\{ \text{prime} \right\}$} &=P\left(\left\{ 2,3,5 \right\}\right)=1-P\left(\left\{ 1,4,6 \right\}\right)=\frac{1}{2}.\end{aligned}\]

  • In a group of \(1000\) people it is known that \(545\) have high blood pressure. \(1\) person is selected randomly. What is the probability that this person has high blood pressure?

    Answer: the relative frequency of people with high blood pressure is \(0.545\).

This approach to probability is called the frequentist interpretation. It is based on the idea that the theoretical probability of an event is given by the behaviour of the empirical (observed) relative frequency of the event over long-run repeatable and independent experiments (i.e.when \(N\to \infty\)).

This is the classical definition, and the one used in this document, but there are competing interpretations which may be more appropriate depending on the context; chiefly, the Bayesian interpretation (see [26], [27] for details) and the propensity interpretation (introducing causality as a mechanism).

Axioms of Probability

The modern definition of probability is axiomatic (according to Kolmogorov’s seminal work [28]).

The probability of an event \(A\subseteq \mathcal{S}\) is a numerical value satisfying the following properties:

  1. for any event \(A\), \(1\ge P(A) \ge 0\);

  2. for the complete sample space \({\cal S}\), \(P({\cal S}) = 1\);

  3. for the empty event $$, \(P(\varnothing )=0\), and

  4. for two mutually exclusive events \(A\) and \(B\), the probability that \(A\) or \(B\) occurs is \(P(A\cup B)=P(A) + P(B).\)

Since \({\cal S}=A\cup A^c\), and \(A\) and \(A^c\) are mutually exclusive, then \[\begin{aligned} 1 &\stackrel{\textbf{A2}}= P\left( {\cal S} \right) =P\left( A\cup A^c \right) \stackrel{\textbf{A4}}=P(A)+P\left( A^c \right) \\ &\implies {P(A^c)=1-P(A)}.\end{aligned}\]


  • Throw a single six sided die and record the number that is shown. Let \(A\) and \(B\) be the events that the number is a multiple of or smaller than \(3\), respectively. Then \(A = \{3,6\}\), \(B = \{1,2\}\) and \(A\) and \(B\) are mutually exclusive since $AB=$. Then \[P(A \mbox{ or } B) = P(A\cup B)=P(A) + P(B) = \frac{2}{6}+\frac{2}{6} = \frac{2}{3}.\]

  • An urn contains \(4\) white balls, \(3\) red balls and \(1\) black ball. Draw one ball, and denote the following events by \(W =\{\mbox{the ball is white}\}\), \(R =\{\mbox{the ball is red}\}\) and \(B =\{\mbox{the ball is black}\}\). Then \[P(W)=1/2,\quad P(R)=3/8,\quad P(B)=1/8,\] and \(P(W\mbox{ or } R)=7/8.\)

General Addition Rule

This useful rule is a direct consquence of the axioms of probability: \[P(A\cup B) = P(A)+P(B)-P(A\cap B).\]

Example: an electronic gadget consists of two components, \(A\) and \(B\). We know from experience that \(P( \text{$A$ fails})= 0.2\), \(P( \text{$B$ fails} )= 0.3\) and \(P( \text{both $A$ and $B$ fail} )= 0.15\). Find \(P( \text{at least one of $A$ and $B$ fails} )\) and \(P( \text{neither $A$ nor $B$ fails} )\).

Answer: write \(A\) for “\(A\) fails” and similarly for \(B\). Then we are looking to compute \[\begin{aligned} P( \text{at least one fails} ) &= P(A\cup B) \\ &= P(A)+P(B)-P(A\cap B) = 0.35\,;\\ P( \text{neither fail})&= 1-P( \text{at least one fails} )=0.65\,.\end{aligned}\] If \(A\), \(B\) are mutually exclusive, \(P(A\cap B) =P(\varnothing )=0\) and \[P(A\cup B)=P(A)+P(B)-P(A\cap B)=P(A)+P(B).\] With three events, the addition rule expands as follows: \[\begin{aligned} P(A\cup B\cup C)=& P(A)+P(B)+P(C)\\ & \quad -P(A\cap B)-P(A\cap C)-P(B\cap C) \\ & \quad +P(A\cap B\cap C).\end{aligned}\]

3.1.6 Conditional Probability and Independent Events

Any two events \(A\) and \(B\) satisfying \[P\left( A\cap B \right) = P(A)\times P(B)\] are said to be independent.21 When events are not independent, we say that they are dependent or conditional.

Mutual exclusivity and independence are unrelated concepts. The only way for events \(A\) and \(B\) to be mutually exclusive and independent is for either \(A\) or \(B\) (or both) to be a non-event (the empty event): \[\begin{aligned} \varnothing &=P(A \cap B)=P(A)\times P(B) \implies P(A)=0\text{ or }P(B)=0 \\ & \implies A=\varnothing \text{ or }B=\varnothing .\end{aligned}\]


  • Flip a fair coin twice – the 4 possible outcomes are all equally likely: \(\mathcal{S}=\{HH,HT,TH,TT\}\). Let \[A=\{HH\}\cup \{HT\}\] denote “head on first flip”, \(B=\{HH\}\cup \{TH\}\) “head on second flip”. Note that \(A\cup B\neq \mathcal{S}\) and \(A\cap B=\{HH\}\). By the general addition rule, \[\begin{aligned} P\left( A \right) & = P(\{HH\})+P(\{HT\})-P(\{HH\}\cap \{HT\}) \\ &= \frac14+\frac14-P(\varnothing ) = \frac{1}{2}-0 =\frac12\,.\end{aligned}\] Similarly, \(P\left( B \right) = P(\{HH\})+P(\{TH\}) = \frac12\), and so \(P(A)P(B)=\frac14\). But \(P(A\cap B) = P(\{HH\})\) is also \(\frac14\), so \(A\) and \(B\) are independent.

  • A card is drawn from a regular well-shuffled 52-card North American deck. Let \(A\) be the event that it is an ace and \(D\) be the event that it is a diamond. These two events are independent. Indeed, there are \(4\) aces \[P(A)=\frac{4}{52}=\frac{1}{13}\] and \(13\) diamonds \[P(D)=\frac{13}{52}=\frac{1}{4}\] in such a deck, so that \[P(A)P(D)=\frac{1}{13}\times\frac{1}{4}=\frac{1}{52}\,,\] and exactly \(1\) ace of diamonds in the deck, so that \(P(A \cap D)\) is also \(\frac{1}{52}\).

  • A six-sided die numbered \(1-6\) is loaded in such a way that the probability of rolling each value is proportional to that value. Find \(P(3)\).

    Answer: Let \(\mathcal{S}=\{1,2,3,4,5,6\}\) be the value showing after a single toss; for some proportional constant \(v\), we have \(P(k)=kv\), for \(k\in \mathcal{S}\). By Axiom \(\textbf{A2}\), \(P(\mathcal{S})=P(1)+\cdots+P(6)=1\), so that \[1=\sum_{k=1}^6P(k)=\sum_{k=1}^6kv=v\sum_{k=1}^6k=v\frac{(6+1)(6)}{2}=21v\,.\] Hence \(v=1/21\) and \(P(3)=3v=3/21=1/7\).

  • Now the die is rolled twice, the second toss independent of the first. Find \(P(3_1,3_2)\).

    Answer: the experiment is such that \(P(3_1)=1/7\) and \(P(3_2)=1/7\), as seen in the previous example. Since the die tosses are independent,22 then \[P\left(3_1\cap3_2 \right) =P(3_1)P(3_2)=1/49\,.\]

  • Is a 2-engine plane more likely to be forced down than a 3-engine plane?

    Answer: this question is easier to answer if we assume that engines fail independently (this is no doubt convenient, but the jury is still out as to whether it is realistic). In what follows, let \(p\) be the probability that an engine fails.23

    The next step is to decide what type engine failure will force a plane down:24

    • A 2-engine plane will be forced down if both engines fail – the probability is \(p^2\);

    • A 3-engine plane will be forced down if any pair of engines fail, or if all 3 fail.

      • Pair: the probability that exactly 1 pair of engines will fail independently (i.e., two engines fail and one does not) is \[p\times p\times (1 - p).\] The order in which the engines fail does not matter: there are \(_3C_2=\frac{3!}{2!1!}=3\) ways in which a pair of engines can fail: for 3 engines A, B, C, these are AB, AC, BC.

      • All 3: the probability of all three engines failing independently is \(p^3\).

      The probability \(\geq 2\) engines failing is thus \[P(2+ \text{ engines fail})=3p^2(1 - p) + p^3 = 3p^2 - 2p^3.\]

    Basically it’s safer to use a 2-engine plane than a 3-engine plane: the 3-engine plane will be forced down more often, assuming it needs \(2\) engines to fly.

    This “makes sense”: the 2-engine plane need 50% of its engines working, while the 3-engine plane needs 66% (see Figure 3.1 to get a sense of what the probabilities are for \(0\leq p\leq 1\)).

Failure probability for the 2-engine and 3-engine planes.

Figure 3.1: Failure probability for the 2-engine and 3-engine planes.

  • (Taken from [29]) Air traffic control is a safety-related activity – each piece of equipment is designed to the highest safety standards and in many cases duplicate equipment is provided so that if one item fails another takes over.

    A new system is to be provided passing information from Heathrow Airport to Terminal Control at West Drayton. As part of the system design a decision has to be made as to whether it is necessary to provide duplication. The new system takes data from the Ground Movements Radar (GMR) at Heathrow, combines this with data from the National Airspace System NAS, and sends the output to a display at Terminal Control (a conceptual model is shown in Figure 3.2).

    Conceptual model of air traffic control security system.

    Figure 3.2: Conceptual model of air traffic control security system.

    For all existing systems, records of failure are kept and an experimental probability of failure is calculated annually using the previous 4 years.

    The reliability of a system is defined as \(R=1 - P\), where \(P=P(\text{failure})\).

    \(R_{\text{GMR}} = R_{\text{NAS}} = 0.9999\) (i.e., \(1\) failure in \(10,000\) hours).

    the components’ failure probabilities are independent.

    For the system above, if a single module is introduced the reliability of the system (STD – single thread design) is \[R_{\text{STD}}=R_{\text{GMR}} \times R_{\text{NEW}} \times R_{\text{NAS}}.\]

    If the module is duplicated, the reliability of this dual thread design is \[R_{\text{DTD}}=R_{\text{GMR}} \times (1 - (1 - R_{\text{NEW}} )^2) \times R_{\text{NAS}}.\] Duplicating the module causes an improvement in reliability of \[\rho=\frac{R_{\text{DTD}}}{R_{\text{STD}}}=\frac{(1 - (1 - R_{\text{NEW}} )^2)}{R_{\text{NEW}}} \times 100\%\,.\]

    For the module, no historical data is available. Instead, we work out the improvement achieved by using the dual thread design for various values of \(R_{\text{NEW}}\).

    \(R_{\text{NEW}}\) 0.1 0.2 0.5 0.75 0.99 0.999 0.9999 0.99999
    \(\rho\) (%) 190 180 150 125 101 100.1 100.01 100.001

    If the module is very unreliable (i.e., \(R_{\text{NEW}}\) is small), then there is a significant benefit in using the dual thread design (\(\rho\) is large).25

    If the new module is as reliable as and , that is, if \[R_{\text{GMR}} =R_{\text{NEW}} = R_{\text{NAS}} = 0.9999,\] then the single thread design has a combined reliability of \(0.9997\) (i.e., \(3\) failures in \(10,000\) hours), whereas the dual thread design has a combined reliability of \(0.9998\) (i.e., \(2\) failures in \(10,000\) hours).

    If the probability of failure is independent for each component, we could conclude from this that the reliability gain from a dual thread design probably does not justify the extra cost.

In the last two examples, we had to make additional assumptions in order to answer the questions – this is often the case in practice.

Conditional Probability

It is easier to understand independence of events through the conditional probability of an event \(B\) given that another event \(A\) has occurred, defined by as \[\begin{aligned} {P(B \mid A)}=\frac{P(A\cap B)}{P(A)}\,.\end{aligned}\] Note that this definition only makes sense when “\(A\) can happen” i.e., \(P(A)>0\). If \(P(A)P(B)>0\), then \[P(A\cap B)=P(A)\times P(B \mid A)=P(B)\times P(A \mid B)=P(B\cap A);\] \(A\) and \(B\) are thus independent if \(P(B \mid A)=P(B)\) and \(P(A \mid B)=P(A)\).


  • From a group of 100 people, 1 is selected. What is the probability that this person has high blood pressure (HBP)?

    Answer: if we know nothing else about the population, this is an (unconditional) probability, namely \[P(\text{HBP})=\frac{\# \text{individuals with HBP in the population}}{100}\,.\]

  • If instead we first filter out all people with low cholesterol level, and then select 1 person. What is the probability that this person has HBP?

    Answer: this is the conditional probability \[P(\text{HBP} \mid \text{high cholesterol});\] the probability of selecting a person with HBP, given high cholesterol levels, presumably different from \(P(\text{HBP} \mid \text{low cholesterol})\).

  • A sample of \(249\) individuals is taken and each person is classified by blood type and tuberculosis (TB) status.

    O A B AB Total
    TB \(34\) \(37\) \(31\) \(11\) \(113\)
    no TB \(55\) \(50\) \(24\) \(7\) \(136\)
    Total \(89\) \(87\) \(55\) \(18\) \(249\)

    The (unconditional) probability that a random individual has TB is \(P(\text{TB})=\frac{\# \text{TB}}{249}=\frac{113}{249} = 0.454\). Among those individuals with type B blood, the (conditional) probability of having TB is \[P(\text{TB} \mid \text{type {\bf B}})= \frac{P(\text{TB}\cap \text{type {\bf B}})}{P(\text{type {\bf B}})}=\frac{31}{55} = \frac{31/249}{55/249}= 0.564.\]

  • A family has two children (not twins). What is the probability that the youngest child is a girl given that at least one of the children is a girl? Assume that boys and girls are equally likely to be born.

    Answer: let \(A\) and \(B\) be the events that the youngest child is a girl and that at least one child is a girl, respectively: \[A=\{\text{GG},\text{BG}\} \quad\mbox{and}\quad B=\{\text{GG},\text{BG},\text{GB}\},\] \(A\cap B=A.\) Then \(P(A \mid B)=\frac{P(A\cap B)}{P(B)} = \frac{P(A)}{P(B)}=\frac{2/4}{3/4}=\frac{2}{3}\) (and not \(\frac12\), as might naively be believed).
    Incidentally, \(P(A\cap B)=P(A)\neq P(A)\times P(B)\), which means that \(A\) and \(B\) are not independent events.

Law of Total Probability

Let \(A\) and \(B\) be two events. From set theory, we have \[B=(A\cap B)\cup (\overline{A}\cap B),\] as illustrated in Figure 3.3.

Decomposition of $B$ *via* $A$.Decomposition of $B$ *via* $A$.Decomposition of $B$ *via* $A$.

Figure 3.3: Decomposition of \(B\) via \(A\).

Note that \(A\cap B\) and \(\overline{A}\cap B\) are mutually exclusive, so that, according to Axiom \(\textbf{A4}\), we have \[P(B)=P(A\cap B)+P(\overline{A}\cap B).\] Now, assuming that \(\varnothing \neq A \neq \mathcal{S}\), \[P(A\cap B) = P(B \mid A)P(A) \quad\mbox{and} \quad P(\overline{A}\cap B)=P(B \mid \overline{A})P(\overline{A}),\] so that \[P(B) = P(B \mid A)P(A)+P(B \mid \overline{A})P(\overline{A}).\] This generalizes as follows: if \(A_1,...A_k\) are mutually exclusive and exhaustive (i.e., $A_iA_j=$ for all \(i\neq j\) and \(A_1\cup ....\cup A_k=\mathcal{S})\), then for any event \(B\) \[\begin{aligned} P(B)&=\sum_{j=1}^kP(B \mid A_j)P(A_j)\\&=P(B \mid A_1)P(A_1 )+...+P(B \mid A_k)P(A_k).\end{aligned}\]

Example: use the Law of Total Probability (rule above) to compute \(P(\text{TB})\) using the data from the previous example.

Answer: the blood types \(\{\textbf{O},\textbf{A},\textbf{B},\textbf{AB}\}\) form a mutually exclusive partition of the population, with \[P(\textbf{O})=\frac{89}{249},\ P(\textbf{A})=\frac{87}{249},\ P(\textbf{B})=\frac{55}{249},\ P(\textbf{AB})=\frac{18}{249}.\] It is easy to see that \(P(\textbf{O})+ P(\textbf{A})+P(\textbf{B})+ P(\textbf{AB})=1\). Furthermore, \[\begin{aligned} P(\text{TB} \mid \textbf{O})&=\textstyle{\frac{P(\text{TB}\cap\textbf{O})}{P(\textbf{O})}=\frac{34}{89},\ P(\text{TB} \mid \textbf{A})=\frac{P(\text{TB}\cap\textbf{A})}{P(\textbf{A})}=\frac{37}{87}},\\ P(\text{TB} \mid \textbf{B})&=\textstyle{\frac{P(\text{TB}\cap\textbf{B})}{P(\textbf{B})}=\frac{31}{55},\ P(\text{TB} \mid \textbf{AB})=\frac{P(\text{TB}\cap\textbf{AB})}{P(\textbf{AB})}=\frac{11}{18}}.\end{aligned}\]

According to the law of total probability, \[\begin{aligned} P(\text{TB})=P(\text{TB} &\mid \textbf{O})P(\textbf{O})+P(\text{TB} \mid \textbf{A})P(\textbf{A})\\ & +P(\text{TB} \mid \textbf{B})P(\textbf{B})+P(\text{TB} \mid \textbf{AB})P(\textbf{AB}), \end{aligned}\] so that \[\begin{aligned} P(\text{TB})&=\frac{34}{89}\cdot\frac{89}{249}+\frac{37}{87}\cdot\frac{87}{249}+\frac{31}{55}\cdot\frac{55}{249}+\frac{11}{18}\cdot\frac{18}{249}\\ &=\frac{34+37+31+11}{249}=\frac{113}{249}=0.454, \end{aligned}\] which matches with the result of the previous example.

3.1.7 Bayes’ Theorem

After an experiment generates an outcome, we are often interested in the probability that a certain condition was present given an outcome (or that a particular hypothesis was valid, say).

We have noted before that if \(P(A)P(B)>0\), then \[P(A\cap B)=P(A)\times P(B \mid A)=P(B)\times P(A \mid B)=P(B\cap A);\] this can be re-written as Bayes’ Theorem: \[P(A \mid B)=\frac{P(B \mid A) \times P(A)}{P(B)}.\] Bayes’ Theorem is a powerful tool in probability analysis, but it is a simple corollary of the rules of probability.

Central Data Analysis Question

Given everything that was known prior to the experiment, does the collected/observed data support (or invalidate) the hypothesis/presence of a certain condition?

The problem is that this is usually impossible to compute directly. Bayes’ Theorem offers a possible solution: \[\begin{aligned} P(\text{hypothesis} \mid \text{data})&=\frac{P(\text{data} \mid \text{hypothesis})\times P(\text{hypothesis})}{P(\text{data})} \\ &\propto P(\text{data} \mid \text{hypothesis})\times P(\text{hypothesis}),\end{aligned}\] in which the terms on the right might be easier to compute than the term on the left.

Bayesian Vernacular

In Bayes’ Theorem:

  • \(P(\text{hypothesis})\) is the probability of the hypothesis being true prior to the experiment (called the prior);

  • \(P(\text{hypothesis} \mid \text{data})\) is the probability of the hypothesis being true once the experimental data is taken into account (called the posterior);

  • \(P(\text{data} \mid \text{hypothesis})\) is the probability of the experimental data being observed assuming that the hypothesis is true (called the likelihood).

The theorem is often presented as \(\text{posterior} \propto \text{likelihood} \times \text{prior}\), which is to say, beliefs should be updated in the presence of new information.


If \(A,B\) are events for which \(P(A)P(B)>0\), then Bayes’ Theorem can be re-written, using the law of total probability, as \[P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)}=\frac{P(B \mid A)P(A)}{P(B \mid A)P(A)+P(B \mid \overline{A})P(\overline{A})},\] or, in the general case where \(A_1,...A_k\) are mutually exclusive and exhaustive events, then for any event \(B\) and for each \(1\leq i\leq k\), \[\begin{aligned} P(A_i \mid B)&= \frac{P(B \mid A_i)P(A_i)}{P(B)}\\ &=\frac{P(B \mid A_i)P(A_i)}{P(B \mid A_1)P(A_1 )+...+P(B \mid A_k)P(A_k)}.\end{aligned}\]


  • In 1999, Nissan sold three car models in North America: Sentra (S), Maxima (M), and Pathfinder (PA). Of the vehicles sold that year, \(50\%\) were S, \(30\%\) were M and \(20\%\) were PA. In the same year \(12\%\) of the S, \(15\%\) of the M, and \(25\%\) of the PA had a particular defect \(D\).

    1. If you own a 1999 Nissan, what is the probability that it has the defect?

      in the language of conditional probability, \[\begin{aligned} P(\text{S})&=0.5, \ P(\text{M})=0.3, \ P(\text{Pa})=0.2, \\ P(D \mid \text{S})&=0.12, \ P(D \mid \text{M})=0.15, \ P(D \mid \text{PA})=0.25, \end{aligned}\]so that \[\begin{aligned} P(&D)=P(D \mid \text{S})\times P(\text{S})+P(D \mid \text{M})\times P(\text{M})\\&+P(D \mid \text{Pa})\times P(\text{Pa}) \\ &=0.12\cdot 0.5+0.15\cdot 0.3+0.25\cdot 0.2\\&= 0.155 = 15.5\%.\end{aligned}\]

    2. If a 1999 Nissan has defect \(D\), what model is it likely to be?

      in the first part we computed the total probability \(P(D)\); in this part, we compare the posterior probabilities \(P(\text{M} \mid D)\), \(P(\text{S} \mid D)\), and \(P(\text{Pa} \mid D)\) (and not the priors!), computed using Bayes’ Theorem: \[\begin{aligned} P(\text{S} \mid D)&=\textstyle{\frac{P(D \mid \text{S})P(\text{S})}{P(D)}=\frac{0.12\times 0.5}{0.155}\approx 38.7\%} \\ P(\text{M} \mid D)&=\textstyle{\frac{P(D \mid \text{M})P(\text{M})}{P(D)}=\frac{0.15\times 0.3}{0.155}\approx 29.0\%} \\ P(\text{Pa} \mid D)&=\textstyle{\frac{P(D \mid \text{Pa})P(\text{Pa})}{P(D)}=\frac{0.25\times 0.2}{0.155}\approx 32.3\%} \end{aligned}\] Even though Sentras are the least likely to have the defect \(D\), their overall prevalence in the population carry them over the hump.

  • Suppose that a test for a particular disease has a very high success rate. If a patient

    • has the disease, the test reports a ‘positive’ with probability 0.99;

    • does not have the disease, the test reports a ‘negative’ with prob 0.95.

    Assume that only 0.1% of the population has the disease. What is the probability that a patient who tests positive does not have the disease?

    Answer: Let \(D\) be the event that the patient has the disease, and \(A\) be the event that the test is positive. The probability of a true positive is \[\begin{aligned} P(D \mid A)&=\frac{P(A \mid D)P(D)}{P(A \mid D)P(D)+P(A \mid D^c)P(D^c)}\\ &=\frac{0.99\times 0.001}{0.99\times 0.001 + 0.05\times 0.999}\approx 0.019.\end{aligned}\] The probability of a false positive is thus \(1-0.019\approx 0.981\). Despite the apparent high accuracy of the test, the incidence of the disease is so low (\(1\) in a \(1000\)) that the vast majority of patients who test positive (\(98\) in \(100\)) do not have the disease.

    The \(2\) in \(100\) who are true positives still represent \(20\) times the proportion of positives found in the population (before the outcome of the test is known).[It is important to remember that when dealing with probabilities, both the likelihood and the prevalence have to be taken into account.]

  • (Monty Hall Problem) On a game show, you are given the choice of three doors. Behind one of the doors is a prize; behind the others, dirty and smelly rubbish bins (as is skillfully rendered in Figure 3.4).

    You pick a door, say No. 1, and the host, who knows what is behind the doors, opens another door, say No. 3, behind which is a bin. She then says to you, “Do you want to switch from door No. 1 to No. 2?”

    Is it to your advantage to do so?

    The Monty  Hall     set-up.

    Figure 3.4: The Monty Hall set-up (personal file, but that was probably obvious).

    Answer: in what follows, let and be the events that switching to another door is a successful strategy and that the prize is behind the original door, respectively.

    • Let’s first assume that the host opens no door. What is the probability that switching to another door in this scenario would prove to be a successful strategy?

      If the prize is behind the original door, switching would succeed 0% of the time: \[P(\text{S} \mid \text{D})=0.\] Note that the prior is \(P(\text{D})=1/3\).

      If the prize is not behind the original door, switching would succeed 50% of the time: \[P(\text{S} \mid \text{D}^c)=1/2.\] Note that the prior is \(P(\text{D}^c)=2/3\). Thus, \[\begin{aligned} P(&\text{S}) = P(\text{S} \mid \text{D}) P(\text{D}) +P(\text{S} \mid \text{D}^c)P(\text{D}^c) \\ & =0\cdot \frac{1}{3}+\frac{1}{2}\cdot \frac{2}{3} \approx 33\%. \quad\end{aligned}\]

    • Now let’s assume that the host opens one of the other two doors to show a rubbish bin. What is the probability that switching to another door in this scenario would prove to be a successful strategy?

      If the prize is behind the original door, switching would succeed 0% of the time: \[P(\text{S} \mid \text{D})=0.\] Note that the prior is \(P(\text{D})=1/3\).

      If the prize is not behind the original door, switching would succeed 100% of the time: \[P(\text{S} \mid \text{D}^c)=1.\] Note that the prior is \(P(\text{D}^c)=2/3\). Thus, \[\begin{aligned} P(&\text{S}) = P(\text{S} \mid \text{D}) P(\text{D}) +P(\text{S} \mid \text{D}^c)P(\text{D}^c) \\&=0\cdot \frac{1}{3}+1\cdot \frac{2}{3} \approx 67\%. \quad\end{aligned}\]

    If no door is opened, switching is not a winning strategy, resulting in success only 33% of the time. If a door is opened, however, switching becomes the winning strategy, resulting in success 67% of the time.

This problem has attracted a lot of attention over the years due to its counter-intuitive result, but there is no paradox when we understand conditional probabilities.

The following simulations may make it easier to see.

Perhaps you would rather see what happens in practice: if you could pit two players against one another (one who never switches and one who always does so) in a series of Monty Hall games, which one would come out on top in the long run?

We start by setting a number of games \(N\) (not too small, or we won’t be able to observe long-run behaviour) and a replicability seed (so that we may all obtain the same results).


Next, for each of game, we will place the prize behind one of the 3 doors: A, B, or C.

locations = sample(c("A","B","C"), N, replace = TRUE) 

We verify that the prize gets placed behind each door roughly 33% of the time:

    A     B     C 
0.302 0.344 0.354 

Le us now obtain a player’s first guess for each game; this guess is completely independent of the actual prize location:

player.guesses = sample(c("A","B","C"), N, replace = TRUE) 

Finally, we create a data frame telling the analyst where the prize actually is, and what door the player has selected as their original guess.

games = data.frame(locations, player.guesses) 
  locations player.guesses
1         B              B
2         B              B
3         A              B
4         C              C
5         A              C
6         A              A

In this example, how often had the player guessed correctly, before a door was opened and they were given a chance to switch?


  333   167 

Is this surprising?

We now initialize the process to find out which door the host opens. For each game, the host opens a door which is not the one selected by the player, nor the one behind which the prize is found.

games$open.door <- NA

for(j in 1:N){
games$open.door[j] <- sample(setdiff(c("A","B","C"), union(games$locations[j],games$player.guesses[j])), 1)

  locations player.guesses open.door
1         B              B         C
2         B              B         C
3         A              B         C
4         C              C         A
5         A              C         B
6         A              A         B

The union() call enumerates the doors that the host cannot open; the setdiff() call finds the complement of the doors that the host cannot open (i.e.: the doors that she can open), and the sample() call picks one of those doors.

If the player never switches, they win whenever they had originally guessed the location of the prize correctly:

games$no.switch.win <- games$player.guess==games$locations

Let’s find which door the player would have selected if they always switched (the door that is neither the location of the prize nor the one they had originally selected):

games$switch.door <- NA

for(j in 1:N){
games$switch.door[j] <- sample(setdiff(c("A","B","C"), union(games$open.door[j],games$player.guesses[j])), 1)

If the player always switches, they win whenever their switched guess is where the prize is located:

games$switch.win <- games$switch.door==games$locations

  locations player.guesses open.door no.switch.win switch.door switch.win
1         B              B         C          TRUE           A      FALSE
2         B              B         C          TRUE           A      FALSE
3         A              B         C         FALSE           A       TRUE
4         C              C         A          TRUE           B      FALSE
5         A              C         B         FALSE           A       TRUE
6         A              A         B          TRUE           C      FALSE

The chances of winning by not switching are thus:


0.666 0.334 

while the chances of winning by switching are:


0.334 0.666 

Pretty wild, eh? Switching IS the better strategy.


E. Ghashim and P. Boily, A Soft Introduction to Bayesian Data Analysis,” Data Science Report Series, 2020.
E. T. Jaynes, Probability Theory: the Logic of Science. Cambridge Press, 2003.
A. Kolmogorov, Foundations of The Theory of Probability. Chelsea Publishing Company, 1933.
Mathematical Association, UK, An Aeroplane’s Guide to A Level Maths.”