## 3.1 Basic Notions

**Probability** theory is the mathematical discipline relating to the numerical description of the likelihood of an event.

### 3.1.1 Sample Spaces and Events

Throughout, we will deal with **random experiments** (e.g., measurements of speed/weight, number and duration of phone calls, etc.).

For any “experiment,” the **sample space** is defined as the set of all its **possible outcomes**, often denoted by the symbol \({\cal S}\). A sample space can be **discrete** or **continuous**. An **event** is a collection of outcomes from the sample space \({\cal S}\). Events will be denoted by \(A\), \(B\), \(E_1\), \(E_2\), etc.

Toss a fair coin – the corresponding (discrete) sample space is \({\cal S} =\{\mbox{Head, Tail}\}\).

Roll a die – the corresponding (discrete) sample space is \({\cal S}=\left\{ 1,2,3,4,5,6 \right\}\), with various events represented by

- rolling an even number: \(\left\{ 2,4,6 \right\}\);
- rolling a prime number: \(\left\{ 2,3,5 \right\}\).

Suppose we measure the weight (in grams) of a chemical sample – the (continuous) sample space can be represented by \(\mathcal{S}=(0,\infty)\), the positive half line, and various events by subsets of \(\mathcal{S}\), such as

- sample is less than 1.5 grams: \((0,1.5)\);
- sample exceeds 5 grams: \((5,\infty)\).

For any events \(A,B\subseteq {\cal S}\):

- the **union** \(A\cup B\) of \(A\) and \(B\) is the set of all outcomes in \({\cal S}\) contained in either \(A\) or \(B\);
- the **intersection** \(A\cap B\) of \(A\) and \(B\) is the set of all outcomes in \({\cal S}\) contained in both \(A\) and \(B\);
- the **complement** \(A^c\) of \(A\) (sometimes denoted \(\overline{A}\) or \(-A\)) is the set of all outcomes in \({\cal S}\) that are **not** in \(A\).

If \(A\) and \(B\) have no outcomes in common, they are **mutually exclusive**, which is denoted by \(A\cap B=\varnothing\) (the empty set).
In particular, \(A\) and \(A^{c}\) are always mutually exclusive.^{20}

Roll a die and let \(A=\left\{ 2,3,5 \right\}\) (a prime number) and \(B=\left\{ 3,6 \right\}\) (multiples of \(3\)). Then \(A\cup B=\left\{ 2,3,5,6 \right\}\), \(A\cap B=\left\{3 \right\}\) and \(A^{c}=\left\{ 1,4,6 \right\}\).

\(100\) plastic samples are analyzed for scratch and shock resistance.

|                             | high shock resistance | low shock resistance |
|-----------------------------|----------------------:|---------------------:|
| **high scratch resistance** | \(70\)                | \(4\)                |
| **low scratch resistance**  | \(1\)                 | \(25\)               |

If \(A\) is the event that a sample has high shock resistance and \(B\) is the event that a sample has high scratch resistance, then \(A\cap B\) consists of \(70\) samples.

### 3.1.2 Counting Techniques

A **two-stage procedure** can be modeled as having \(k\) bags, with \(m_1\)
items in the first bag, …, \(m_k\) items in \(k\)-th bag.

The first stage consists of picking a bag, and the second stage consists of drawing an item out of that bag. This is equivalent to picking one of the \(m_1+\cdots+m_k\) total items.

If all the bags have the same number of items \[m_1=\cdots=m_k=n,\] then there are \(kn\) items in total, and this is the total number of ways the two-stage procedure can occur.

How many ways are there to first roll a die and then draw a card from a (shuffled) \(52\)-card pack?

**Answer:** there are \(6\) ways the first step can turn out, and for each of these (the stages are independent, in fact) there are \(52\) ways to draw the card. Thus there are \(6\times52=312\) ways this can turn out.

How many ways are there to draw two tickets numbered \(1\) to \(100\) from a bag, the first with the right hand and the second with the left hand?

**Answer:** there are \(100\) ways to pick the first number; for *each of these* there are \(99\) ways to pick the second number. Thus there are \(100\times 99=9900\) ways.

#### Multi-Stage Procedures

A **\(k\)-stage process** is a process for which:

- there are \(n_1\) possibilities at stage 1;
- regardless of the 1st outcome, there are \(n_2\) possibilities at stage 2;
- …
- regardless of the previous outcomes, there are \(n_k\) choices at stage \(k\).

There are then \[n_1 \times n_2\times \cdots\times n_k\] total ways the process can turn out.
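The multi-stage counting rule is easy to verify numerically; here is a minimal Python sketch (the helper name `n_ways` is ours, not from the text), applied to the two examples above:

```python
from math import prod

# number of ways a k-stage process can turn out: multiply per-stage counts
def n_ways(stage_counts):
    return prod(stage_counts)

die_then_card = n_ways([6, 52])     # roll a die, then draw a card
two_tickets = n_ways([100, 99])     # two tickets drawn without replacement
```

The same function covers any number of stages, e.g., `n_ways([10] * 6)` for a six-digit code with repeated digits allowed.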

### 3.1.3 Ordered Samples

Suppose we have a bag of \(n\) billiard balls numbered \(1,\ldots,n\). We can draw an **ordered sample** of size \(r\) by picking balls from the bag:

**with replacement**, or **without replacement**.

How many different collections of \(r\) balls can we end up with in each of those cases (each is an \(r\)-stage procedure)?

**Key Notion:** all the objects (balls) can be differentiated (using
numbers, colours, etc.)

#### Sampling With Replacement (Order Important)

If we replace each ball into the bag after it is picked, then every draw is identical: there are \(n\) ways each one can turn out.

According to our earlier result, there are
\[\underbrace{\ n\times n \times \cdots \times n\ }_{\text{$r$ stages}}= n^r\]
ways to select an ordered sample of size \(r\) **with replacement** from a
set with \(n\) objects \(\left\{ 1,2,\ldots,n \right\}\).

#### Sampling Without Replacement (Order Important)

If we **do not** replace each ball into the bag after it is drawn, then
the choices for the second draw depend on the result of the first draw,
and there are only \(n-1\) possible outcomes.

Whatever the first two draws were, there are \(n-2\) ways to draw the third ball, and so on.

Thus there are
\[\underbrace{n\times (n-1)\times\cdots\times (n-r+1)}_{\text{$r$ stages}} =
{_nP_r}\ \ \text{ (common symbol)}\] ways to select an ordered sample of
size \(r\le n\) **without replacement** from a set of \(n\) objects
\(\left\{ 1,2,\ldots,n \right\}\).

#### Factorial Notation

For a positive integer \(n\), write \[n!=n(n-1)(n-2)\cdots1.\] There are two possibilities:

- when \(r=n\), \(_nP_r=n!\), and the ordered selection (without replacement) is called a **permutation**;
- when \(r<n\), we can write \[\begin{aligned} _nP_r&= n(n-1)\cdots(n-r+1)\times \frac {(n-r)\cdots 1} {(n-r)\cdots 1} \\&= \frac{n!}{(n-r)!}.\end{aligned}\]

By convention, we set \(0!=1\), so that \[_nP_r=\frac{n!}{(n-r)!},\quad \text{for all $r\le n$.}\]

In how many different ways can 6 balls be drawn *in order*, without replacement, from a bag of balls numbered 1 to 49?

**Answer:** we compute \[_{49}P_6=49\times 48\times 47\times 46\times 45\times 44 = 10,068,347,520.\] This is the number of ways the actual drawing of the balls can occur in real time for Lotto 6/49 (balls drawn one by one).

How many 6-digit PIN codes can you create from the set of digits \(\{0,1,\ldots,9\}\)?

**Answer:** if the digits may be repeated, there are \[10\times 10\times 10\times 10\times 10\times 10=10^6=1,000,000\] possible codes. If the digits may not be repeated, we have instead \[_{10}P_6=10\times 9\times 8\times 7\times 6\times 5=151,200.\]
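Python's standard library can confirm these counts; `math.perm` computes \(_nP_r\) directly:

```python
from math import perm

lotto_ordered = perm(49, 6)    # 49P6: ordered draw without replacement
pin_repeats = 10 ** 6          # six digits, repetition allowed
pin_no_repeats = perm(10, 6)   # 10P6: no repeated digits
```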

### 3.1.4 Unordered Samples

Suppose now that we **cannot** distinguish between different ordered
samples; when we look up the Lotto 6/49 results in the newspaper, for
instance, we have no way of knowing the order in which the balls were
drawn: \[1-2-3-4-5-6\] could mean that the first drawn ball was ball #
\(1\), the second drawn ball was ball # \(2\), etc., but it could also mean
that the first ball drawn was ball # \(4\), the second one, ball # \(3\),
etc., or any other ordering of those 6 balls.

Denote the (as yet unknown) number of unordered samples of size \(r\) from a set of size \(n\) by \(_nC_r\). We can derive the expression for \(_nC_r\) by noting that the following two processes are equivalent:

- take an **ordered** sample of size \(r\) (there are \(_nP_r\) ways to do this);
- take an **unordered** sample of size \(r\) (there are \(_nC_r\) ways to do this) **and then** rearrange (permute) the objects in the sample (there are \(r!\) ways to do this).

Thus \[_nP_r = {_nC_r}\times r! \implies {_nC_r} = \frac{_nP_r}{r!} = \frac{n!}{(n-r)!\ r!} = \binom n r\,.\] This
last notation is called a **binomial coefficient** (read as
“\(n\)-choose-\(r\)”) and is commonly used in textbooks.

**Example:** in how many ways can the “Lotto 6/49 draw” be reported in the newspaper (if they are always reported in increasing order)?

**Answer:** this number is the same as the number of *unordered samples* of size 6 (different re-orderings of the same 6 numbers are indistinguishable), so \[\begin{aligned} _{49}C_6&=\binom {49}6 =\frac{49\times 48\times 47\times 46\times 45\times 44} {6\times 5\times4\times3\times2\times 1} \\ &= \frac{ 10,068,347,520 }{720} = 13,983,816\,.\end{aligned}\] A variety of binomial coefficient identities exist, such as \[\begin{aligned} \binom{n}{k}&=\binom{n}{n-k},\quad \text{for all $0\leq k\leq n$}, \\ \sum_{k=0}^n\binom{n}{k}&=2^n, \quad \text{for all $0\leq n$}, \\ \binom{n+1}{k+1}&=\binom{n}{k}+\binom{n}{k+1},\quad \text{for all $0\leq k\leq n-1$}, \\ \sum_{j=k}^n\binom{j}{k}&=\binom{n+1}{k+1}, \quad \text{for all $0\leq k\leq n$, etc.}\end{aligned}\]
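The Lotto count and the identities above can be spot-checked with `math.comb`, which computes \(\binom{n}{r}\); the choice of \(n=10\) for the identity checks is arbitrary:

```python
from math import comb

lotto_unordered = comb(49, 6)   # 49 choose 6

# spot-check three of the identities for n = 10
n = 10
symmetry = all(comb(n, k) == comb(n, n - k) for k in range(n + 1))
row_sum = sum(comb(n, k) for k in range(n + 1)) == 2 ** n
pascal = all(comb(n + 1, k + 1) == comb(n, k) + comb(n, k + 1)
             for k in range(n))
```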

### 3.1.5 Probability of an Event

For situations where we have a random experiment which has exactly \(N\)
possible **mutually exclusive**, **equally likely** outcomes, we can
assign a probability to an event \(A\) by counting the number of outcomes
that correspond to \(A\) – its **relative frequency**.

If that count is \(a\), then \[P(A) = \frac{a}{N}.\] The probability of each individual outcome is thus \(1/N\).

**Examples:**

Toss a fair coin – the sample space is \({\cal S} = \{\mbox{Head, Tail}\}\), i.e., \(N=2\). The probability of observing a Head on a toss is thus \(\frac{1}{2}.\)

Throw a fair six sided die. There are \(N=6\) possible outcomes. The sample space is \[{\cal S} = \{1,2,3,4,5,6\}.\] If \(A\) corresponds to observing a multiple of \(3\), then \(A=\{3,6\}\) and \(a=2\), so that \[\mbox{Prob(number is a multiple of $3$)} =P(A) =\frac{2}{6} =\frac{1}{3}.\]

The probabilities of seeing an even/odd number are: \[\begin{aligned} \text{Prob$\left\{ \text{even} \right\}$}&=P\left(\left\{ 2,4,6 \right\}\right)=\frac{3}{6}=\frac{1}{2}; \\ \text{Prob$\left\{ \text{prime} \right\}$} &=P\left(\left\{ 2,3,5 \right\}\right)=1-P\left(\left\{ 1,4,6 \right\}\right)=\frac{1}{2}.\end{aligned}\]
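For equally likely outcomes, these probabilities are just counting exercises; a small sketch using exact rational arithmetic (the helper `P` is ours):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # fair die: equally likely outcomes

def P(event):
    # relative frequency: favourable outcomes over total outcomes
    return Fraction(len(event & S), len(S))

p_multiple_of_3 = P({3, 6})
p_even = P({2, 4, 6})
p_prime = P({2, 3, 5})
```

Working with `Fraction` avoids floating-point rounding and matches the exact values \(1/3\) and \(1/2\) in the text.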

In a group of \(1000\) people it is known that \(545\) have high blood pressure. \(1\) person is selected randomly. What is the probability that this person has high blood pressure?

**Answer:** the relative frequency of people with high blood pressure is \(545/1000=0.545\), which is the desired probability.

This approach to probability is called the **frequentist
interpretation**. It is based on the idea that the theoretical
probability of an event is given by the behaviour of the empirical
(observed) relative frequency of the event over long-run repeatable and
independent experiments (i.e., when \(N\to \infty\)).

This is the classical definition, and the one used in this document, but
there are competing interpretations which may be more appropriate
depending on the context; chiefly, the **Bayesian interpretation** (see
[26], [27] for details) and the **propensity interpretation**
(introducing causality as a mechanism).

#### Axioms of Probability

The modern definition of probability is **axiomatic** (according to
Kolmogorov’s seminal work [28]).

The **probability of an event** \(A\subseteq \mathcal{S}\) is a numerical
value satisfying the following properties:

- **A1:** for any event \(A\), \(0\le P(A) \le 1\);
- **A2:** for the complete sample space \({\cal S}\), \(P({\cal S}) = 1\);
- **A3:** for the empty event \(\varnothing\), \(P(\varnothing )=0\); and
- **A4:** for two **mutually exclusive** events \(A\) and \(B\), the probability that \(A\) or \(B\) occurs is \(P(A\cup B)=P(A) + P(B).\)

Since \({\cal S}=A\cup A^c\), and \(A\) and \(A^c\) are mutually exclusive, then \[\begin{aligned} 1 &\stackrel{\textbf{A2}}= P\left( {\cal S} \right) =P\left( A\cup A^c \right) \stackrel{\textbf{A4}}=P(A)+P\left( A^c \right) \\ &\implies {P(A^c)=1-P(A)}.\end{aligned}\]

**Examples:**

Throw a single six-sided die and record the number that is shown. Let \(A\) and \(B\) be the events that the number is a multiple of \(3\) and that it is smaller than \(3\), respectively. Then \(A = \{3,6\}\), \(B = \{1,2\}\), and \(A\) and \(B\) are mutually exclusive since \(A\cap B=\varnothing\). Then \[P(A \mbox{ or } B) = P(A\cup B)=P(A) + P(B) = \frac{2}{6}+\frac{2}{6} = \frac{2}{3}.\]

An urn contains \(4\) white balls, \(3\) red balls and \(1\) black ball. Draw one ball, and denote the following events by \(W =\{\mbox{the ball is white}\}\), \(R =\{\mbox{the ball is red}\}\) and \(B =\{\mbox{the ball is black}\}\). Then \[P(W)=1/2,\quad P(R)=3/8,\quad P(B)=1/8,\] and \(P(W\mbox{ or } R)=7/8.\)

#### General Addition Rule

This useful rule is a direct consequence of the axioms of probability: \[P(A\cup B) = P(A)+P(B)-P(A\cap B).\]

**Example:** an electronic gadget
consists of two components, \(A\) and \(B\). We know from experience that
\(P( \text{$A$
fails})= 0.2\), \(P( \text{$B$ fails} )= 0.3\) and
\(P( \text{both $A$ and $B$ fail} )= 0.15\). Find
\(P( \text{at least one of $A$ and $B$ fails} )\) and
\(P( \text{neither $A$ nor $B$ fails} )\).

**Answer:** write \(A\) for “\(A\) fails” and similarly for \(B\). Then we are
looking to compute \[\begin{aligned}
P( \text{at least one fails} ) &= P(A\cup B) \\ &=
P(A)+P(B)-P(A\cap B) = 0.35\,;\\ P( \text{neither fail})&= 1-P( \text{at least one fails} )=0.65\,.\end{aligned}\]
If \(A\), \(B\) are mutually exclusive, \(P(A\cap B) =P(\varnothing )=0\) and
\[P(A\cup B)=P(A)+P(B)-P(A\cap B)=P(A)+P(B).\] With three events, the
addition rule expands as follows: \[\begin{aligned}
P(A\cup B\cup C)=& P(A)+P(B)+P(C)\\
& \quad -P(A\cap B)-P(A\cap C)-P(B\cap C) \\ & \quad +P(A\cap B\cap C).\end{aligned}\]

### 3.1.6 Conditional Probability and Independent Events

Any two events \(A\) and \(B\) satisfying
\[P\left( A\cap B \right) = P(A)\times P(B)\] are said to be
**independent**.^{21} When events are not independent, we say that they
are **dependent** or **conditional**.

Mutual exclusivity and independence are unrelated concepts. The only way
for events \(A\) and \(B\) to be mutually exclusive **and** independent is
for either \(A\) or \(B\) (or both) to be a non-event (the empty event):
\[\begin{aligned}
0 &=P(\varnothing)=P(A \cap B)=P(A)\times P(B) \implies P(A)=0\text{ or }P(B)=0 \\ & \implies A=\varnothing \text{ or }B=\varnothing .\end{aligned}\]

**Examples:**

Flip a **fair** coin twice – the 4 possible outcomes are all equally likely: \(\mathcal{S}=\{HH,HT,TH,TT\}\). Let \[A=\{HH\}\cup \{HT\}\] denote “head on first flip”, \(B=\{HH\}\cup \{TH\}\) “head on second flip”. Note that \(A\cup B\neq \mathcal{S}\) and \(A\cap B=\{HH\}\). By the general addition rule, \[\begin{aligned} P\left( A \right) & = P(\{HH\})+P(\{HT\})-P(\{HH\}\cap \{HT\}) \\ &= \frac14+\frac14-P(\varnothing ) = \frac{1}{2}-0 =\frac12\,.\end{aligned}\] Similarly, \(P\left( B \right) = P(\{HH\})+P(\{TH\}) = \frac12\), and so \(P(A)P(B)=\frac14\). But \(P(A\cap B) = P(\{HH\})\) is also \(\frac14\), so \(A\) and \(B\) are independent.

A card is drawn from a regular well-shuffled 52-card North American deck. Let \(A\) be the event that it is an ace and \(D\) be the event that it is a diamond. These two events are independent. Indeed, there are \(4\) aces \[P(A)=\frac{4}{52}=\frac{1}{13}\] and \(13\) diamonds \[P(D)=\frac{13}{52}=\frac{1}{4}\] in such a deck, so that \[P(A)P(D)=\frac{1}{13}\times\frac{1}{4}=\frac{1}{52}\,,\] and exactly \(1\) ace of diamonds in the deck, so that \(P(A \cap D)\) is also \(\frac{1}{52}\).

A six-sided die numbered \(1-6\) is loaded in such a way that the probability of rolling each value is *proportional* to that value. Find \(P(3)\).

**Answer:** let \(\mathcal{S}=\{1,2,3,4,5,6\}\) be the sample space of values showing after a single toss; for some proportionality constant \(v\), we have \(P(k)=kv\), for \(k\in \mathcal{S}\). By Axiom \(\textbf{A2}\), \(P(\mathcal{S})=P(1)+\cdots+P(6)=1\), so that \[1=\sum_{k=1}^6P(k)=\sum_{k=1}^6kv=v\sum_{k=1}^6k=v\frac{(6+1)(6)}{2}=21v\,.\] Hence \(v=1/21\) and \(P(3)=3v=3/21=1/7\).

Now the die is rolled twice, the second toss *independent* of the first. Find \(P(3_1\cap 3_2)\).

**Answer:** the experiment is such that \(P(3_1)=1/7\) and \(P(3_2)=1/7\), as seen in the previous example. Since the die tosses are independent,^{22} then \[P\left(3_1\cap3_2 \right) =P(3_1)P(3_2)=1/49\,.\]

Is a 2-engine plane more likely to be forced down than a 3-engine plane?

**Answer:** this question is easier to answer if we assume that **engines fail independently** (this is no doubt convenient, but the jury is still out as to whether it is realistic). In what follows, let \(p\) be the probability that an engine fails.^{23} The next step is to decide what type of engine failure will force a plane down:

A 2-engine plane will be forced down if both engines fail^{24} – the probability is \(p^2\);

A 3-engine plane will be forced down if any pair of engines fail, or if all 3 fail.

**Pair:** the probability that a given pair of engines fails independently (i.e., two engines fail and one does not) is \[p\times p\times (1 - p).\] The order in which the engines fail does not matter: there are \(_3C_2=\frac{3!}{2!1!}=3\) ways in which a pair of engines can fail: for 3 engines A, B, C, these are AB, AC, BC.

**All 3:** the probability of all three engines failing independently is \(p^3\).

The probability of \(\geq 2\) engines failing is thus \[P(2+ \text{ engines fail})=3p^2(1 - p) + p^3 = 3p^2 - 2p^3.\]

Consequently, for \(0<p<1\), it is safer to use a 2-engine plane than a 3-engine plane: the 3-engine plane will be forced down more often, assuming it needs \(2\) engines to fly.

This “makes sense”: the 2-engine plane needs 50% of its engines working, while the 3-engine plane needs 66% (see Figure 3.1 to get a sense of what the probabilities are for \(0\leq p\leq 1\)).
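The comparison between the two designs can be checked numerically (the function names are ours; both formulas assume independent engine failures, as in the text):

```python
# probability each design is forced down, as a function of p
def p_down_2(p):
    return p ** 2                      # both engines fail

def p_down_3(p):
    return 3 * p**2 * (1 - p) + p**3   # some pair fails, or all three

# for every 0 < p < 1 the 3-engine plane fares worse
worse = all(p_down_3(p) > p_down_2(p) for p in [0.01, 0.1, 0.5, 0.9, 0.99])
```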

(Taken from [29]) Air traffic control is a safety-related activity – each piece of equipment is designed to the highest safety standards and in many cases duplicate equipment is provided so that if one item fails another takes over.

A new system is to be provided passing information from Heathrow Airport to Terminal Control at West Drayton. As part of the system design, a decision has to be made as to whether it is necessary to provide duplication. The new system takes data from the *Ground Movements Radar* (GMR) at Heathrow, combines this with data from the *National Airspace System* (NAS), and sends the output to a display at *Terminal Control* (a conceptual model is shown in Figure 3.2).

For all existing systems, records of failure are kept and an experimental probability of failure is calculated annually using the previous 4 years.

The **reliability** of a system is defined as \(R=1 - P\), where \(P=P(\text{failure})\). We are given that \(R_{\text{GMR}} = R_{\text{NAS}} = 0.9999\) (i.e., \(1\) failure in \(10,000\) hours), and we assume that the components’ failure probabilities are independent.

For the system above, if a single new module is introduced, the reliability of the system (STD – **single thread design**) is \[R_{\text{STD}}=R_{\text{GMR}} \times R_{\text{NEW}} \times R_{\text{NAS}}.\] If the module is duplicated, the reliability of the **dual thread design** is \[R_{\text{DTD}}=R_{\text{GMR}} \times (1 - (1 - R_{\text{NEW}} )^2) \times R_{\text{NAS}}.\] Duplicating the module causes an improvement in reliability of \[\rho=\frac{R_{\text{DTD}}}{R_{\text{STD}}}=\frac{1 - (1 - R_{\text{NEW}} )^2}{R_{\text{NEW}}} \times 100\%\,.\] For the new module, no historical data is available. Instead, we work out the improvement achieved by using the dual thread design for various values of \(R_{\text{NEW}}\).

| \(R_{\text{NEW}}\) | 0.1 | 0.2 | 0.5 | 0.75 | 0.99 | 0.999 | 0.9999 | 0.99999 |
|---|---|---|---|---|---|---|---|---|
| \(\rho\) (%) | 190 | 180 | 150 | 125 | 101 | 100.1 | 100.01 | 100.001 |

If the module is very unreliable (i.e., \(R_{\text{NEW}}\) is small), then there is a significant benefit in using the dual thread design (\(\rho\) is large).
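The entries of the table above follow from the formula for \(\rho\); note that it simplifies algebraically to \((2-R_{\text{NEW}})\times 100\%\), which is why the improvement shrinks to \(100\%\) as \(R_{\text{NEW}}\to 1\). A quick check:

```python
# improvement rho (%) from duplicating the new module;
# (1 - (1 - R)^2) / R * 100 simplifies to (2 - R) * 100
def rho(r_new):
    return (1 - (1 - r_new) ** 2) / r_new * 100

rho_unreliable = rho(0.1)     # duplication pays off for a flaky module
rho_reliable = rho(0.9999)    # almost no gain for a reliable one
```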

If the new module is as reliable as the GMR and the NAS, that is, if \[R_{\text{GMR}} =R_{\text{NEW}} = R_{\text{NAS}} = 0.9999,\] then the single thread design has a combined reliability of \(0.9997\) (i.e., \(3\) failures in \(10,000\) hours), whereas the dual thread design has a combined reliability of \(0.9998\) (i.e., \(2\) failures in \(10,000\) hours).^{25}

If the probability of failure is independent for each component, we could conclude from this that the reliability gain from a dual thread design probably does not justify the extra cost.

In the last two examples, we had to make **additional assumptions** in
order to answer the questions – this is often the case in practice.

#### Conditional Probability

It is easier to understand independence of events through the
**conditional probability** of an event \(B\) given that another event \(A\)
has occurred, defined as \[\begin{aligned}
{P(B \mid A)}=\frac{P(A\cap B)}{P(A)}\,.\end{aligned}\] Note that this
definition only makes sense when “\(A\) can happen”, i.e., when \(P(A)>0\). If
\(P(A)P(B)>0\), then
\[P(A\cap B)=P(A)\times P(B \mid A)=P(B)\times P(A \mid B)=P(B\cap A);\]
\(A\) and \(B\) are thus independent if \(P(B \mid A)=P(B)\) and
\(P(A \mid B)=P(A)\).

**Examples:**

From a group of 100 people, 1 is selected. What is the probability that this person has high blood pressure (HBP)?

**Answer:** if we know nothing else about the population, this is an **(unconditional) probability**, namely \[P(\text{HBP})=\frac{\# \text{individuals with HBP in the population}}{100}\,.\]

If instead we first filter out all people with low cholesterol levels and then select 1 person, what is the probability that this person has HBP?

**Answer:** this is the **conditional probability** \[P(\text{HBP} \mid \text{high cholesterol});\] the probability of selecting a person with HBP given high cholesterol levels, which is presumably different from \(P(\text{HBP} \mid \text{low cholesterol})\).

A sample of \(249\) individuals is taken and each person is classified by blood type and tuberculosis (TB) status.

|        | **O** | **A** | **B** | **AB** | Total |
|--------|------:|------:|------:|-------:|------:|
| TB     | \(34\) | \(37\) | \(31\) | \(11\) | \(113\) |
| no TB  | \(55\) | \(50\) | \(24\) | \(7\)  | \(136\) |
| Total  | \(89\) | \(87\) | \(55\) | \(18\) | \(249\) |

The (unconditional) probability that a random individual has TB is \(P(\text{TB})=\frac{\# \text{TB}}{249}=\frac{113}{249} = 0.454\). Among those individuals with type **B** blood, the (conditional) probability of having TB is \[P(\text{TB} \mid \text{type {\bf B}})= \frac{P(\text{TB}\cap \text{type {\bf B}})}{P(\text{type {\bf B}})}=\frac{31/249}{55/249}= \frac{31}{55} = 0.564.\]

A family has two children (not twins). What is the probability that the youngest child is a girl, given that at least one of the children is a girl? Assume that boys and girls are equally likely to be born.

**Answer:** let \(A\) and \(B\) be the events that the youngest child is a girl and that at least one child is a girl, respectively: \[A=\{\text{GG},\text{BG}\} \quad\mbox{and}\quad B=\{\text{GG},\text{BG},\text{GB}\},\] so that \(A\cap B=A.\) Then \(P(A \mid B)=\frac{P(A\cap B)}{P(B)} = \frac{P(A)}{P(B)}=\frac{2/4}{3/4}=\frac{2}{3}\) (and not \(\frac12\), as might naively be believed).

Incidentally, \(P(A\cap B)=P(A)\neq P(A)\times P(B)\), which means that \(A\) and \(B\) are **not** independent events.
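The two-children answer can be confirmed by enumerating the four equally likely families; in this sketch the first component of each pair is the older child:

```python
from fractions import Fraction

# equally likely families as (older child, younger child)
S = {("G", "G"), ("G", "B"), ("B", "G"), ("B", "B")}
A = {s for s in S if s[1] == "G"}   # youngest child is a girl
B = {s for s in S if "G" in s}      # at least one child is a girl

def P(E):
    return Fraction(len(E), len(S))

p_A_given_B = P(A & B) / P(B)
```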

#### Law of Total Probability

Let \(A\) and \(B\) be two events. From set theory, we have \[B=(A\cap B)\cup (\overline{A}\cap B),\] as illustrated in Figure 3.3.

Note that \(A\cap B\) and \(\overline{A}\cap B\) are mutually exclusive, so
that, according to Axiom \(\textbf{A4}\), we have
\[P(B)=P(A\cap B)+P(\overline{A}\cap B).\] Now, assuming that
\(\varnothing \neq A \neq \mathcal{S}\),
\[P(A\cap B) = P(B \mid A)P(A) \quad\mbox{and} \quad P(\overline{A}\cap B)=P(B \mid \overline{A})P(\overline{A}),\]
so that
\[P(B) = P(B \mid A)P(A)+P(B \mid \overline{A})P(\overline{A}).\] This
generalizes as follows: if \(A_1,\ldots,A_k\) are **mutually exclusive** and
**exhaustive** (i.e., \(A_i\cap A_j=\varnothing\) for all \(i\neq j\) and
\(A_1\cup \cdots\cup A_k=\mathcal{S}\)), then for any event \(B\),
\[\begin{aligned}
P(B)&=\sum_{j=1}^kP(B \mid A_j)P(A_j)\\&=P(B \mid A_1)P(A_1
)+\cdots+P(B \mid A_k)P(A_k).\end{aligned}\]

**Example:** use the **Law of Total Probability** (rule above) to compute \(P(\text{TB})\) using the data from the previous example.

**Answer:** the blood types \(\{\textbf{O},\textbf{A},\textbf{B},\textbf{AB}\}\) form
a mutually exclusive partition of the population, with
\[P(\textbf{O})=\frac{89}{249},\ P(\textbf{A})=\frac{87}{249},\ P(\textbf{B})=\frac{55}{249},\ P(\textbf{AB})=\frac{18}{249}.\]
It is easy to see that
\(P(\textbf{O})+ P(\textbf{A})+P(\textbf{B})+ P(\textbf{AB})=1\).
Furthermore, \[\begin{aligned}
P(\text{TB} \mid \textbf{O})&=\textstyle{\frac{P(\text{TB}\cap\textbf{O})}{P(\textbf{O})}=\frac{34}{89},\ P(\text{TB} \mid \textbf{A})=\frac{P(\text{TB}\cap\textbf{A})}{P(\textbf{A})}=\frac{37}{87}},\\ P(\text{TB} \mid \textbf{B})&=\textstyle{\frac{P(\text{TB}\cap\textbf{B})}{P(\textbf{B})}=\frac{31}{55},\ P(\text{TB} \mid \textbf{AB})=\frac{P(\text{TB}\cap\textbf{AB})}{P(\textbf{AB})}=\frac{11}{18}}.\end{aligned}\]

According to the law of total probability, \[\begin{aligned} P(\text{TB})=P(\text{TB} &\mid \textbf{O})P(\textbf{O})+P(\text{TB} \mid \textbf{A})P(\textbf{A})\\ & +P(\text{TB} \mid \textbf{B})P(\textbf{B})+P(\text{TB} \mid \textbf{AB})P(\textbf{AB}), \end{aligned}\] so that \[\begin{aligned} P(\text{TB})&=\frac{34}{89}\cdot\frac{89}{249}+\frac{37}{87}\cdot\frac{87}{249}+\frac{31}{55}\cdot\frac{55}{249}+\frac{11}{18}\cdot\frac{18}{249}\\ &=\frac{34+37+31+11}{249}=\frac{113}{249}=0.454, \end{aligned}\] which matches with the result of the previous example.
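The same computation can be reproduced exactly from the table's counts; the dictionary names below are ours:

```python
from fractions import Fraction

# counts from the TB / blood-type table (TB row and column totals)
tb_counts = {"O": 34, "A": 37, "B": 31, "AB": 11}
total_counts = {"O": 89, "A": 87, "B": 55, "AB": 18}
N = 249

# law of total probability: sum P(TB | type) * P(type) over blood types
p_tb = sum(Fraction(tb_counts[t], total_counts[t])    # P(TB | type t)
           * Fraction(total_counts[t], N)             # P(type t)
           for t in tb_counts)
```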

### 3.1.7 Bayes’ Theorem

After an experiment generates an outcome, we are often interested in the probability that a certain condition was present given an outcome (or that a particular hypothesis was valid, say).

We have noted before that if \(P(A)P(B)>0\), then
\[P(A\cap B)=P(A)\times P(B \mid A)=P(B)\times P(A \mid B)=P(B\cap A);\]
this can be re-written as **Bayes’ Theorem**:
\[P(A \mid B)=\frac{P(B \mid A) \times P(A)}{P(B)}.\] Bayes’ Theorem is
a powerful tool in probability analysis, but it is a simple corollary of
the rules of probability.

#### Central Data Analysis Question

Given everything that was known prior to the experiment, does the collected/observed data support (or invalidate) the hypothesis/presence of a certain condition?

The **problem** is that this is usually impossible to compute directly.
Bayes’ Theorem offers a **possible solution**: \[\begin{aligned}
P(\text{hypothesis} \mid \text{data})&=\frac{P(\text{data} \mid \text{hypothesis})\times P(\text{hypothesis})}{P(\text{data})} \\
&\propto P(\text{data} \mid \text{hypothesis})\times P(\text{hypothesis}),\end{aligned}\]
in which the terms on the right might be easier to compute than the term
on the left.

#### Bayesian Vernacular

In Bayes’ Theorem:

- \(P(\text{hypothesis})\) is the probability of the hypothesis being true prior to the experiment (called the **prior**);
- \(P(\text{hypothesis} \mid \text{data})\) is the probability of the hypothesis being true once the experimental data is taken into account (called the **posterior**);
- \(P(\text{data} \mid \text{hypothesis})\) is the probability of the experimental data being observed assuming that the hypothesis is true (called the **likelihood**).

The theorem is often presented as
\(\text{posterior} \propto \text{likelihood} \times \text{prior}\), which
is to say, **beliefs should be updated in the presence of new
information**.

#### Formulations

If \(A,B\) are events for which \(P(A)P(B)>0\), then Bayes’ Theorem can be
re-written, using the law of total probability, as
\[P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)}=\frac{P(B \mid A)P(A)}{P(B \mid A)P(A)+P(B \mid \overline{A})P(\overline{A})},\]
or, in the general case where \(A_1,\ldots,A_k\) are **mutually exclusive**
and **exhaustive** events, for any event \(B\) and for each
\(1\leq i\leq k\), \[\begin{aligned}
P(A_i \mid B)&= \frac{P(B \mid A_i)P(A_i)}{P(B)}\\ &=\frac{P(B \mid A_i)P(A_i)}{P(B \mid A_1)P(A_1
)+\cdots+P(B \mid A_k)P(A_k)}.\end{aligned}\]

**Examples:**

In 1999, Nissan sold three car models in North America: Sentra (S), Maxima (M), and Pathfinder (PA). Of the vehicles sold that year, \(50\%\) were S, \(30\%\) were M and \(20\%\) were PA. In the same year \(12\%\) of the S, \(15\%\) of the M, and \(25\%\) of the PA had a particular defect \(D\).

If you own a 1999 Nissan, what is the probability that it has the defect?

**Answer:** in the language of conditional probability, \[\begin{aligned} P(\text{S})&=0.5, \ P(\text{M})=0.3, \ P(\text{PA})=0.2, \\ P(D \mid \text{S})&=0.12, \ P(D \mid \text{M})=0.15, \ P(D \mid \text{PA})=0.25, \end{aligned}\] so that \[\begin{aligned} P(&D)=P(D \mid \text{S})\times P(\text{S})+P(D \mid \text{M})\times P(\text{M})\\&+P(D \mid \text{PA})\times P(\text{PA}) \\ &=0.12\cdot 0.5+0.15\cdot 0.3+0.25\cdot 0.2\\&= 0.155 = 15.5\%.\end{aligned}\]

If a 1999 Nissan has defect \(D\), what model is it likely to be?

**Answer:** in the first part we computed the total probability \(P(D)\); in this part, we compare the posterior probabilities \(P(\text{S} \mid D)\), \(P(\text{M} \mid D)\), and \(P(\text{PA} \mid D)\) (and not the priors!), computed using Bayes’ Theorem: \[\begin{aligned} P(\text{S} \mid D)&=\textstyle{\frac{P(D \mid \text{S})P(\text{S})}{P(D)}=\frac{0.12\times 0.5}{0.155}\approx 38.7\%} \\ P(\text{M} \mid D)&=\textstyle{\frac{P(D \mid \text{M})P(\text{M})}{P(D)}=\frac{0.15\times 0.3}{0.155}\approx 29.0\%} \\ P(\text{PA} \mid D)&=\textstyle{\frac{P(D \mid \text{PA})P(\text{PA})}{P(D)}=\frac{0.25\times 0.2}{0.155}\approx 32.3\%} \end{aligned}\] Even though Sentras are the least likely to have the defect \(D\), their overall prevalence in the population carries them over the hump.
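Both parts of the Nissan example fit in a few lines of code (the dictionary names are ours):

```python
# Nissan defect example: law of total probability, then Bayes' theorem
priors = {"S": 0.5, "M": 0.3, "PA": 0.2}         # model shares
p_d_given = {"S": 0.12, "M": 0.15, "PA": 0.25}   # per-model defect rates

p_d = sum(p_d_given[m] * priors[m] for m in priors)            # P(D)
posteriors = {m: p_d_given[m] * priors[m] / p_d for m in priors}
most_likely = max(posteriors, key=posteriors.get)
```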

Suppose that a test for a particular disease has a very high success rate. If a patient

- has the disease, the test reports a ‘positive’ with probability 0.99;
- does not have the disease, the test reports a ‘negative’ with probability 0.95.

Assume that only 0.1% of the population has the disease. What is the probability that a patient who tests positive does not have the disease?

**Answer:** let \(D\) be the event that the patient has the disease, and \(A\) be the event that the test is positive. The probability of a true positive is \[\begin{aligned} P(D \mid A)&=\frac{P(A \mid D)P(D)}{P(A \mid D)P(D)+P(A \mid D^c)P(D^c)}\\ &=\frac{0.99\times 0.001}{0.99\times 0.001 + 0.05\times 0.999}\approx 0.019.\end{aligned}\] The probability that a positive test is a false positive is thus \(1-0.019\approx 0.981\). Despite the apparent high accuracy of the test, the incidence of the disease is so low (\(1\) in a \(1000\)) that the vast majority of patients who test positive (\(98\) in \(100\)) do not have the disease. The \(2\) in \(100\) who are true positives still represent \(20\) times the proportion of positives found in the population (before the outcome of the test is known).

It is important to remember that when dealing with probabilities, **both** the likelihood and the prevalence have to be taken into account.
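The arithmetic behind the \(0.019\) figure is a one-liner once the pieces are named (the variable names below are ours):

```python
# P(disease | positive test) via Bayes' theorem
p_disease = 0.001          # prevalence: 1 in 1000
p_pos_given_d = 0.99       # P(positive | disease)
p_neg_given_healthy = 0.95 # P(negative | no disease)

# law of total probability for the denominator P(positive)
p_pos = (p_pos_given_d * p_disease
         + (1 - p_neg_given_healthy) * (1 - p_disease))
p_d_given_pos = p_pos_given_d * p_disease / p_pos
```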

(**Monty Hall Problem**) On a game show, you are given the choice of three doors. Behind one of the doors is a prize; behind the others, dirty and smelly rubbish bins (as is skillfully rendered in Figure 3.4). You pick a door, say No. 1, and the host, who knows what is behind the doors, opens another door, say No. 3, behind which is a bin. She then says to you, “Do you want to switch from door No. 1 to No. 2?”

Is it to your advantage to do so?

**Answer:** in what follows, let \(\text{S}\) and \(\text{D}\) be the events that switching to another door is a successful strategy and that the prize is behind the original door, respectively.

Let’s first assume that the host opens no door. What is the probability that switching to another door in this scenario would prove to be a successful strategy?

If the prize is behind the original door, switching would succeed 0% of the time: \[P(\text{S} \mid \text{D})=0.\] Note that the prior is \(P(\text{D})=1/3\).

If the prize is not behind the original door, switching would succeed 50% of the time: \[P(\text{S} \mid \text{D}^c)=1/2.\] Note that the prior is \(P(\text{D}^c)=2/3\). Thus, \[\begin{aligned} P(&\text{S}) = P(\text{S} \mid \text{D}) P(\text{D}) +P(\text{S} \mid \text{D}^c)P(\text{D}^c) \\ & =0\cdot \frac{1}{3}+\frac{1}{2}\cdot \frac{2}{3} \approx 33\%. \quad\end{aligned}\]

Now let’s assume that the host opens one of the other two doors to show a rubbish bin. What is the probability that switching to another door in this scenario would prove to be a successful strategy?

If the prize is behind the original door, switching would succeed 0% of the time: \[P(\text{S} \mid \text{D})=0.\] Note that the prior is \(P(\text{D})=1/3\).

If the prize is not behind the original door, switching would succeed 100% of the time: \[P(\text{S} \mid \text{D}^c)=1.\] Note that the prior is \(P(\text{D}^c)=2/3\). Thus, \[\begin{aligned} P(&\text{S}) = P(\text{S} \mid \text{D}) P(\text{D}) +P(\text{S} \mid \text{D}^c)P(\text{D}^c) \\&=0\cdot \frac{1}{3}+1\cdot \frac{2}{3} \approx 67\%. \quad\end{aligned}\]
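The two applications of the Law of Total Probability above reduce to simple arithmetic, which can be checked directly (a quick sketch; \(S\) and \(D\) are as defined in the answer):

```r
p.D <- 1/3   # prior: the prize is behind the original door

# no door opened: switching succeeds half the time when the original guess is wrong
p.S.no.door <- 0 * p.D + (1/2) * (1 - p.D)

# a bin door opened: switching always succeeds when the original guess is wrong
p.S.door <- 0 * p.D + 1 * (1 - p.D)

c(p.S.no.door, p.S.door)   # 1/3 and 2/3
```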

If no door is opened, switching is not a winning strategy, resulting in success only 33% of the time. If a door is opened, however, switching becomes the winning strategy, resulting in success 67% of the time.

This problem has attracted a lot of attention over the years due to its counter-intuitive result, but there is no paradox when we understand conditional probabilities.

The following simulation may make the result easier to see.

Perhaps you would rather see what happens in practice: if you could pit two players against one another (one who never switches and one who always does so) in a series of Monty Hall games, which one would come out on top in the long run?

We start by setting a number of games \(N\) (not too small, or we won’t be able to observe long-run behaviour) and a replicability seed (so that we may all obtain the same results).

Next, for each game, we place the prize behind one of the 3 doors: A, B, or C.
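The code for these two steps does not appear above; the following sketch is consistent with the output below (\(N=500\) is inferred from the tallies of 333 and 167 reported later; the original seed is unknown, so the exact proportions will differ):

```r
N <- 500      # number of games (assumed from the counts reported later)
set.seed(0)   # replicability seed (the original seed is unknown)

# place the prize behind one of the 3 doors, uniformly at random, for each game
locations <- sample(c("A","B","C"), N, replace = TRUE)
prop.table(table(locations))   # each door should get roughly 1/3 of the prizes
```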

We verify that the prize gets placed behind each door roughly 33% of the time:

```
locations
A B C
0.302 0.344 0.354
```

Let us now obtain a player’s first guess for each game; this guess is completely independent of the actual prize location:

Finally, we create a data frame telling the analyst where the prize actually is, and what door the player has selected as their original guess.
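These two steps might look as follows (a sketch; the setup is repeated so the chunk stands alone, and with a different seed the rows will differ from those displayed below):

```r
set.seed(0)   # (setup repeated from the earlier step; original seed unknown)
N <- 500
locations <- sample(c("A","B","C"), N, replace = TRUE)

# the player's first guess is drawn independently of the prize location
player.guesses <- sample(c("A","B","C"), N, replace = TRUE)

games <- data.frame(locations, player.guesses, stringsAsFactors = FALSE)
head(games)
```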

```
locations player.guesses
1 B B
2 B B
3 A B
4 C C
5 A C
6 A A
```

In this example, how often had the player guessed correctly, before a door was opened and they were given a chance to switch?
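A tally such as the following would answer the question (a sketch; exact counts depend on the seed used):

```r
# rebuild the simulated games (repeated so the chunk stands alone)
set.seed(0); N <- 500
games <- data.frame(
  locations      = sample(c("A","B","C"), N, replace = TRUE),
  player.guesses = sample(c("A","B","C"), N, replace = TRUE),
  stringsAsFactors = FALSE
)

# how often was the original guess already correct?
table(games$player.guesses == games$locations)
```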

```
FALSE TRUE
333 167
```

Is this surprising? (We would expect the original guess to be correct about \(1/3\) of the time, and \(167/500\approx 0.33\).)

We now initialize the process to find out which door the host opens. For each game, the host opens a door which is not the one selected by the player, nor the one behind which the prize is found.

```
games$open.door <- NA
for(j in 1:N){
  # the host opens a door that neither hides the prize nor was picked by the player
  games$open.door[j] <- sample(setdiff(c("A","B","C"),
                               union(games$locations[j], games$player.guesses[j])), 1)
}
head(games)
```

```
locations player.guesses open.door
1 B B C
2 B B C
3 A B C
4 C C A
5 A C B
6 A A B
```

The `union()` call enumerates the doors that the host cannot open; the `setdiff()` call finds the complement of that set (i.e., the doors that she can open); and the `sample()` call picks one of those doors.

If the player never switches, they win whenever they had originally guessed the location of the prize correctly:
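In code, this could be recorded as a logical column (a sketch; the setup is repeated so the chunk stands alone):

```r
# rebuild the simulated games (repeated from the earlier steps)
set.seed(0); N <- 500
games <- data.frame(
  locations      = sample(c("A","B","C"), N, replace = TRUE),
  player.guesses = sample(c("A","B","C"), N, replace = TRUE),
  stringsAsFactors = FALSE
)

# a non-switching player wins exactly when the original guess was right
games$no.switch.win <- games$player.guesses == games$locations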

Let’s find which door the player would have selected if they always switched (the door that is neither the one opened by the host nor the one they had originally selected):

```
games$switch.door <- NA
for(j in 1:N){
  # the switch target is the one remaining door
  games$switch.door[j] <- sample(setdiff(c("A","B","C"),
                                 union(games$open.door[j], games$player.guesses[j])), 1)
}
```

If the player always switches, they win whenever their switched guess is where the prize is located:
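This could be recorded as follows (a sketch; the whole pipeline is rebuilt so the chunk stands alone, and the column order may differ from the display below):

```r
# rebuild the full simulation (repeated from the earlier steps)
set.seed(0); N <- 500
games <- data.frame(
  locations      = sample(c("A","B","C"), N, replace = TRUE),
  player.guesses = sample(c("A","B","C"), N, replace = TRUE),
  stringsAsFactors = FALSE
)
games$no.switch.win <- games$player.guesses == games$locations
games$open.door <- NA
games$switch.door <- NA
for(j in 1:N){
  games$open.door[j]   <- sample(setdiff(c("A","B","C"),
                          union(games$locations[j], games$player.guesses[j])), 1)
  games$switch.door[j] <- sample(setdiff(c("A","B","C"),
                          union(games$open.door[j], games$player.guesses[j])), 1)
}

# a switching player wins exactly when the switched guess hides the prize
games$switch.win <- games$switch.door == games$locations
head(games)
```

Note that, by construction, `switch.win` is always the negation of `no.switch.win`: switching wins precisely when the original guess was wrong.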

```
locations player.guesses open.door no.switch.win switch.door switch.win
1 B B C TRUE A FALSE
2 B B C TRUE A FALSE
3 A B C FALSE A TRUE
4 C C A TRUE B FALSE
5 A C B FALSE A TRUE
6 A A B TRUE C FALSE
```

The chances of winning by not switching are thus:

```
FALSE TRUE
0.666 0.334
```

while the chances of winning by switching are:

```
FALSE TRUE
0.334 0.666
```

Pretty wild, eh? Switching IS the better strategy.
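The proportions above presumably come from calls such as `prop.table(table(games$no.switch.win))` and `prop.table(table(games$switch.win))`. The whole experiment can also be condensed into a short self-contained script; with any seed and a large \(N\), the long-run win rates land near \(1/3\) without switching and \(2/3\) with switching (a sketch, not the original code):

```r
set.seed(2020)
N <- 10000
locations <- sample(c("A","B","C"), N, replace = TRUE)
guesses   <- sample(c("A","B","C"), N, replace = TRUE)

switch.win <- logical(N)
for(j in 1:N){
  # host opens a non-prize, non-chosen door; the player switches to the remaining one
  open.door   <- sample(setdiff(c("A","B","C"), union(locations[j], guesses[j])), 1)
  switch.door <- setdiff(c("A","B","C"), c(open.door, guesses[j]))
  switch.win[j] <- switch.door == locations[j]
}

mean(!switch.win)   # staying wins roughly 1/3 of the time
mean(switch.win)    # switching wins roughly 2/3 of the time
```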
