18.1 Plausible Reasoning

“A decision was wise, even though it led to disastrous consequences, if the evidence at hand indicated it was the best one to make; and a decision was foolish, even though it led to the happiest possible consequences, if it was unreasonable to expect those consequences.” Herodotus, in Antiquity

Consider the following scenario [27]: while walking down a deserted street at night, you hear a security alarm, look across the street, and see a store with a broken window, from which a person wearing a mask crawls out with a bag full of smart phones.

The natural reaction might be to conclude that the person crawling out of the store is stealing merchandise from the store.

It might be the natural reaction, but how do we actually come to this conclusion? It cannot come from a logical deduction based on evidence.

Indeed, the person crawling out of the store could have been its owner who, upon returning from a costume party, realized that they had misplaced their keys just as a passing truck threw a brick through the store window, triggering the security alarm. Perhaps the owner then went into the store to retrieve items before they could be stolen, which is when you happened upon the scene.

But while the original reasoning process is not deductive, it is at least plausible, which in the logical context is called inductive.

Figure 18.1: Deductive (left) vs. inductive (right) syllogisms.

We might also want to use a weaker version of inductive reasoning: let us say that we know that when \(A\) is true, then \(B\) is more plausible, and we also know that \(B\) is true. Then, we conclude that \(A\) is more plausible.

In the scenario described above, if “the person is a thief” (\(A\) is true), you would not be surprised to “see them crawling out of the store with a bag of phones” (\(B\) is plausible). As you do “see them crawling out of the store with a bag of phones” (\(B\) is true), you would therefore not be surprised to find out that “the person is a thief” (\(A\) is plausible).

In deductive reasoning, we work from a cause to possible effects/consequences; in inductive reasoning, we work from observations to possible causes.

Figure 18.2: Main difference between deductive (left) vs. inductive (right) reasoning.

Plausibility relies on the notion of “surprise”. In Tom Stoppard’s 1966 play Rosencrantz and Guildenstern are Dead, Rosencrantz flips 92 heads in a row. This result is of course not impossible, but is it plausible? If this happened to you, what would you conclude?
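To put a number on the "surprise", here is a minimal sketch that computes the probability of 92 heads in a row. It assumes a fair coin and independent flips (both are assumptions, not claims made by the play), and the function name `prob_all_heads` is introduced here purely for illustration.

```python
from fractions import Fraction

def prob_all_heads(n_flips, p_heads=Fraction(1, 2)):
    """Probability of n_flips heads in a row, assuming independent flips."""
    return p_heads ** n_flips

# Assumption: the coin is fair, so p_heads = 1/2.
p_92 = prob_all_heads(92)
print(f"P(92 heads in a row | fair coin) = {float(p_92):.3e}")  # roughly 2e-28
```

A probability this small does not make the event impossible, but it makes the "fair coin" assumption itself look suspect.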

18.1.1 Rules of Probability

Inductive reasoning requires methods to evaluate the validity of various propositions.

In 1763, Thomas Bayes’ paper [361] on the problem of induction (that is, on arguing from the specific to the general) was published posthumously. In modern language and notation, Bayes wanted to use binomial data comprising \(r\) successes out of \(n\) attempts to learn about the underlying chance \(\boldsymbol{\theta}\) of each attempt succeeding. Bayes’ key contribution was to use a probability distribution to represent uncertainty about \(\boldsymbol{\theta}\). This distribution represents epistemological uncertainty, due to lack of knowledge about the world, rather than aleatory (random) probability arising from the essential unpredictability of future events, as may be familiar from games of chance.

In this framework, a probability (plausibility) represents a ‘degree-of-belief’ about a proposition; it is possible that the probability of an event will be recorded differently by two different observers, based on the respective background information to which they have access. This Bayesian position was the commonplace view of probabilities in the late 1700s and early 1800s, a view shared by such luminaries as Bernoulli and Laplace.

Subsequent scholars found this vague and subjective (how can you be sure that my degree-of-belief matches yours?), and they redefined the probability of an event as its long-run relative frequency over infinitely many repeated trials (the so-called frequentist position).

A forecast calling for rain with 90% probability doesn’t mean the same thing to Bayesians and frequentists:

  • in the Bayesian framework, this means that the forecaster is 90% certain that it will rain;

  • in the frequentist framework, this means that, historically, it rained in 90% of the cases when the conditions were as they currently are.

The Bayesian framework is more aligned with how humans understand probabilities (92 heads in a row probably means that the coin is biased, right?), but how can we be certain that the degree-of-belief is a well-defined concept?

As it happens, there is a well-defined way to determine the rules of probability, based on a small list of axioms [27], [362]:

  1. if a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result;

  2. all (known) evidence relevant to a question must be taken into consideration;

  3. equivalent states of knowledge must be assigned the same probabilities;

  4. if we specify how much we believe something is true, we have implicitly specified how much we believe it’s false, and

  5. if we have specified our degree-of-belief in a first proposition, and then our degree-of-belief in a second proposition if we assume the first one is true, then we have implicitly specified our simultaneous degree-of-belief in both propositions being true.

In what follows, we let \(I\) denote relevant background information; \(X\), \(Y\), and \(Y_k\) denote various propositions, and \(-X\) or \(\overline{X}\) denote the negation of proposition \(X\).

The plausibility of \(X\) given \(I\) is denoted by \(P(X\mid I)\); it is a real number whose value can range from 0 (false) to 1 (true). The rules of probability are quite simple:

  • Sum Rule: for all propositions \(X\), \(P(X\mid I)+P(-X\mid I)=1\);

  • Product Rule: for all propositions \(X\), \(Y\), \(P(X,Y\mid I)=P(X\mid Y;I)\times P(Y\mid I)\).

From these two rules, we can also derive two useful corollaries:

  • Bayes’ Theorem: \(P(X\mid Y;I)\times P(Y\mid I)=P(Y\mid X;I)\times P(X\mid I)\) (see next section);

  • Marginalization Rule: \(P(X\mid I)=\sum_{k}P(X,Y_k\mid I)\), where \(\{Y_k\}\) are exhaustive and disjoint (which is to say, \(\sum_kP(Y_k\mid I)=1\) and \(P(Y_j,Y_k\mid I)=0\) for all \(j\neq k\)).

For continuous variables, the marginalization rule becomes \[P(X\mid I)=\int P(X,Y\mid I) dY.\]
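The rules above can be checked numerically on a small discrete example. The sketch below uses a made-up joint table \(P(X,Y_k\mid I)\) for a binary proposition \(X\) and three exhaustive, disjoint propositions \(Y_0, Y_1, Y_2\); the numbers and helper names are assumptions chosen for illustration. Since the conditional plausibilities are computed from the joint table, the product-rule check is a consistency check rather than a proof.

```python
from fractions import Fraction as F

# Hypothetical joint plausibilities P(X, Y_k | I) (made-up numbers summing to 1).
joint = {
    (True, 0): F(1, 10), (True, 1): F(2, 10), (True, 2): F(1, 10),
    (False, 0): F(3, 10), (False, 1): F(1, 10), (False, 2): F(2, 10),
}

def p_x(x):
    """Marginalization rule: P(X | I) = sum_k P(X, Y_k | I)."""
    return sum(p for (xv, _), p in joint.items() if xv == x)

def p_y(k):
    """Marginal P(Y_k | I), summing over X and its negation."""
    return sum(p for (_, kv), p in joint.items() if kv == k)

def p_x_given_y(x, k):
    """Conditional P(X | Y_k; I)."""
    return joint[(x, k)] / p_y(k)

def p_y_given_x(k, x):
    """Conditional P(Y_k | X; I)."""
    return joint[(x, k)] / p_x(x)

# Sum rule: P(X | I) + P(-X | I) = 1.
sum_rule_holds = p_x(True) + p_x(False) == 1

# Product rule: P(X, Y | I) = P(X | Y; I) * P(Y | I).
product_rule_holds = all(
    joint[(x, k)] == p_x_given_y(x, k) * p_y(k) for (x, k) in joint
)

# Bayes' Theorem: P(X | Y; I) * P(Y | I) = P(Y | X; I) * P(X | I).
bayes_holds = all(
    p_x_given_y(x, k) * p_y(k) == p_y_given_x(k, x) * p_x(x) for (x, k) in joint
)

print(sum_rule_holds, product_rule_holds, bayes_holds)
```

Exact fractions are used instead of floats so that the equalities can be tested without rounding tolerances.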

Conditional Probabilities

A conditional probability is the probability of an event taking place given that another event occurred.

The conditional probability of \(A\) given \(B\), \(P(A\mid B;I)\), is defined as \[ P(A\mid B;I) = \frac{P(A, B\mid I)}{P(B\mid I)}=\frac{P(A\cap B\mid I)}{P(B\mid I)}.\] The probability that two events \(A\) and \(B\) both occur is obtained by applying the product rule: \[P(A,B\mid I)=P(B \mid I)\times P(A\mid B;I)=P(A\mid I) \times P(B\mid A;I),\] which we recognize as Bayes’ Theorem.

Classical Example: a family has two puppies that are not twins. What is the probability that the youngest puppy is female given that at least one of the puppies is female? Assume that male and female puppies are equally likely to be born.

Solution: our answer to this question follows a frequentist approach: we list the equally likely outcomes and identify the successful ones. Writing each outcome with the older puppy first and the youngest puppy second, there are four possibilities: \[\mathcal{U}\mid I=\{MM,MF,FM,FF\}. \]

Let \(A\) and \(B\) be the events that the youngest puppy is female and that at least one puppy is female, respectively; then \[A\mid I=\{FF,MF\}\quad \text{and}\quad B\mid I=\{FF,MF,FM\}, \] and, since \(A\cap B=A\), \[P(A\mid B;I) =\frac{P(A\cap B\mid I)}{P(B\mid I)}=\frac{2/4}{3/4}=2/3\] (and not \(1/2\), as one might naively assume).
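The same count can be reproduced by brute-force enumeration. This is a minimal sketch under the assumptions of the example (two puppies, each independently male or female with equal probability); the helper names are introduced here for illustration only.

```python
from fractions import Fraction
from itertools import product

# Each outcome is a pair (older, younger); the four outcomes are equally likely.
outcomes = list(product("MF", repeat=2))

def prob(event):
    """Probability of an event as (favourable outcomes) / (all outcomes)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def youngest_is_female(o):
    return o[1] == "F"

def at_least_one_female(o):
    return "F" in o

p_b = prob(at_least_one_female)
p_a_and_b = prob(lambda o: youngest_is_female(o) and at_least_one_female(o))

# Definition of conditional probability: P(A | B) = P(A and B) / P(B).
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 2/3
```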

18.1.2 Bayes’ Theorem

Bayes’ Theorem provides an expression for the conditional probability of \(A\) given \(B\), that is: \[P(A\mid B;I) = \frac{P(B\mid A;I) \times P(A\mid I)}{P(B\mid I)}=\frac{P(B\mid A;I) \times P(A\mid I)}{P(B\mid A;I) \times P(A\mid I)+P(B\mid -A;I) \times P(-A\mid I)},\] where the last equality is a direct application of the Law of Total Probability.

Bayes’ Theorem can be thought of as a way of coherently updating our uncertainty in the light of new evidence. The use of a probability distribution as a ‘language’ to express our uncertainty is not an arbitrary choice: it can in fact be determined from deeper principles of logical reasoning or rational behaviour.

Example: consider a medical clinic (in what follows, we drop the explicit dependence on \(I\) to lighten the notation, but it is important to remember that it is there nonetheless).

  • \(A\) could represent the event “Patient has liver disease.” Past data suggests that 10% of patients entering the clinic have liver disease: \(P(A) = 0.10\).

  • \(B\) could represent the litmus test “Patient is alcoholic.” Perhaps 5% of the clinic’s patients are alcoholics: \(P(B) = 0.05\).

  • \(B\mid A\) could represent the scenario that a patient is alcoholic, given that they have liver disease: perhaps we have \(P(B\mid A) = 0.07\), say.

According to Bayes’ Theorem, then, the probability that a patient has liver disease assuming that they are alcoholic is \[P(A\mid B) = \frac{0.07 \times 0.10}{0.05} = 0.14.\] While this is a (large) increase over the original 10% suggested by past data, it remains unlikely that any particular patient has liver disease.
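The clinic computation is short enough to script. The sketch below uses only the three numbers given in the example; the variable names are illustrative, and the value of \(P(B\mid -A)\) is not stated in the example but is backed out from the Law of Total Probability so that the expanded form of Bayes’ Theorem can be checked against the short form.

```python
p_a = 0.10          # P(A): patient has liver disease
p_b = 0.05          # P(B): patient is alcoholic
p_b_given_a = 0.07  # P(B | A): alcoholic, given liver disease

# Short form: P(A | B) = P(B | A) * P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b

# Derived (not given in the example): P(B | -A) from
# P(B) = P(B | A) P(A) + P(B | -A) P(-A).
p_b_given_not_a = (p_b - p_b_given_a * p_a) / (1 - p_a)

# Expanded form, with the denominator written out via total probability.
p_a_given_b_expanded = (p_b_given_a * p_a) / (
    p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
)

print(round(p_a_given_b, 4), round(p_a_given_b_expanded, 4))  # 0.14 0.14
```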

Bayes’ Theorem with Multiple Events

Let \(D\) represent some observed data and let \(A\), \(B\), and \(C\) be mutually exclusive (and exhaustive) events conditional on \(D\). Note that \[\begin{aligned} P( D )&= P( A \cap D ) + P( B \cap D )+P(C \cap D) \\ &= P(D\mid A) P(A) + P(D\mid B) P(B) + P(D\mid C) P(C). \end{aligned}\] According to Bayes’ Theorem, \[\begin{aligned} P( A\mid D )&= \frac{P(D\mid A) P(A)}{P(D)} \\ &= \frac{P(D\mid A) P(A)}{P(D\mid A) P(A) + P(D\mid B) P(B) + P(D\mid C) P(C)}. \end{aligned}\] In general, if there are \(n\) exhaustive and mutually exclusive outcomes \(A_{1},\ldots, A_{n}\), we have, for any \(i\in\left\{1,\ldots, n\right\}\): \[P(A_{i}\mid D) = \frac{P(A_{i}) P(D\mid A_{i})}{\sum^{n}_{k=1} P(A_{k}) P(D\mid A_{k})}.\] The denominator is simply \(P(D)\), the marginal distribution of the data.

Note that, if the values of \(A_{i}\) are portions of the continuous real line, the sum may be replaced by an integral.

Example: In the 1996 General Social Survey, for males (age 30+):

  • 11% of those in the lowest income quartile were college graduates.

  • 19% of those in the second-lowest income quartile were college graduates.

  • 31% of those in the third-lowest income quartile were college graduates.

  • 53% of those in the highest income quartile were college graduates.

What is the probability that a college graduate falls in the lowest income quartile?

Solution: let \(Q_{i}\), \(i =1, 2, 3, 4\), represent the income quartiles (so that \(P(Q_{i}) =0.25\)) and \(D\) represent the event that a male over 30 is a college graduate. Then \[P( Q_{1}\mid D )= \frac{P(D\mid Q_{1}) P(Q_{1})}{\sum^{4}_{k=1} P(Q_{k}) P(D\mid Q_{k})}= \frac{(0.11)(0.25)}{(0.11+0.19+0.31+0.53)(0.25)} = \frac{0.11}{1.14}\approx 0.096.\]
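The multiple-event form of Bayes’ Theorem translates directly into a few lines of code. This is a minimal sketch: the function name `posterior_over_hypotheses` is an illustrative choice, and the inputs are the quartile figures quoted in the example.

```python
def posterior_over_hypotheses(priors, likelihoods):
    """Return (posteriors, evidence) for exhaustive, mutually exclusive hypotheses.

    priors[k]      = P(A_k)
    likelihoods[k] = P(D | A_k)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A_k) P(D | A_k)
    evidence = sum(joint)                                  # P(D)
    return [j / evidence for j in joint], evidence

# Income quartiles: P(Q_k) = 0.25 each; P(D | Q_k) from the survey figures.
priors = [0.25, 0.25, 0.25, 0.25]
likelihoods = [0.11, 0.19, 0.31, 0.53]

posteriors, p_graduate = posterior_over_hypotheses(priors, likelihoods)
print([round(p, 3) for p in posteriors])  # first entry is about 0.096
print(round(p_graduate, 3))               # P(D) = 0.285
```

Because the priors are all equal, they cancel, and each posterior is just the corresponding likelihood divided by the sum of the likelihoods.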

18.1.3 Bayesian Inference Basics

Bayesian statistical methods start with existing prior beliefs, and update these using data to provide posterior beliefs, which may be used as the basis for inferential decisions: \[\large \underbrace{P( \boldsymbol{\theta} \mid D )}_{\text{posterior}} = \underbrace{P(\boldsymbol{\theta})}_{\text{prior}} \times \underbrace{P(D\mid \boldsymbol{\theta})}_{\text{likelihood}} /\underbrace{P(D)}_{\text{evidence}},\] where the evidence is \[P(D) = \int P(D\mid \boldsymbol{\theta}) P(\boldsymbol{\theta}) d\boldsymbol{\theta} \quad\text{or}\quad P(D)=\sum_{k}P(D \mid A_k)P(A_k), \] where \(\{A_k\}\) is mutually exclusive and exhaustive.

In the vernacular of Bayesian data analysis (BDA),

  • the prior, \(P(\boldsymbol{\theta})\), represents the strength of the belief in \(\boldsymbol{\theta}\) without taking the observed data \(D\) into account;

  • the posterior, \(P( \boldsymbol{\theta} \mid D )\), represents the strength of our belief in \(\boldsymbol{\theta}\) when the observed data \(D\) is taken into account;

  • the likelihood, \(P(D\mid \boldsymbol{\theta})\), is the probability that the observed data \(D\) would be generated by the model with parameter values \(\boldsymbol{\theta}\), and

  • the evidence, \(P(D)\), is the probability of observing the data \(D\) according to the model, determined by summing (or integrating) across all possible parameter values and weighted by the strength of belief in those parameter values.
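The four terms can be made concrete with a grid approximation for binomial data (\(r\) successes in \(n\) attempts). The sketch below is one possible illustration, not a prescribed method: the uniform prior, the grid of 101 values of \(\boldsymbol{\theta}\), and the data (\(r=7\), \(n=10\)) are all assumptions made here.

```python
from math import comb

def grid_posterior(r, n, grid_size=101):
    """Grid approximation of P(theta | D) for r successes in n attempts."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Prior: uniform over the grid (an assumption made for this sketch).
    prior = [1 / grid_size] * grid_size
    # Likelihood: binomial probability of the data for each theta.
    likelihood = [comb(n, r) * t**r * (1 - t) ** (n - r) for t in thetas]
    # Evidence: likelihood summed over the grid, weighted by the prior.
    evidence = sum(l * p for l, p in zip(likelihood, prior))
    # Posterior: prior times likelihood, divided by the evidence.
    posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
    return thetas, posterior, evidence

# Hypothetical data: 7 successes in 10 attempts.
thetas, posterior, evidence = grid_posterior(r=7, n=10)
theta_map = thetas[posterior.index(max(posterior))]
print(theta_map)  # 0.7: with a uniform prior, the posterior peaks at r/n
```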

Central Data Analysis Question

Bayes’ Theorem is an essential component of the scientific method and of knowledge discovery in general. Indeed, assume that an experiment has been conducted to determine the degree of validity of a particular hypothesis, and that corresponding experimental data has been collected.

The central data analysis question is the following: given everything that was known prior to the experiment, does the collected (observed) data support (or invalidate) the hypothesis, or the presence of a certain condition?

The problem is that this is usually impossible to compute directly. Bayes’ Theorem offers a possible solution: \[\begin{aligned} P(\text{hypothesis} \mid \text{data}; I)&=\frac{P(\text{data} \mid \text{hypothesis}; I)\times P(\text{hypothesis}\mid I)}{P(\text{data}\mid I)} \\ &\propto P(\text{data} \mid \text{hypothesis};I)\times P(\text{hypothesis}\mid I);\end{aligned}\] the hope is that the terms on the right might be easier to compute than those on the left.

The theorem is often presented as \[\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \propto \text{likelihood} \times \text{prior},\] which is to say that beliefs should be updated in the presence of new information.

Example: “consider a somber example: the September 11 attacks. Most of us would have assigned almost no probability to terrorists crashing planes into buildings in Manhattan when we woke up that morning. But we recognized that a terror attack was an obvious possibility once the first plane hit the World Trade Center. And we had no doubt we were being attacked once the second tower was hit. Bayes’ Theorem can replicate this result.” [363]

Let \(A\) represent the proposition that a plane crashes into a Manhattan skyscraper. Let \(B\) represent the proposition that terrorists attack Manhattan skyscrapers; before 2001, most people would only have assigned a minuscule probability to such an event, say \(P(B\mid I)=0.005\%\). There had been two incidents of planes crashing into Manhattan skyscrapers in the 25,000 days before September 11, 2001, so we might assign \(P(A\mid -B;I)=2/25{,}000=0.008\%\).

We could also assign a fairly high probability of a plane hitting a Manhattan skyscraper if terrorists were attacking said skyscrapers, say \(P(A\mid B;I)=95\%\).

After one plane hits the World Trade Center, our revised estimate of the probability of a terror attack stands at roughly \(37\%\). If a second plane hits the World Trade Center shortly after the first one, the posterior probability of a terror attack jumps to a whopping \(99.99\%\).
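The two updates can be reproduced with a short script. This is a sketch under stated assumptions: the prior and the two conditional probabilities are the ones given in the example, the function name `update` is illustrative, and the second crash is treated as a fresh piece of evidence with the same conditional probabilities as the first (with the first posterior serving as the new prior).

```python
def update(prior, p_data_given_h, p_data_given_not_h):
    """One application of Bayes' Theorem for a hypothesis and its negation."""
    numerator = p_data_given_h * prior
    evidence = numerator + p_data_given_not_h * (1 - prior)
    return numerator / evidence

prior_attack = 0.00005             # P(B | I) = 0.005%
p_crash_given_attack = 0.95        # P(A | B; I) = 95%
p_crash_given_no_attack = 0.00008  # P(A | -B; I) = 0.008%

# First plane: update the prior.
after_first = update(prior_attack, p_crash_given_attack, p_crash_given_no_attack)

# Second plane: yesterday's posterior is today's prior (assumption: the same
# conditional probabilities apply to the second crash).
after_second = update(after_first, p_crash_given_attack, p_crash_given_no_attack)

print(f"{after_first:.1%}")   # about 37%
print(f"{after_second:.2%}")  # about 99.99%
```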

Determining an appropriate prior is a source of considerable controversy. Conservative estimates (uninformative priors) often lead to reasonable results; in the absence of relevant information, maximum entropy priors are often suggested (we shall discuss those again in Section 18.3.4).

The evidence is harder to compute on theoretical grounds – evaluating the probability of observing data requires access to some model as part of \(I\). Note that either that model was good, so there’s no need for a new hypothesis, or that model was bad, so we dare not trust our computation. Thankfully, the evidence is rarely required in problems of parameter estimation: in a nutshell, prior to the experiment, there are numerous competing hypotheses; while the priors and likelihoods will differ, the evidence will not, so it is not needed to differentiate the various hypotheses.
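To see why the evidence drops out, one can compare two competing hypotheses, written here as \(H_1\) and \(H_2\) (labels introduced only for this illustration), for the same data: \[\frac{P(H_1\mid \text{data};I)}{P(H_2\mid \text{data};I)}=\frac{P(\text{data}\mid H_1;I)\times P(H_1\mid I)\,/\,P(\text{data}\mid I)}{P(\text{data}\mid H_2;I)\times P(H_2\mid I)\,/\,P(\text{data}\mid I)}=\frac{P(\text{data}\mid H_1;I)\times P(H_1\mid I)}{P(\text{data}\mid H_2;I)\times P(H_2\mid I)}.\] The common factor \(P(\text{data}\mid I)\) cancels, so ranking the hypotheses only requires their priors and likelihoods.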

18.1.4 Bayesian Data Analysis

The main characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical data analysis. The process of Bayesian data analysis (BDA) can be idealized by dividing it into the following 3 steps:

  1. Setting up a full probability model (the prior) – a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process (when available).

  2. Conditioning on observed data (new data) – calculating and interpreting the appropriate posterior distribution (i.e., the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data).

  3. Evaluating the fit of the model and the implications of the resulting posterior distribution (the posterior) – how well does the model fit the data? are the substantive conclusions reasonable? how sensitive are the results to the modeling assumptions made in step 1? Depending on the responses, one can alter or expand the model and repeat the 3 steps.

The essence of Bayesian methods consists in identifying the prior beliefs about what results are likely, and then updating those according to the collected data.

For example, if the current success rate of a gambling strategy is 5%, we may say that it’s reasonably likely that a small strategy modification could further improve that rate by 5 percentage points, but that it is most likely that the change will have little effect, and that it is entirely unlikely that the success rate would shoot up to 30% (after all, it is only a small modification).

As the data start coming in, we start updating our beliefs. If the incoming data points to an improvement in the success rate, we start moving our prior estimate of the effect upwards; the more data we collect, the more confident we are in the estimate of the effect and the further we can leave the prior behind.

The end result is called the posterior – a probability distribution describing the likely effect of the strategy.
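The updating process can be sketched with a conjugate Beta prior on the success rate. Note the assumptions: the text does not prescribe this model, the prior Beta(5, 95) (mean 5%) and the observed counts are made-up illustrations, and the sketch models the success rate itself rather than the size of the improvement.

```python
def beta_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data (conjugacy)."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical prior belief: success rate around 5%.
alpha0, beta0 = 5, 95
prior_mean = beta_mean(alpha0, beta0)

# Hypothetical small data set: 12 successes in 100 attempts.
alpha1, beta1 = beta_update(alpha0, beta0, successes=12, failures=88)
mean_small = beta_mean(alpha1, beta1)

# Hypothetical larger data set with the same observed rate: 120 in 1000.
alpha2, beta2 = beta_update(alpha0, beta0, successes=120, failures=880)
mean_large = beta_mean(alpha2, beta2)

print(prior_mean, mean_small, round(mean_large, 4))  # 0.05 0.085 0.1136
```

With little data the posterior mean stays close to the prior; with more data at the same observed rate (12%), it moves further toward what was observed.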


[27] E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[361] T. Bayes, “An essay towards solving a problem in the doctrine of chances,” Phil. Trans. of the Royal Soc. of London, vol. 53, pp. 370–418, 1763.
[362] R. T. Cox, “Probability, Frequency, and Reasonable Expectation,” American Journal of Physics, vol. 14, no. 1, 1946.
[363] N. Silver, The Signal and the Noise. Penguin, 2012.