## 18.2 Examples

We take a look at three scenarios that will shed some light on the whole Bayesian entreprise:

determining if a coin is fair (or not),

finding a link between demographic information and salary, and

estimating the number of dollar bills in circulation.

These examples will show how priors, likelihood, and posteriors interact.

### 18.2.1 The Mysterious Coin

A mysterious stranger brings back a souvenir coin from a trip to a strange and distant land. They have been flipping it pretty much non-stop since their return. You can see the proportion of heads they obtained for 4, 8, and 16 tosses.

It might seem at first that the coin might be biased, but the proportion of heads seems to inch its way towards 50%. What is truly going on?

#### Priors

Perhaps the coin is not fair, coming as it does from a strange and distant land. Let us denote the coin’s **bias** by \(H\), i.e., the probability of flipping a head on a toss (\(H\approx 0.5\): regular unbiased coins; \(H\approx 0, 1\): highly biased coins). A **prior** for this scenario is a probability density function (pdf) \[P(\text{bias}=H)=P(H\mid I).\] Four such priors are shown below.

Why are we working with functions for the prior, when in the previous example (Sept. 11 attacks), we only provided a number, \(P(B\mid I)=0.005\%\)? In fact, we had provided a (discrete) function as a prior: \[P(B=x\mid I)=\begin{cases} 0.005\% &\text{if }x=\text{TRUE} \\ 99.995\% &\text{if }x=\text{FALSE} \end{cases} \]

#### Likelihood

Let us assume that the coin has been tossed \(N\) times in total, and that \(K\) heads have been recorded. In this scenario, Bayes’ Theorem takes the form:

\[P(\text{bias}=H\mid K \text{ heads}, N \text{ tosses};I)\propto P(K \text{ heads}, N \text{ tosses} \mid \text{bias}=H;I) \times P(\text{bias}=H \mid I).\]

The likelihood is the probability of observing \(K\) heads in \(N\) tosses if the bias is \(H\). If, as part of \(I\), the tosses are independent (i.e., the result of one toss does not affect the others), then the likelihood is given by the binomial distribution \[ P(K \text{ heads}, N \text{ tosses} \mid \text{bias}=H;I) = \binom{N}{K}H^K(1-H)^{N-K} .\]

#### Posteriors

Combining the prior and the likelihood, we get: \[ P(\text{bias}=H \mid K \text{ heads}, N \text{ tosses};I) \propto H^K(1-H)^{N-K}\times P_i(\text{bias}=H\mid I),\] where \(i\) indexes the various prior scenarios described above.

We should thus be able to estimate the bias \(H^*\) by studying the posterior distribution for each of the 4 priors, for various number of throws \(N\).

- With the
**non-committal prior**(blue pdf) \(P_1(\text{bias}=H\mid I) \propto 1\), the posterior is simply proportional to the likelihood; the central limit theorem seems to kick in after \(\approx 30\) tosses.

- With the
**foul play prior**(green pdf), we suspect early on that the bias is smaller than 0.5; the subsequent series of tosses moves the bias to a value \(0.25\leq H^*\leq 0.40\) quickly, as was the case with the non-informative prior. Note the shrinking of the posterior with an increasing number of tosses.

- With the
**regular coin prior**(orange pdf) \(P_1(\text{bias}=H\mid I) \sim \mathcal{N}(0.5,\sigma^2)\), early results do not strongly suggest that the coin is biased (the prior gives little credence to the notion that the bias could lie in \(0.25\leq H^*\leq 0.40\)), but the series of tosses forces the posterior to a biased distribution (note the smoother convergence of the posterior).

- With the
**doubtful prior**(yellow pdf), the competing hypotheses compete before converging to a bias, again in \(0.25\leq H^*\leq 0.40\). The convergence is more haphazard: as soon as one head or one tail is observed, the process nixes the two-sided coin option. Note the slower (and weirder) convergence to a gaussian posterior.

In the fair coin example, it would seem that the choice of a prior does not have much of an effect on the posterior … **given enough data**. This will not always be the case.

### 18.2.2 The Salary Question

Income information has been collected for 4782 individuals, together with demographic details: self-reported gender, age group, and education level (1 for post-secondary degree, 0 otherwise). The table below shows some of the summary statistics for the dataset; the dataset is available in `Salary.xlsx`

.

**Question:** is there a link between demographic information and income? How would we answer this question using classical statistical methods? What if we had reason to suspect that reported incomes follow a (potentially) different distribution for each group? Would that change the approach?

In the Bayesian framework, we would be interested in the posterior distribution \[P(\text{parameters}\mid \text{data};i,I),\quad i=1,\ldots, 12.\] If we assume (for no particular good reason) that the reported incomes are **normally distributed** for each group, then we seek \[P(\mu_i,\sigma_i\mid \text{reported salaries in group $i$};I),\quad i=1,\ldots, 12.\]

#### Priors

Determining a reasonable collection of **priors** \(P(\mu_i,\sigma_i\mid I)\), for \(i=1,\ldots, 12\), is no easy task. One could naively pick a joint distribution which “**peaks**” at the sample mean \(\overline{x}_i\), with standard deviation \(s_i\), for each group \(i\), but there are sampling design issues associated with this approach.

Why not select, instead, a prior “which expresses **complete ignorance** except for the fact that \(\mu_i\) is a **location** parameter and \(\sigma_i\) is a **scale** parameter” [27], [364]. This translates into using a prior \[P_1(\mu_i,\sigma_i\mid I) \propto \sigma_i^{-1},\quad i=1,\ldots,12\] (we will discuss non-informative priors in the next section). For comparison’s sake, we will also consider the prior \[P_2(\mu_i,\sigma_i\mid I) \propto \mu_i^{500}\sigma_i^{-4},\quad i=1,\ldots,12.\]

What could those priors represent, in the real world? What happens to the probabilities when \(\sigma_i\) increases? When \(\mu_i\) increases? Note, as well, that these ”priors” are not normalizable over the positive quadrant in \((\mu,\sigma)-\)space.^{338} Instead, we could only consider them over a suitable finite sub-region; or use the fact that the product of the likelihood and the prior IS normalizable.

#### Likelihood

Let’s denote the number of observations in group \(i\) by \(N_i\). The likelihood is the probability \[P(\text{reported incomes }\{x_{k,i}\} \text{ in group } i \mid \mu_i, \sigma_i; I),\quad i=1,\ldots,12.\]

We have assumed normality for any given observation. If we assume further that all observations are independent, then \[P(\{x_{k,i}\}\mid \mu_i, \sigma_i; I) \propto \prod_{k=1}^{N_i} \sigma_i^{−1} \exp\left(\frac{-(\mu_i-x_{k,i})^2}{2\sigma_i^2}\right),\quad i=1,\ldots,12.\]

#### Posteriors

Combining the prior and the likelihood, we get, for the first prior: \[P_1(\mu_i, \sigma_i \mid \{x_{k,i}\}; I) \propto \sigma_i^{-(N_i+1)}\prod_{k=1}^{N_i} \exp\left(\frac{-(\mu_i-x_{k,i})^2}{2\sigma_i^2}\right),\quad i=1,\ldots,12,\] while for the second prior:

\[P_2(\mu_i, \sigma_i \mid \{x_{k,i}\}; I) \propto \mu_i^{500}\sigma_i^{-(N_i+4)}\prod_{k=1}^{N_i} \exp\left(\frac{-(\mu_i-x_{k,i})^2}{2\sigma_i^2}\right),\quad i=1,\ldots,12,\]

over some suitable sub-region in parameter space.

The joint posterior distributions for \((\mu_1,\sigma_1)\) (one for each of the priors) when \(i=1\) are shown below.

We can read the likely values of each parameters for each scenario by looking at the spikes in the charts below.

### 18.2.3 Money (Dollar Bill Y’All)

The **question:** how many 5$ dollar bills are there in circulation?

The **problem:** we simply cannot count them all. So how do we provide an estimate?

The **solution:** “catch and release”

Capture a few 5$ bills.

Mark them and put them back in circulation.

At some later point, capture a few 5$ bills.

Count how many are marked.

For instance, \(x=500\) bills might have been marked initially; \(y=300\) bills might have been re-captured at stage 3, \(w=127\) of which were marked.

What is the most probable number of bills \(N\) in circulation?

Unlike the previous examples where we were trying to estimate the parameters from the data, in this example we are trying to estimate data from parameters (**generative model**) – we do not compute the likelihood directly.

#### Simple Model

In the simplest model, we might proceed as follows:

We start by drawing a

**large**random sample of # of bills \(N\) from an acceptable “prior” distribution on the parameters.Using the \(N\)s and the generative model (with \(x\) and \(y\) given – the observed values), we produce a (synthetic) # of marked bills \(z\) in each sample.

Finally, we only retain those values of \(N\) for which \(z=w\).

Let us implement this in `R`

using the values of \(x\), \(y\), and \(w\) provided above. We will generate priors using 500,000 replicates:

```
set.seed(1) # for replicability
N.draw = 500000 # number of replicates
x = 500 # number of bills marked in the initial capture
y = 300 # number of bills sampled in the second capture
w = 127 # number of marked bills in the second capture
```

Since \(x=500\) were first “captured”, we know that there are at least 500 bills in circulation. To keep things from getting out of hands, we select a theoretical maximum for the number of bills in circulation.

```
upper.limit = 1500 # maximum (theoretical) number of bills
bin.width = 50 # for plotting the posterior
```

We now draw to create the prior distribution on the possible number of bills \(N_{\text{bills}}\) in circulation:

```
N.bills = sample (x:1500, N.draw, replace=TRUE)
barplot(table(cut(N.bills, seq(x, upper.limit, bin.width))) /
length(N.bills), col = "gray")
```

A priori, all of these are “equally likely”. Now, we use the observed “catch-and-release” data to define the generative model, in which we capture \(x=500\) bills in the first round, and \(y=300\) in the second round:

```
pick.bills <- function(N.bills) {
bills <- rep(0:1, c(N.bills - x, x)) # 0 for un-marked, 1 for marked in the inital capture
sum(sample(bills, y)) # sampling y bills in the second capture
}
```

The number of re-captured bills (for each trial) is simulated below:

In the language of the generative model, `N.marked`

is \(z\). Now, we only keep those trials for which there were \(w=127\) re-captured marked bills, and retain the number of bills in circulation for these trials:

Finally, we plot the posterior distribution:

The summary statistics for the posterior distribution of the number of bills in circulation is thus:

```
[1] 4754
Min. 1st Qu. Median Mean 3rd Qu. Max.
979 1143 1188 1193 1236 1492
```

In other words, out of 500,000 trials, a little fewer than 5000 had the right characteristics (\(x\), \(y\), and \(w\) as observed in the “real world”), and the average/median number of bills in circulations for this smaller subset of trials is a tad below 1200.

The Bayesian situation is illustrated below.^{339}

#### Marked Bills are Brittle Model

It may be the case that the process of marking the bills might damage them somehow, so that they may be retired sooner than one would expect (with probability \(u=90\%\), say).

In this case, we might proceed as follows:

We start by drawing a

**large**random sample of # of bills \(N\) from an acceptable “prior” distribution on the parameters.Using the \(N\)s and the generative model (with \(x\), \(y\), and \(u\) given – the observed values), we produce a (synthetic) # of marked bills \(z\) in each sample.

Finally, we only retain those values of \(N\) for which \(z=w\).

Let us implement this in `R`

using the values of \(x\), \(y\), \(u\), and \(w\) provided above. We will generate priors using 500,000 replicates:

```
set.seed(10) # for replicability
N.draw = 500000 # number of replicates
x = 500 # number of bills marked in the initial capture
y = 300 # number of bills sampled in the second capture
w = 127 # number of marked bills in the second capture
u = 0.9 # probability that marked bills will be retired
upper.limit = 1500 # maximum (theoretical) number of bills
bin.width = 50 # for plotting the posterior
N.bills = sample (x:1500, N.draw, replace=TRUE)
barplot(table(cut(N.bills, seq(x, upper.limit, bin.width))) /
length(N.bills), col = "gray")
```

A priori, all of these are “equally likely” in the brittle scenario too. Now, we use the observed “catch-and-release” data to define the generative model, in which we capture \(x=500\) bills in the first round, and \(y=300\) in the second round, knowing that \(u=0.9\) of first round marked bills will be retired.^{340}

```
pick.bills <- function(N.bills) {
bills <- rep(0:1, c(N.bills - x, x)) # 0 for un-marked, 1 for marked in the inital capture
prob.pick <- ifelse(bills == 0, 1.0, u) # takes care of brittleness
sum(sample(bills, y, prob = prob.pick)) # sampling y bills in the second capture
}
```

The number of re-captured bills (for each trial) is simulated below:

```
N.marked <- rep(NA, N.draw)
for(i in 1:N.draw) {
N.marked[i] <- pick.bills(N.bills[i])
}
# Posterior distribution
post.bills <- N.bills[N.marked == w]
barplot(table(cut(post.bills, seq(x,upper.limit,bin.width))) /
length(post.bills), col = "blue")
```

The summary statistics for the posterior distribution of the number of bills in circulation is thus:

```
[1] 4410
Min. 1st Qu. Median Mean 3rd Qu. Max.
935 1089 1129 1132 1172 1411
```

In other words, out of 500,000 trials, about 4400 had the right characteristics (\(x\), \(y\), \(u\), and \(w\) as observed in the “real world”), and the average/median number of bills in circulations for this smaller subset of trials is a roughly 1130. Does this make sense, given the brittleness assumption?

The Bayesian situation is illustrated below.^{341}

#### Listen to the Banker Model

Let us say that an old banker thinks that there should be about 1000 bills in circulation. How can we incorporate this piece of information?

In this case, we might proceed as follows:

Let us implement this in `R`

using the values of \(x\), \(y\), \(u\), and \(w\) provided above, as well as the expert’s best guess. We will generate priors using 500,000 replicates:

```
set.seed(100) # for replicability
N.draw = 500000 # number of replicates
x = 500 # number of bills marked in the initial capture
y = 300 # number of bills sampled in the second capture
w = 127 # number of marked bills in the sample
u = 0.9 # probability that marked bills will be retired
banker.mean = 1000 # banker guess
upper.limit = 1500 # maximum (theoretical) number of bills
bin.width = 50 # for plotting the posterior
```

We now draw to create the prior distribution on the possible number of bills \(N_{\text{bills}}\) in circulation, using the banker’s experience (instead of a uniform distribution, the prior might follow a binomial distribution with mean \(1000\$\), say).

```
N.bills = rnbinom(N.draw, mu = banker.mean - x, size = w) + x
barplot(table(cut(N.bills, seq(x, upper.limit, bin.width))) /
length(N.bills), col = "gray")
```

```
pick.bills <- function(N.bills) {
bills <- rep(0:1, c(N.bills - x, x)) # 0 for un-marked, 1 for marked in the inital capture
prob.pick <- ifelse(bills == 0, 1.0, u) # takes care of brittleness
sum(sample(bills, y, prob = prob.pick)) # sampling y bills in the second capture
}
```

The number of re-captured bills (for each trial) is simulated below:

```
N.marked <- rep(NA, N.draw)
for(i in 1:N.draw) {
N.marked[i] <- pick.bills(N.bills[i])
}
# Posterior
post.bills <- N.bills[N.marked == w]
barplot(table(cut(post.bills, seq(x,upper.limit,bin.width))) /
length(post.bills), col = "blue")
```

The summary statistics for the posterior distribution of the number of bills in circulation is thus:

```
[1] 5258
Min. 1st Qu. Median Mean 3rd Qu. Max.
893 1031 1057 1058 1083 1209
```

In other words, out of 500,000 trials, about 5250 had the right characteristics (\(x\), \(y\), \(u\), and \(w\) as observed in the “real world”), and the average/median number of bills in circulations for this smaller subset of trials is a roughly 1050. Does this make sense, given the banker’s opinion and the observations?

The Bayesian situation is illustrated below.

### References

*Probability Theory: the Logic of Science*. Cambridge Press, 2003.