18.2 Examples

We take a look at three scenarios that shed some light on the whole Bayesian enterprise:

  • determining if a coin is fair (or not),

  • finding a link between demographic information and salary, and

  • estimating the number of dollar bills in circulation.

These examples will show how priors, likelihood, and posteriors interact.

18.2.1 The Mysterious Coin

A mysterious stranger brings back a souvenir coin from a trip to a strange and distant land. They have been flipping it pretty much non-stop since their return. You can see the proportion of heads they obtained for 4, 8, and 16 tosses.

At first, the coin might seem biased, but the proportion of heads inches its way towards 50%. What is truly going on?

Priors

Perhaps the coin is not fair, coming as it does from a strange and distant land. Let us denote the coin’s bias by \(H\), i.e., the probability of flipping a head on a toss (\(H\approx 0.5\) for regular unbiased coins; \(H\approx 0\) or \(H\approx 1\) for highly biased coins). A prior for this scenario is a probability density function (pdf) \[P(\text{bias}=H)=P(H\mid I).\] Four such priors are shown below.

Figure 18.3: 4 priors for the fair coin problem: I have no idea (top left); I suspect foul play (top right); it’s just a regular coin (bottom left); I think it’s just a regular coin, but the fact that somebody is asking makes me doubtful… (bottom right)

Why are we working with functions for the prior, when in the previous example (Sept. 11 attacks), we only provided a number, \(P(B\mid I)=0.005\%\)? In fact, we had provided a (discrete) function as a prior: \[P(B=x\mid I)=\begin{cases} 0.005\% &\text{if }x=\text{TRUE} \\ 99.995\% &\text{if }x=\text{FALSE} \end{cases} \]

Likelihood

Let us assume that the coin has been tossed \(N\) times in total, and that \(K\) heads have been recorded. In this scenario, Bayes’ Theorem takes the form:

\[P(\text{bias}=H\mid K \text{ heads}, N \text{ tosses};I)\propto P(K \text{ heads}, N \text{ tosses} \mid \text{bias}=H;I) \times P(\text{bias}=H \mid I).\]

The likelihood is the probability of observing \(K\) heads in \(N\) tosses if the bias is \(H\). If, as part of \(I\), we take the tosses to be independent (i.e., the result of one toss does not affect the others), then the likelihood is given by the binomial distribution \[ P(K \text{ heads}, N \text{ tosses} \mid \text{bias}=H;I) = \binom{N}{K}H^K(1-H)^{N-K} .\]
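
In R, for instance, the likelihood can be evaluated over a grid of candidate biases with the built-in dbinom() (a minimal sketch; the toss counts are hypothetical, not the stranger’s record):

H <- seq(0, 1, by = 0.01)  # grid of candidate biases
N <- 16; K <- 5            # hypothetical record: 5 heads in 16 tosses
likelihood <- dbinom(K, size = N, prob = H)  # binomial likelihood at each H
plot(H, likelihood, type = "l", xlab = "bias H", ylab = "likelihood")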

Posteriors

Combining the prior and the likelihood, we get: \[ P(\text{bias}=H \mid K \text{ heads}, N \text{ tosses};I) \propto H^K(1-H)^{N-K}\times P_i(\text{bias}=H\mid I),\] where \(i\) indexes the various prior scenarios described above.

We should thus be able to estimate the bias \(H^*\) by studying the posterior distribution, for each of the 4 priors and for various numbers of tosses \(N\).
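
For illustration, the update can be carried out numerically on a grid (a minimal sketch with a hypothetical toss record; the other priors are obtained by swapping in a different prior vector):

H     <- seq(0.001, 0.999, by = 0.001)   # grid of candidate biases
prior <- rep(1, length(H))               # non-committal prior: constant over the grid
K <- 40; N <- 128                        # hypothetical record: 40 heads in 128 tosses
posterior <- dbinom(K, N, H) * prior     # likelihood times prior
posterior <- posterior / sum(posterior)  # normalize over the grid
plot(H, posterior, type = "l", xlab = "bias H", ylab = "posterior")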

  • With the non-committal prior (blue pdf) \(P_1(\text{bias}=H\mid I) \propto 1\), the posterior is simply proportional to the likelihood; the central limit theorem seems to kick in after \(\approx 30\) tosses.
Figure 18.4: After 128 tosses (with this specific series of tosses) starting with the non-committal prior, we are fairly certain that the coin must be biased (\(0.25\leq H^*\leq 0.40\)?).

  • With the foul play prior (green pdf), we suspect early on that the bias is smaller than 0.5; the subsequent series of tosses quickly moves the bias estimate into \(0.25\leq H^*\leq 0.40\), as was the case with the non-committal prior. Note the narrowing of the posterior with an increasing number of tosses.
Figure 18.5: After 256 tosses (with this specific series of tosses) starting with the foul play prior, we are fairly certain that the coin must be biased (\(0.25\leq H^*\leq 0.40\)?).

  • With the regular coin prior (orange pdf) \(P_3(\text{bias}=H\mid I) \sim \mathcal{N}(0.5,\sigma^2)\), early results do not strongly suggest that the coin is biased (the prior gives little credence to the notion that the bias could lie in \(0.25\leq H^*\leq 0.40\)), but the series of tosses forces the posterior towards a biased distribution (note the smoother convergence of the posterior).
Figure 18.6: After 512 tosses (with this specific series of tosses) starting with the regular coin prior, we are fairly certain that the coin must be biased (\(0.33\leq H^*\leq 0.40\)?).

  • With the doubtful prior (yellow pdf), the competing hypotheses vie with one another before converging to a bias, again in \(0.25\leq H^*\leq 0.40\). The convergence is more haphazard: as soon as a single tail is observed, the process nixes the double-headed coin option (and a single head nixes the double-tailed option). Note the slower (and stranger) convergence to a Gaussian posterior.
Figure 18.7: After 512 tosses (with this specific series of tosses) starting with a doubtful prior, we are fairly certain that the coin must be biased (\(0.33\leq H^*\leq 0.40\)?).

In the fair coin example, it would seem that the choice of a prior does not have much of an effect on the posterior … given enough data. This will not always be the case.

18.2.2 The Salary Question

Income information has been collected for 4782 individuals, together with demographic details: self-reported gender, age group, and education level (1 for post-secondary degree, 0 otherwise). The table below shows some of the summary statistics for the dataset; the dataset is available in Salary.xlsx.

Figure 18.8: Summary statistics for the salary dataset, with categorical breakdown.

Question: is there a link between demographic information and income? How would we answer this question using classical statistical methods? What if we had reason to suspect that reported incomes follow a (potentially) different distribution for each group? Would that change the approach?

In the Bayesian framework, we would be interested in the posterior distribution \[P(\text{parameters}\mid \text{data};i,I),\quad i=1,\ldots, 12.\] If we assume (for no particularly good reason) that the reported incomes are normally distributed within each group, then we seek \[P(\mu_i,\sigma_i\mid \text{reported salaries in group $i$};I),\quad i=1,\ldots, 12.\]

Priors

Determining a reasonable collection of priors \(P(\mu_i,\sigma_i\mid I)\), for \(i=1,\ldots, 12\), is no easy task. One could naively pick a joint distribution which “peaks” at the sample mean \(\overline{x}_i\), with standard deviation \(s_i\), for each group \(i\), but there are sampling design issues associated with this approach.

Why not select, instead, a prior “which expresses complete ignorance except for the fact that \(\mu_i\) is a location parameter and \(\sigma_i\) is a scale parameter” [27], [364]? This translates into using a prior \[P_1(\mu_i,\sigma_i\mid I) \propto \sigma_i^{-1},\quad i=1,\ldots,12\] (we will discuss non-informative priors in the next section). For comparison’s sake, we will also consider the prior \[P_2(\mu_i,\sigma_i\mid I) \propto \mu_i^{500}\sigma_i^{-4},\quad i=1,\ldots,12.\]

Figure 18.9: Two priors for the salary problem. Green represents large probabilities; red, low probabilities. The blue zones represent marginal probabilities of higher values.
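
These priors can be visualized on a grid, mirroring Figure 18.9 (a sketch; the plotting ranges are arbitrary):

mu    <- seq(1, 100, length.out = 200)   # grid for the location parameter
sigma <- seq(1, 50, length.out = 200)    # grid for the scale parameter
log.p1 <- outer(mu, sigma, function(m, s) -log(s))                   # P1 on the log scale
log.p2 <- outer(mu, sigma, function(m, s) 500 * log(m) - 4 * log(s)) # P2 on the log scale
par(mfrow = c(1, 2))
image(mu, sigma, exp(log.p1 - max(log.p1)), main = "P1")
image(mu, sigma, exp(log.p2 - max(log.p2)), main = "P2")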

What could those priors represent, in the real world? What happens to the probabilities when \(\sigma_i\) increases? When \(\mu_i\) increases? Note, as well, that these “priors” are not normalizable over the positive quadrant in \((\mu,\sigma)\)-space. Instead, we can only consider them over a suitable finite sub-region, or use the fact that the product of the likelihood and the prior is itself normalizable.

Likelihood

Let’s denote the number of observations in group \(i\) by \(N_i\). The likelihood is the probability \[P(\text{reported incomes }\{x_{k,i}\} \text{ in group } i \mid \mu_i, \sigma_i; I),\quad i=1,\ldots,12.\]

We have assumed normality for any given observation. If we assume further that all observations are independent, then \[P(\{x_{k,i}\}\mid \mu_i, \sigma_i; I) \propto \prod_{k=1}^{N_i} \sigma_i^{-1} \exp\left(\frac{-(\mu_i-x_{k,i})^2}{2\sigma_i^2}\right),\quad i=1,\ldots,12.\]
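
In code, the log-likelihood for a single group might look as follows (a sketch; xs stands for the vector of reported incomes in the group):

log.lik <- function(mu, sigma, xs) {
  # log of prod_k sigma^(-1) exp(-(mu - x_k)^2 / (2 sigma^2)), up to an additive constant
  -length(xs) * log(sigma) - sum((mu - xs)^2) / (2 * sigma^2)
}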

Posteriors

Combining the prior and the likelihood, we get, for the first prior: \[P_1(\mu_i, \sigma_i \mid \{x_{k,i}\}; I) \propto \sigma_i^{-(N_i+1)}\prod_{k=1}^{N_i} \exp\left(\frac{-(\mu_i-x_{k,i})^2}{2\sigma_i^2}\right),\quad i=1,\ldots,12,\] while for the second prior:

\[P_2(\mu_i, \sigma_i \mid \{x_{k,i}\}; I) \propto \mu_i^{500}\sigma_i^{-(N_i+4)}\prod_{k=1}^{N_i} \exp\left(\frac{-(\mu_i-x_{k,i})^2}{2\sigma_i^2}\right),\quad i=1,\ldots,12,\]

over some suitable sub-region in parameter space.
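
A grid-based sketch of the first posterior, with synthetic salaries standing in for group \(i=1\) (the actual values live in Salary.xlsx; everything below is hypothetical), using the log.lik() function defined above:

set.seed(0)
xs <- rnorm(50, mean = 60, sd = 15)     # synthetic stand-in for group 1 incomes
mu    <- seq(50, 70, length.out = 200)  # grid for the location parameter
sigma <- seq(8, 25, length.out = 200)   # grid for the scale parameter
log.post <- outer(mu, sigma, Vectorize(function(m, s)
  log.lik(m, s, xs) - log(s)))          # likelihood times the prior P1, on the log scale
image(mu, sigma, exp(log.post - max(log.post)),  # posterior, up to a constant
      xlab = expression(mu), ylab = expression(sigma))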

The joint posterior distributions for \((\mu_1,\sigma_1)\) (one for each of the priors) when \(i=1\) are shown below.

Figure 18.10: Posteriors for the salary problem (one for each of the priors), for group \(i=1\). Green represents large probabilities; red, low probabilities. Note the shape of the posteriors. The blue zones represent marginal probabilities of higher values.

We can read off the likely values of each parameter, in each scenario, from the peaks in the charts below.

Figure 18.11: Marginal posteriors for the salary problem (one for each of the priors), for group \(i=1\). Note the differences in the distributions for each scenario.

18.2.3 Money (Dollar Bill Y’All)

The question: how many $5 bills are there in circulation?

The problem: we simply cannot count them all. So how do we provide an estimate?

The solution: “catch and release”

  1. Capture a few $5 bills.

  2. Mark them and put them back in circulation.

  3. At some later point, capture a few $5 bills.

  4. Count how many are marked.

For instance, \(x=500\) bills might have been marked initially; \(y=300\) bills might have been re-captured at stage 3, \(w=127\) of which were marked.

What is the most probable number of bills \(N\) in circulation?

Unlike the previous examples, where we were trying to estimate the parameters from the data, in this example we simulate data from the parameters (a generative model); we do not compute the likelihood directly, retaining instead only those simulations that match the observations (a simple form of rejection sampling).

Simple Model

In the simplest model, we might proceed as follows:

Figure 18.12: Catch-and-release schematics in the simple model.

  1. We start by drawing a large random sample of candidate numbers of bills \(N\) from an acceptable “prior” distribution on the parameters.

  2. Using these \(N\) and the generative model (with the observed values \(x\) and \(y\) given), we produce a (synthetic) number of marked bills \(z\) in each sample.

  3. Finally, we only retain those values of \(N\) for which \(z=w\).

Let us implement this in R using the values of \(x\), \(y\), and \(w\) provided above. We will generate priors using 500,000 replicates:

set.seed(1) # for replicability
N.draw = 500000    # number of replicates
x = 500 # number of bills marked in the initial capture
y = 300 # number of bills sampled in the second capture
w = 127 # number of marked bills in the second capture

Since \(x=500\) bills were first “captured”, we know that there are at least 500 bills in circulation. To keep things from getting out of hand, we select a theoretical maximum for the number of bills in circulation.

upper.limit = 1500 # maximum (theoretical) number of bills
bin.width = 50     # for plotting the posterior

We now draw uniformly to create the prior distribution on the possible number of bills \(N_{\text{bills}}\) in circulation:

N.bills = sample(x:upper.limit, N.draw, replace = TRUE) # uniform prior
barplot(table(cut(N.bills, seq(x, upper.limit, bin.width))) / 
          length(N.bills), col = "gray")

A priori, all of these are “equally likely”. Now, we use the observed “catch-and-release” data to define the generative model, in which we capture \(x=500\) bills in the first round, and \(y=300\) in the second round:

pick.bills <- function(N.bills) {
  bills <- rep(0:1, c(N.bills - x, x)) # 0 for un-marked, 1 for marked in the initial capture
  sum(sample(bills, y)) # sampling y bills in the second capture       
}

The number of re-captured bills (for each trial) is simulated below:

N.marked <- rep(NA, N.draw) 
for (i in 1:N.draw) {
  N.marked[i] <- pick.bills(N.bills[i])
}

In the language of the generative model, N.marked is \(z\). Now, we only keep those trials for which there were \(w=127\) re-captured marked bills, and retain the number of bills in circulation for these trials:

post.bills <- N.bills[N.marked == w]

Finally, we plot the posterior distribution:

barplot(table(cut(post.bills, seq(x,upper.limit,bin.width))) / 
          length(post.bills), col = "blue")

The summary statistics for the posterior distribution of the number of bills in circulation are thus:

length(post.bills)
summary(post.bills)
[1] 4754
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    979    1143    1188    1193    1236    1492 

In other words, out of 500,000 trials, a little fewer than 5000 had the right characteristics (\(x\), \(y\), and \(w\) as observed in the “real world”), and the average/median number of bills in circulation for this smaller subset of trials is a tad below 1200.
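
As a sanity check, this is consistent with the classical Lincoln–Petersen capture–recapture estimate \(\hat{N} = xy/w\):

x * y / w # classical capture-recapture estimate
[1] 1181.102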

The Bayesian situation is illustrated below.

Marked Bills are Brittle Model

It may be the case that the process of marking the bills damages them somehow, so that they are retired sooner than one would expect; say that a marked bill remains in circulation (and so can be re-captured) with probability \(u=90\%\).

In this case, we might proceed as follows:

Figure 18.13: Catch-and-release schematics in the brittle model.

  1. We start by drawing a large random sample of candidate numbers of bills \(N\) from an acceptable “prior” distribution on the parameters.

  2. Using these \(N\) and the generative model (with the observed values \(x\) and \(y\), and the survival probability \(u\), given), we produce a (synthetic) number of marked bills \(z\) in each sample.

  3. Finally, we only retain those values of \(N\) for which \(z=w\).

Let us implement this in R using the values of \(x\), \(y\), \(u\), and \(w\) provided above. We will generate priors using 500,000 replicates:

set.seed(10) # for replicability
N.draw = 500000    # number of replicates
x = 500 # number of bills marked in the initial capture
y = 300 # number of bills sampled in the second capture
w = 127 # number of marked bills in the second capture
u = 0.9 # probability that a marked bill survives (is not retired)
upper.limit = 1500 # maximum (theoretical) number of bills
bin.width = 50     # for plotting the posterior
N.bills = sample(x:upper.limit, N.draw, replace = TRUE) # uniform prior
barplot(table(cut(N.bills, seq(x, upper.limit, bin.width))) / 
          length(N.bills), col = "gray")

Figure 18.14: Prior distribution on the number of bills in circulation, in the brittle model.

A priori, all of these are “equally likely” in the brittle scenario too. Now, we use the observed “catch-and-release” data to define the generative model, in which we capture \(x=500\) bills in the first round and \(y=300\) in the second round, knowing that a marked bill survives in circulation with probability \(u=0.9\).

pick.bills <- function(N.bills) {
  bills <- rep(0:1, c(N.bills - x, x)) # 0 for un-marked, 1 for marked in the initial capture
  prob.pick <- ifelse(bills == 0, 1.0, u) # marked bills get relative weight u (brittleness)
  sum(sample(bills, y, prob = prob.pick)) # sampling y bills in the second capture       
}

The number of re-captured bills (for each trial) is simulated below:

N.marked <- rep(NA, N.draw) 
for (i in 1:N.draw) {
  N.marked[i] <- pick.bills(N.bills[i])
}

# Posterior distribution
post.bills <- N.bills[N.marked == w]
barplot(table(cut(post.bills, seq(x,upper.limit,bin.width))) / 
          length(post.bills), col = "blue")

The summary statistics for the posterior distribution of the number of bills in circulation are thus:

length(post.bills)
summary(post.bills)
[1] 4410
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    935    1089    1129    1132    1172    1411 

In other words, out of 500,000 trials, about 4400 had the right characteristics (\(x\), \(y\), \(u\), and \(w\) as observed in the “real world”), and the average/median number of bills in circulation for this smaller subset of trials is roughly 1130. Does this make sense, given the brittleness assumption?
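
It does: if we assume that re-capture odds are roughly proportional to the sampling weights, the expected fraction of marked bills in the second capture is about \(ux/(ux+N-x)\); setting this equal to \(w/y\) and solving for \(N\) yields a rough (non-simulated) estimate:

x + u * x * (y - w) / w # brittleness-adjusted rough estimate
[1] 1112.992

Since marked bills are now less likely to be re-captured, observing the same \(w=127\) marked bills points to a smaller population than in the simple model.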

The Bayesian situation is illustrated below.

Listen to the Banker Model

Let us say that an old banker thinks that there should be about 1000 bills in circulation. How can we incorporate this piece of information?

In this case, we might proceed as follows:

Figure 18.15: Catch-and-release schematics in the expert model.

Let us implement this in R using the values of \(x\), \(y\), \(u\), and \(w\) provided above, as well as the expert’s best guess. We will generate priors using 500,000 replicates:

set.seed(100) # for replicability
N.draw = 500000 # number of replicates
x = 500 # number of bills marked in the initial capture
y = 300 # number of bills sampled in the second capture
w = 127 # number of marked bills in the second capture
u = 0.9 # probability that a marked bill survives (is not retired)
banker.mean = 1000 # banker's guess
upper.limit = 1500 # maximum (theoretical) number of bills
bin.width = 50     # for plotting the posterior

We now draw to create the prior distribution on the possible number of bills \(N_{\text{bills}}\) in circulation, using the banker’s experience (instead of a uniform distribution, the prior might follow a shifted negative binomial distribution with mean 1000 bills, say).

N.bills = rnbinom(N.draw, mu = banker.mean - x, size = w) + x # shifted so that N.bills >= x
barplot(table(cut(N.bills, seq(x, upper.limit, bin.width))) / 
          length(N.bills), col = "gray")
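
A quick check that the simulated prior is indeed centered near the banker’s guess (the negative binomial is shifted by \(x\) so that no draw falls below the number of initially marked bills):

mean(N.bills) # should be close to banker.mean = 1000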

pick.bills <- function(N.bills) {
  bills <- rep(0:1, c(N.bills - x, x)) # 0 for un-marked, 1 for marked in the initial capture
  prob.pick <- ifelse(bills == 0, 1.0, u) # marked bills get relative weight u (brittleness)
  sum(sample(bills, y, prob = prob.pick)) # sampling y bills in the second capture       
}

The number of re-captured bills (for each trial) is simulated below:

N.marked <- rep(NA, N.draw) 
for (i in 1:N.draw) {
  N.marked[i] <- pick.bills(N.bills[i])
}

# Posterior
post.bills <- N.bills[N.marked == w]
barplot(table(cut(post.bills, seq(x,upper.limit,bin.width))) / 
          length(post.bills), col = "blue")

The summary statistics for the posterior distribution of the number of bills in circulation are thus:

length(post.bills)
summary(post.bills)
[1] 5258
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    893    1031    1057    1058    1083    1209 

In other words, out of 500,000 trials, about 5250 had the right characteristics (\(x\), \(y\), \(u\), and \(w\) as observed in the “real world”), and the average/median number of bills in circulation for this smaller subset of trials is roughly 1050. Does this make sense, given the banker’s opinion and the observations?

The Bayesian situation is illustrated below.

References

[27]
E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[364]
T. Oliphant, “A Bayesian perspective on estimating mean, variance, and standard-deviation from data,” vol. 278. All Faculty Publications, BYU, 2006.