18.5 Additional Topics

According to [369],

the central feature of Bayesian inference is the direct quantification of uncertainty.

The Bayesian approach to modeling uncertainty is particularly useful when:

  • the available data is limited;

  • there is some concern about overfitting;

  • some facts are more likely to be true than others, but that information is not contained in the data, or

  • the precise likelihood of certain facts is more important than solely determining which fact is most likely (or least likely).

As discussed previously, Bayesian methods have a number of powerful features. They allow analysts to

  • incorporate specific previous knowledge about parameters of interest;

  • logically update knowledge about the parameter after observing sample data;

  • make formal probability statements about parameters of interest;

  • specify model assumptions and check model quality and sensitivity to these assumptions in a straightforward manner;

  • provide probability distributions rather than point estimates, and

  • treat the data values in the sample as exchangeable.

18.5.1 Uncertainty

The following example represents a Bayesian approach to dealing with the uncertainty of the so-called envelope paradox.

Example: you are given two indistinguishable envelopes, each containing a cheque, one written for twice the amount of the other. You may pick one envelope and keep the money it contains. Having chosen an envelope at will, but before inspecting it, you are given the chance to switch envelopes. Should you switch? What is the expected outcome of doing so? Explain how this game leads to infinite cycling.

Solution: let \(V\) be the (unknown) value found in the envelope after the first selection. The other envelope then contains either \(\frac{1}{2}V\) or \(2V\), both with probability \(0.5\), and the expected value of trading is \[E[\text{trade}]=0.5\times \frac{1}{2}V + 0.5 \times 2V = \frac{5}{4}V>V;\] and so it appears that trading is advantageous.

Let the (still unknown) value of the cheque in the new envelope be \(W\). The same argument shows that the expected value of trading that envelope is \(\frac{5}{4}W>W\), so it would make sense to trade the envelope once more, and yet once more, and so on, leading to infinite cycling.

There is a Bayesian approach to the problem, however. Let \(V\) be the (uncertain) value in the original selection, and \(W\) be the (also uncertain) value in the second envelope. A proper resolution requires a joint (prior) distribution for \(V\) and \(W\). Now, in the absence of any other information, the most we can say about this distribution using the maximum entropy principle is that \(P(V<W)=P(V>W)=0.5\).

By definition, if \(V<W\), then \(W=2V\); if, on the other hand, \(V>W\), then \(W=\frac{V}{2}\). We now show that the expected value in both envelopes is the same, and thus that trading envelopes is no better a strategy than keeping the original selection. Using the law of total expectation, we compute \[\begin{aligned} E[W]&=E[W|V<W]P(V<W) + E[W|V>W]P(V>W) \\ &=E[2V|V<W]\cdot 0.5+E[0.5V|V>W]\cdot 0.5 \\ &= E[V|V<W]+0.25\cdot E[V|V>W],\end{aligned}\] while \[\begin{aligned} E[V]&=E[V|V<W]P(V<W) + E[V|V>W]P(V>W) \\ &=0.5\cdot E[V|V<W]+ 0.5\cdot E[V|V>W].\end{aligned}\]

Before we can proceed any further, we must have some information about the joint distribution \(P(V,W)\) (note, however, that \(E[W]\) will not typically be equal to \(\frac{5}{4}V\), as had been assumed at the start of the solution). The domain \(\Omega\) of the joint probability consists of those pairs \((V,W)\) satisfying \(V=2W\) \((V>W)\) or \(W=2V\) (\(V<W\)) for \(0<V,W<M\), where \(M<\infty\) is some upper limit on the value of each cheque.

We have assumed that the probability weight on each branch of \(\Omega\) is 1/2; if we further assume, say, that the cheque value is as likely to be any of the allowable values on these branches, then the joint distribution is \[P(V,W)=\begin{cases} \frac{1}{M} & \text{if $V<W$} \\ \frac{1}{2M} & \text{if $V>W$} \\ 0 & \text{otherwise} \end{cases}\] and the conditional expectations listed above are \[E[V|V<W]=\frac{1}{P(V<W)}\int_{V<W}\!\!\!\!\!\!\!\!\!\!\! V\cdot P(V,W)\, d\Omega = 2\int_{0}^{M/2}\!\!\!\!\!\!\!\!V\cdot\frac{1}{M}\, dV=\frac{M}{4}\] and \[E[V|V>W]=\frac{1}{P(V>W)}\int_{V>W}\!\!\!\!\!\!\!\!\!\!\! V\cdot P(V,W)\, d\Omega = 2\int_{0}^{M}\!\!\!\!V\cdot\frac{1}{2M}\, dV=\frac{M}{2}.\]

Therefore, \[E[W]=\frac{M}{4}+0.25\cdot \frac{M}{2}=\frac{3M}{8}\] and \[E[V]=0.5\cdot \frac{M}{4}+0.5\cdot \frac{M}{2}=\frac{3M}{8},\] and switching envelopes does not change the expected value of the outcome.

There is no paradox; no infinite cycling.
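The equality of the two expectations can be double-checked by simulation; a minimal R sketch, assuming an arbitrary upper limit \(M=1\) on the cheque values:

```r
# Monte Carlo check of the envelope calculation, with (arbitrary) upper
# limit M = 1 on the cheque values
set.seed(108)
M <- 1
n <- 1e6
# with probability 1/2, the first envelope holds the smaller cheque (V < W)
smaller.first <- runif(n) < 0.5
V <- ifelse(smaller.first, runif(n, 0, M/2), runif(n, 0, M))
W <- ifelse(smaller.first, 2 * V, V / 2)
mean(V)  # the two sample means agree:
mean(W)  # switching envelopes does not change the expected outcome
```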

Example: after the sudden deaths of her two baby sons, Sally Clark was convicted of their murder by a U.K. court and sentenced to life in prison in 1999. Among other errors, expert witness Sir Roy Meadow had wrongly interpreted the small probability of two cot deaths as a small probability of Clark’s innocence. After a long campaign, which included the refutation of Meadow’s figures using Bayesian arguments, Clark was released in 2003. While Clark’s innocence could not be proven beyond the shadow of a doubt using such methods, her culpability could also not be established beyond a reasonable doubt, and she was cleared. An interesting write-up of the situation can be found online [370].

18.5.2 Bayesian A/B Testing

\(A/B\) testing is an excellent tool for deciding whether or not to roll out incremental features. To perform an \(A/B\) test, we randomly divide users into a test group and a control group, then provide the new feature to the test group while letting the control group continue to experience the current version of the product.

If the randomization procedure is appropriate, we may be able to attribute any difference in outcomes between the two groups to the changes being rolled out, without having to account for other sources of variation affecting user behaviour. Before acting on the results, however, it is important to understand the likelihood that any observed differences are merely due to chance rather than to the product modification.

For example, it is perfectly possible to obtain different \(H/T\) ratios between two fair coins if we only conduct a limited number of tosses; in the same manner, it is possible to observe a change between the \(A\) and \(B\) groups even if the underlying user behaviour is identical.
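As a quick R illustration (the seed is arbitrary), two fair coins tossed 16 times each will usually land on different head counts:

```r
# head counts for two fair coins, 16 tosses each; the coins are identical,
# yet the observed counts typically differ
set.seed(42)
heads <- rbinom(2, size = 16, prob = 0.5)
heads
```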

Example: (modified from [371]) Wakefield Tiles is a company that sells floor tiles by mail order. They are trying to become an active player in the lucrative Chelsea market by offering a new type of tile to the region’s contractors.

The marketing department have conducted a pilot study and tried two different marketing methods:

  • \(A\) – sending a colourful brochure in the mail to invite contractors to visit the company’s showroom;

  • \(B\) – sending a colourful brochure in the mail to invite contractors to visit the company’s showroom, while including free tile samples.

The marketing department sent out 16 mail packages of type \(A\) and 16 mail packages of type \(B\). Four Chelseaites that received a package of type \(A\) visited the showroom, while 8 of those receiving a package of type \(B\) did the same.

The company is aware that:

  • a mailing of type \(A\) costs 30$ (includes the printing cost and postage);

  • a mailing of type \(B\) costs 300$ (additionally includes the cost of the free tile samples);

  • a visit to the showroom yields, on average, 1000$ in revenue during the next year.

Which of the methods (\(A\) or \(B\)) is most advantageous to Wakefield Tiles?

Solution: the Bayesian approach requires the construction of a prior distribution and of a generative model; as part of the generative model, we will need to produce \(n\) replicates of samples from the binomial distribution (which can be achieved in R using rbinom(n, size, prob)).

A call to rbinom(n, size, prob) generates n draws of the number of “successes” obtained in size independent trials (mailings), where each trial has success probability prob. A commonly-used prior for prob is the uniform distribution \(U(0,1)\), from which we can sample in R via runif(1, min = 0, max = 1).
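For instance, one round of the prior-then-likelihood sampling (with an arbitrary seed) looks as follows:

```r
# draw a success probability from the uniform prior, then simulate the
# number of showroom visitors resulting from 16 mailings
set.seed(99)
p <- runif(1, min = 0, max = 1)
visitors <- rbinom(1, size = 16, prob = p)
c(p = p, visitors = visitors)
```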

We start by setting a seed for replicability, and set the number of replicates (trials).

set.seed(1111) # for replicability
n.draws <- 200000

Next, we generate a probability of success for mailings \(A\) and \(B\), for each of the replicates.

prior <- data.frame(p.A = runif(n.draws, 0, 1), 
                    p.B = runif(n.draws, 0, 1))

The generative model tells us how many visitors to expect for mailing types \(A\), \(B\), for each replicate.

generative.model <- function(p.A, p.B) {
  visitors.A <- rbinom(1, 16, p.A)
  visitors.B <- rbinom(1, 16, p.B)
  c(visitors.A = visitors.A, visitors.B = visitors.B)
}

We then simulate data using the parameters from the prior and the generative model; this yields a simulated number of visitors for each replicate.

sim.data <- as.data.frame( t(sapply(1:n.draws, function(i) {
  generative.model(prior$p.A[i], prior$p.B[i])})))

Only those prior probabilities for which the generative model reproduces the observed data are retained.

posterior <- prior[sim.data$visitors.A == 4 & sim.data$visitors.B == 8, ] 

In this case there are enough trials to ensure that the posterior is non-empty; what could be done if that were not the case?
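One common remedy is to relax the matching condition: keep the prior draws whose simulated counts land near the observed values, in the spirit of approximate Bayesian computation. A self-contained sketch (regenerating the simulation in vectorized form; posterior.abc is a name chosen here for illustration):

```r
# If exact matching retains too few draws, keep the draws whose simulated
# counts land *near* the observed values instead -- an approximate-
# Bayesian-computation (ABC) style tolerance.
set.seed(1111)
n.draws <- 200000
prior <- data.frame(p.A = runif(n.draws, 0, 1),
                    p.B = runif(n.draws, 0, 1))
# rbinom() recycles prob, so the sapply() loop can be vectorized
sim.A <- rbinom(n.draws, size = 16, prob = prior$p.A)
sim.B <- rbinom(n.draws, size = 16, prob = prior$p.B)
# tolerance of +/- 1 visitor around the observations (4 for A, 8 for B)
posterior.abc <- prior[abs(sim.A - 4) <= 1 & abs(sim.B - 8) <= 1, ]
nrow(posterior.abc)
```

A wider tolerance retains more draws at the cost of a coarser approximation to the true posterior.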

Next, we visualize the posteriors:

par(mfrow = c(1,3))
hist(posterior$p.A, main = "Posterior -- mailing A", xlab="p.A") 
hist(posterior$p.B, main = "Posterior -- mailing B", xlab="p.B")
plot(posterior,main = "Success for mailing types A and B", xlab="p.A", ylab="p.B")

The histograms and the scatter plot display the posterior distributions of the probability of success for each mailing type.

To estimate the average profit for each mailing type, we combine the posterior success probabilities with the mailing costs and the expected revenue per visit:

par(mfrow = c(1,2))
avg.profit.A <- -30 + posterior$p.A * 1000 
avg.profit.B <- -300 + posterior$p.B * 1000 
hist(avg.profit.A, main = "Average Profit -- mailing A", xlab="profit.A") 
hist(avg.profit.B, main = "Average Profit -- mailing B", xlab="profit.B")

Finally, we compute and display the expected difference in average profits between the two mailing types:

hist(avg.profit.A - avg.profit.B, main="Posterior -- profit A - profit B")
(expected.avg.profit.diff <- mean(avg.profit.A - avg.profit.B))
abline(v = expected.avg.profit.diff, col = "red", lwd = 2)

[1] 59.13869

The expected profit for mailing type \(A\) is around 60$ higher than for mailing type \(B\) (numbers may vary, depending on the seed). Keeping it simple seems to be a better idea in this context.
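As a sanity check, note that the uniform prior is conjugate to the binomial likelihood: with \(x\) successes in \(n\) trials, the exact posterior for prob is Beta\((x+1,\,n-x+1)\). A self-contained sketch drawing directly from these exact posteriors:

```r
# Conjugacy: under a U(0,1) prior, x successes in n trials give a
# Beta(x + 1, n - x + 1) posterior -- Beta(5, 13) for A, Beta(9, 9) for B
set.seed(2222)
n.draws <- 200000
p.A <- rbeta(n.draws, 4 + 1, 16 - 4 + 1)
p.B <- rbeta(n.draws, 8 + 1, 16 - 8 + 1)
profit.A <- -30 + p.A * 1000
profit.B <- -300 + p.B * 1000
mean(profit.A - profit.B)  # expected profit advantage of mailing A
mean(profit.A > profit.B)  # posterior probability that A out-performs B
```

The exact expected difference is \((-30+1000\cdot\frac{5}{18})-(-300+1000\cdot\frac{1}{2})\approx 47.8\)$, in the same direction as the rejection-sampling estimate, which is noisier since only a small fraction of the 200,000 prior draws survive the matching step.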

References

[369]
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis (3rd ed.). CRC Press, 2013.
[370]
[371]