18.3 Prior Distributions

Specifying a model means, by necessity, providing a prior distribution for the unknown parameters \(\boldsymbol{\theta}\). The prior plays a critical role in Bayesian inference through the updating statement : \[P(\boldsymbol{\theta}\mid D) \propto P(\boldsymbol{\theta}) \times P(D\mid \boldsymbol{\theta}).\] In the Bayesian approach, all unknown quantities are described probabilistically, even before the data has been observed.

All priors are subjective in the sense that the decision to use any prior is left completely up to the researcher. But the choice of priors is no more subjective than the choice of likelihood, the selection or collection of a sample, the estimation, or the statistic used for data reduction. The choice of a prior can substantially affect posterior conclusions, however, especially when the sample size is not large.

We now examine several broad methods of determining prior distributions.

18.3.1 Conjugate Priors

The main challenge of Bayesian methods is that the posterior distribution of the vector \(\boldsymbol{\theta}\) might not have an analytical form. Specifically, producing marginal posterior distributions from high-dimensional posteriors by repeated analytical integration may be difficult or even impossible mathematically.

There are exceptions however, providing easily obtainable computational posteriors through the use of a conjugate prior. Conjugacy is a joint property of a prior and a likelihood that implies that the posterior distribution has the same distributional form as the prior, but with different parameter(s).

The table below represents some common likelihoods and their conjugate priors (an extensive list can be found in [365]).

Likelihood Prior Hyperparameters
Bernoulli Beta \(\alpha>0,\beta>0\)
Binomial Beta \(\alpha>0,\beta>0\)
Poisson Gamma \(\alpha >0\), \(\beta>0\)
Normal for \(\mu\) Normal \(\mu\in\mathbb{R}\), \(\sigma^{2}>0\)
Normal for \(\sigma^{2}\) Inverse Gamma \(\alpha >0\), \(\beta>0\)
Exponential Gamma \(\alpha >0\), \(\beta>0\)

For instance, if the probability of \(s\) successes in \(n\) trials (the likelihood) is given by \[P(s,n \mid q)=\frac{n!}{s!(n-s)!}q^s(1-q)^{n-s}, \quad q\in [0,1],\] and the prior probability for \(q\) follows a \(\text{Beta}(\alpha,\beta)\) distribution with \(\alpha>0, \beta>0\) (so that \[P(q)=\frac{q^{\alpha-1}(1-q)^{\beta-1}}{B(\alpha,\beta)},\] for \(q\in [0,1]\)), then the posterior distribution for \(q\) given \(s\) successes in \(n\) trials follows a \(\text{Beta}(\alpha+s,\beta+n-s)\) distribution (so that \[P(q \mid s,n)=\frac{P(s,n \mid q)\times P(q)}{P(s,n)}=\frac{q^{\alpha+s-1}(1-q)^{\beta+n-s-1}}{B(\alpha+s,\beta+n-s)}\] for \(q\in [0,1]\)).

Conjugate priors are mathematically convenient, and they can be quite flexible, depending on the specific hyperparameters we use; but they reflect very specific prior knowledge and should be eschewed unless we truly possess that prior knowledge.

18.3.2 Uninformative Priors

An uninformative prior (perhaps better called an objective prior) is one in which little new explanatory power about the unknown parameter is provided by intention. Uninformative priors are very useful from the perspective of traditional Bayesianism seeking to mitigate the frequentist criticism of intentional subjectivity. These priors intentionally provide very little specific information about the parameter(s).

The rationale for using uninformative prior distributions is often said to be ‘to let the data speak for itself,’ so that inferences are unaffected by information external to the current data.

A classic uninformative prior is the uniform prior. A proper uniform prior integrates to a finite quantity and is thus normalizable. For example, for data following a \(\text{Bernoulli}(\theta)\) distribution, a uniform prior on \(\theta\) is \[P(\theta) = 1,\quad 0\leq\theta\leq 1.\] This approach makes sense when \(\theta\) has bounded support. But for data following a \(N(\mu,1)\) distribution, the uniform prior on the support of \(\mu\) is improper as \[P(\mu)=1, \quad -\infty<\mu<\infty\] diverges; however, such a choice could still be acceptable as long as the resulting posterior is normalizable (i.e., the integral of the posterior converges on its support). As there are instances where an improper prior yields an improper posterior, care is warranted.

This is also called the principle of indifference, which states that with no evidence one way or another, degrees of belief should be distributed equally among all the considered outcomes (but Bertrand’s Paradox provides doubt as to the validity of this principle).

Note that there are plenty of situations where the uniform prior is NOT an appropriate prior; such a prior makes assumptions about the distribution of the parameters of interest that fall squarely in the subjective camp. The use of uniform priors is quite often justified solely on the basis of convenience device (since the posterior is then simply proportional to the likelihood).

The Jeffreys prior is an approach to generate uninformative priors. For a given random parameter \(\boldsymbol{\theta}\), the Jeffreys prior is \[P(\boldsymbol{\theta}\mid I)\propto \sqrt{\det \mathcal{I}(\boldsymbol{\theta})}, \] where \(\mathcal{I}(\boldsymbol{\theta})\) represents the Fisher information, which measures the amount of information that an observable random variable \(X\) implies about an unknown parameter vector \(\boldsymbol{\theta}\) (i.e., we are interested in \(P(X\mid \boldsymbol{\theta})\)). Let \(f(X\mid \boldsymbol{\theta})\) be the corresponding p.d.f./p.m.f.;

\[ \bigl[\mathcal{I}(\boldsymbol{\theta}) \bigr]_{i, j} = -\operatorname{E}\left[\left. \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log f(X\mid \boldsymbol{\theta}) \right|\boldsymbol{\theta}\right]\,.\]

Note that the Jeffreys prior depends on underlying statitistical model:

  • if \(X\) follows a normal distribution \(\mathcal{N}(\mu,\sigma^2)\), with \(\sigma\) fixed, and all we assume is that \(\mu\) is a location parameter, then the Jeffrey prior would be \[P(\mu\mid I) \propto 1,\] an improper uniform distribution (all locations are equally likely to be the mean);

  • if \(X\) follows a normal distribution \(\mathcal{N}(\mu,\sigma^2)\), with \(\mu\) fixed, and all we assume is that \(\sigma>0\) is a scale parameter, then the prior would be \[P(\sigma\mid I) \propto \frac{1}{\sigma},\] again an improper distribution, but one for which a dispsersion \(\sigma\) becomes progressively less likely as it increases;

  • if \(X\) follows a Poisson distribution \(\mathcal{P}(\lambda)\), with \(\mu\) fixed, and all we assume is that \(\lambda\geq 0\), then the prior would be \[P(\lambda\mid I) \propto \frac{1}{\sqrt{\lambda}};\] once again improper.

In contrast, a weakly informative prior is one for which only partial information about a variable is available; the choice of a uniform prior is often weakly informative.

We shall discuss another uninformative approach, the Maximum Entropy prior, in a subsequent section.

18.3.3 Informative Priors

Informative priors are those that deliberately insert information that researchers have at hand. This seems like a reasonable approach since previous scientific knowledge should play a role in doing statistical inference.

However, there are two important requirements for researchers:

  1. overt declaration of prior specification, and

  2. detailed sensitivity analysis to show the effect of these priors relative to uninformed types.

Transparency is required to avoid the common pitfall of data fishing; sensitivity analysis can provide a sense of exactly how informative the prior is. But where do informative priors come from, in the first place?

Generally these priors are derived from:

  • past studies, published work, researcher intuition;

  • interviewing domain experts;

  • convenience with conjugacy, and

  • non-parametric and other data-derived sources.

Prior information from past studies need not be in agreement. One useful strategy is to construct prior specifications from competing school-of-thoughts in order to contrast the resulting posteriors and produce informed statements about the relative strength of each of them.

Example: we have noted previously that a Bernoulli likelihood and a Beta prior form a set of conjugate priors. For this exercise, we use the R function BernBeta() defined in [366] (notice that the function returns the posterior beta values each time it is called, so returned values can be fed back into the prior in a subsequent function call).

  1. Start with a prior distribution that expresses some uncertainty that a coin is fair: \(\text{Beta}(\theta \mid 4, 4)\). Flip the coin once; assume that a Head is obtained. What is the posterior distribution of the uncertainty in the coin’s fairness \(\theta\)?

Solution: we know, on theoretical grounds, that the posterior follows a \[\text{Beta}(\theta \mid 4+1,4+1-1;I)=\text{Beta}(\theta \mid 5,4;I)\] distribution.

The label on the \(y-\)axis of the posterior distribution provides the posterior parameters.

post = BernBeta( c(4,4) , c(1) )
show(post)

[1] 5 4

This function uses the conjugacy between the Bernoulli (likelihood) and the Beta (prior) distributions to determine the posterior distribution Beta for the uncertainty in the fairness of the coin (1 represents a H(ead) on the flip, 0 a T(ail)).

  1. Use the posterior parameters from the previous flip as the prior for the next flip. Suppose we flip again and get a H. What is the new posterior on the uncertainty in the coin’s fairness?

Solution: on theoretical grounds, the posterior is \[\text{Beta}(\theta \mid 6,4;I),\] which is shown below.

post = BernBeta( post , c(1) )
show(post)

[1] 6 4
  1. Using the most recent posterior as the prior for the next flip, flip a third time and obtain yet again a H. What is the new posterior?

Solution: in this case, we know that the posterior for the coin’s fairness follows a \(\text{Beta}(\theta \mid 7,4; I)\) distribution.

post = BernBeta( post , c(1) )
show(post)

[1] 7 4

Should flipping 3 H in a row give us pause? Is there enough evidence to suggest that \(\theta\neq 0.5\) (i.e, that the coin is not fair)? What if we flipped 18 H in a row from this point on?

post = BernBeta( post , c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1) )
show(post)

[1] 25  4

When working on a problem, it can be easy to get side-tracked and confused with the notation. In those cases, it is useful to return to the definition of each of the terms in Bayes’ theorem (i.e., \(P(\theta \mid D; I)\), \(P(D \mid I)\), \(P(D \mid \theta;I)\), etc.).

Example: suppose that a friend has a coin that we know comes from a magic store; as a result, we believe that the coin is strongly biased in either of the two directions (it could be a trick coin with both sides being H, for instance), but we don’t know which one it favours. We will express the belief of this prior as a Beta distribution. Let’s say that our friend flips the coin five times; resulting in 4 H and 1 T. What is the posterior distribution of the coin’s fairness \(\theta\)?

Solution: we start with a prior that corresponds with our assumptions, and assume 4 H and 1 T:

post = BernBeta( c(1,1)/100 , c(1,1,1,1,0) )
show(post)

[1] 4.01 1.01

This prior captures our belief that the coin is strongly biased (although we do not know in which direction the bias lies before seeing data). The choice of 0.01 is arbitrary, in a sense; 0.1 would have worked as well, for instance.

The posterior distribution is \[\text{Beta}(\theta \mid 4.01,1.01; I),\] which, as shown above, has its mode essentially at \(1.0\), and not near the mean \(\approx 0.8\). Is the coin indeed biased? In which direction?

How would the answer change if we had no reason to suspect that the coin was biased in the first place?

18.3.4 Maximum Entropy Priors

Whether the priors are uninformative or informative, we search for the distribution that best encodes the prior state of knowledge from a set of trial distributions.

Consider a discrete space \(X\) of cardinality \(M\) with probability density \(P(X)=\mathbf{p}=(p_{1},...,p_{M})\). The entropy of such a \(\mathbf{p}\), denoted by \(H(\mathbf{p})\), is given by \[H(\mathbf{p}) = - \sum^{M}_{i=1} p_{i} \log p_{i}, \quad \text{with }0\cdot\log(0)=0.\] In the case of a continuous p.d.f. \(P(\mathbf{X})=P(X_1,\ldots,X_n)\) on some domain \(\Omega\subseteq \mathbb{R}^n\), the entropy is given by \[H(P)=-\int_{\Omega}P(\mathbf{Z})\log(P(\mathbf{Z}))d\mathbf{Z}.\]

The maximum entropy principle (MaxEnt) states that, given a class of trial distributions with constraints, the optimal prior is the trial distribution with the largest entropy. As an example, the most basic constraint is for \(\mathbf{p}\) to lie in the probability simplex, that is, \(\sum_{i} p_{i} = 1\) and \(p_{i}\geq 0\) for all \(i\) in the discrete case, or \(\int_{\Omega}P(\mathbf{Z})d\mathbf{Z}=1\) and \(P(\mathbf{Z})\geq 0\) on \(\Omega\) in the continuous case.

Example: without constraints, the MaxEnt principle yields a prior which solves the optimization problem: \[\begin{array}{rl} \max & - p_{1} \log p_{1} - \cdots - p_M\log p_M \\ \mbox{s.t.} & p_{1} + \cdots + p_M = 1 \mbox{ and } p_1,\ldots, p_M \geq 0 \end{array}\]

Using the method of Lagrange multipliers, this optimization reduces to \[\mathbf{p}^{*} = \arg_\mathbf{p}\max \{H(\mathbf{p}) - \lambda ( p_{1} + \cdots + p_M - 1)\},\] whose solution is \(\mathbf{p}^{*} \propto \text{constant}.\) Hence, subject to no additional constraints, the uniform distribution is the maximum entropy prior.

Example: How could we use Bayesian analysis to predict the cab waiting time?

“The joke about New York is that you can never get a cab, except when you don’t need a cab, and then there are cabs everywhere” (quote and example from S.DeDeo’s Maximum Entropy Methods tutorial [367]).

At various moments, we head out to the street to hail a cab, and we keep track of how long it took before a cab was available. Perhaps the observations (in minutes) look like this \[6,3,4,6,2,3,2,6,4,4.\] What can you conclude about the waiting time for a New York City cab?

Solution: in the best case scenario a cab is waiting for us as we get to the curb \((j=0)\), while in the worst case scenario (a zombie apocalypse, say?), no cab ever comes \((j\to\infty)\). But can anything else be said?

To use MaxEnt in this situation, we need to find – among all of the trial distributions that could have generated the observed waiting times – the one with the highest entropy. Unfortunately, there are infinitely many such distributions.

We can narrow the search, however, by including a constraint stating that the expected value of the trial distributions should be the same as the mean of the sample: in this case, 4.

The two constraints translate to \[g_1(\mathbf{p})=\sum_{j=0}^{\infty}j\cdot p_j -4=0 \quad \mbox{and}\quad g_2(\mathbf{p})= \sum_{j=0}^{\infty}p_j-1=0,\] where \(p_j\) is the probability of having to wait \(j\) minutes for a cab.

The method of Lagrange multipliers reduces the problem to solving \[\arg_{\mathbf{p}}\max \left\{H(\mathbf{p}) - \lambda_1 g_1(\mathbf{p})-\lambda_2g_2(\mathbf{p})\right\}.\]

This requires solving the gradient equation \[\nabla_\mathbf{p} H(\mathbf{p}) = \lambda_1\nabla_\mathbf{p} g_1(\mathbf{p}) + \lambda_2\nabla_\mathbf{p} g_2(\mathbf{p}),\] which gives rise to equations of the form \[-(\ln p_j +1) = \lambda_1 j + \lambda_2, \quad {j=0,1, \ldots,}\] or simply \(p_j=\exp(-\lambda_1j)\exp(-1-\lambda_2)\) for \(j=0, 1, \ldots\).

Since \[1=\sum_{j=0}^{\infty}p_j = \exp(-1-\lambda_2)\sum_{j=0}^{\infty}\exp(-\lambda_1j),\] we have \[\exp(1+\lambda_2) = \sum_{j=0}^{\infty}\exp(-\lambda_1j)=\frac{1}{1-\exp(-\lambda_1)},\] assuming that $ | (-_1) | <1$.

Similarly, \[4=\sum_{j=0}^{\infty}jp_j = \exp(-1-\lambda_2)\sum_{j=0}^{\infty}j\exp(-\lambda_1j),\] so that \[4\exp(1+\lambda_2) = \sum_{j=0}^{\infty}j\exp(-\lambda_1j)=\frac{\exp(-\lambda_1)}{(1-\exp(-\lambda_1))^2}.\]

Substituting the first of these into the latter, and solving for \(\lambda_1\), we see that \(\lambda_1=\ln (5/4)\). Substituting that result back into the first equation, we further obtain \(\exp(-1-\lambda_2)=\frac{1}{5}\), so that \[p_j = \exp(-1-\lambda_2) \exp(-\lambda_1 j)=\frac{1}{5}\left(\frac{4}{5}\right)^j, j=0,\ldots\]

It is easy to see that this defines a distribution; a “verification” is provided by the following code.

pmf_maxent <- function(x,lambda=4/5) (1-lambda)*(lambda)^x
sum(pmf_maxent(0:100))  # check if it's a distribution
mp <- barplot(pmf_maxent(0:15), ylim=c(0,.25), xlab="waiting minutes")
axis(1,at=mp,labels=paste(0:15))

[1] 1

This distribution could be used as a prior in a Bayesian analysis of the situation. Notice that some information about the data (in this case, only the sample mean) is used to define the MaxEnt prior. Crucially, however, the data that is used to build the MaxEnt prior CANNOT be re-used as part of the likelihood computations. The situation is not unlike that of the training/testing paradigm of machine learning.

References

[365]
Wikipedia, Conjugate priors.”
[366]
J. K. Kruschke, Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan (2nd ed.). Academic Press, 2011.
[367]