18.4 Posterior Distributions

The posterior distribution is used to estimate a variety of model parameters of interest, such as the mean, the median, the mode, and so forth.

It is possible to construct credible intervals/regions directly from the posterior (in contrast to the “confidence” intervals of frequentist inference). Given a posterior distribution on a parameter \(\theta\in \mathbb{R}\), a \(1-\alpha\) credible interval C.I. \([L,U]\) is an interval such that \[P(L \leq \theta \leq U\mid D;I)\geq 1-\alpha.\] A similarly-flavoured construction can be used for a joint credible region \(\boldsymbol{\theta}\in \mathbb{R}^n.\)

Because the posterior is a full distribution on the parameters, it is possible to make all sorts of probabilistic statements about their values, such as:

  • “I am 95% sure that the true parameter value is bigger than 0.5”;

  • “There is a 50% chance that \(\theta_{1}\) is larger than \(\theta_{2}\)”;

  • etc.

18.4.1 High-Density Intervals

The most sensible approach is to build the credible interval of \(\theta\)-values using the highest density interval (HDI), i.e., to define a region \(C_k\) in parameter space with \[C_k = \left\{\theta: P(\theta \mid D;I)\geq k\right\},\] where \(k\) is the largest number such that \[\int_{C_k} P(\theta \mid D;I)\, d\theta = 1-\alpha.\]

This typically has the effect of finding the smallest region \(C_k\) (in measure) region meeting the criterion. The value \(k\) can be thought of the height of a horizontal line (or hyperplane, in the case of multivariate posteriors) overlaid on the posterior and whose intersection(s) with the latter define a region over which the integral of the posterior is \(1-\alpha\). In most cases, the value \(k\) can be found numerically.

We have seen some examples of HDIs in the previous sections.

Example: it is an election year and we are interested in knowing whether the general population prefers candidate \(A\) or candidate \(B\). A recently published poll states that of 400 randomly sampled voters, 232 preferred candidate \(A\), while the remainder preferred candidate \(B\).

  1. Suppose that before the poll was published, our prior belief was that the voter preference follows a uniform distribution (a Beta distribution with both parameters equal to 1). What is the 95% HDI on this belief after learning of the poll result?

    Solution: let preference for candidate \(A\) be denoted by 1, and preference for candidate \(B\) by 0. We can think of each voter’s preferance as arising from an independent Bernoulli trial (since the polled voters are selected randomly) and once again use the R function BernBeta().

    post = BernBeta( c(1,1), c( rep(1,232), rep(0,168) ) )
    show(post)

    [1] 233 169

    We see that the posterior distribution’s 95% HDI ranges from 0.531 to 0.628, in favour of candidate \(A\).

  2. Based on the poll, is it credible to believe that the population is equally divided in its preferences among candidates?

    Solution: the 95% HDI from the previous part shows that \(\theta=0.5\) is not among the credible values, hence it is not credible to believe that the population is equally divided in its preferences (at the 95%) level.

  3. Say we conduct a follow-up poll to narrow our estimate of the population’s preference. We randomly sample 100 people and find that 57 prefer candidate \(A\). Assuming that the opinion of people has not changed between polls, what is the 95% HDI on the posterior?

    Solution: using the previous posterior as a new prior, we obtain the following results.

    post = BernBeta( post, c( rep(1,57), rep(0,43) ) )
    show(post)

    [1] 290 212

    The 95% HDI is a bit narrower for preference for candidate \(A\) is a bit nanarrower, from 0.534 to 0.621.

  4. Based on the follow-up poll, is it credible to believe that the population is equally divided in its preferences among candidates?

    Solution: the 95% HDI from the previous results excludes \(\theta=0.5\); both the follow-up poll and the original poll suggest that the population is not equally divided (and actually prefers candidate \(A\)).

18.4.2 Markov Chain Monte Carlo Methods

The true power of Bayesian inference is most keenly felt when the model specifications lead to a posteriors that cannot be manipulated analytically; in that case, it is usually possible to recreate a synthetic (or simulated) set of values that share the properties with a given posterior.

Such processes are known as Monte Carlo simulations. A Markov chain is an ordered, indexed set of random variables (a stochastic process) in which the values of the quantities at a given state depends probabilistically only on the values of the quantities at the preceding state.

Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on the construction of a Markov chain with the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample of the desired distribution.

The quality of the sample improves as a function of the number of steps. MCMC techniques are often applied to solve integration and optimization problems in large-dimensional spaces.

These two types of problem play a fundamental role in machine learning, physics, statistics, econometrics and decision analysis. For instance, given variables \(\boldsymbol{\theta}\in\boldsymbol{\Theta}\) and data \(D\), the following (typically intractable) integration problems are central to Bayesian inference:

  • normalization – in order to obtain the posterior \(P(\boldsymbol{\theta} \mid D;I)\) given the prior \(P(\boldsymbol{\theta}\mid I)\) and likelihood \(P(D\mid \boldsymbol{\theta}; I)\), the normalizing (denominator) factor in Bayes’ theorem needs to be computed \[P( \boldsymbol{\theta} \mid D;I ) = \frac{P(\boldsymbol{\theta}\mid I) P(D\mid \boldsymbol{\theta};I)}{\int_{\boldsymbol{\Theta}} P(D\mid \boldsymbol{\theta};I) P(\boldsymbol{\theta}\mid I) d\boldsymbol{\theta}};\]

  • marginalization – given the joint posterior of \((\boldsymbol{\theta},x)\), we may often be interested in the marginal posterior \[P(\boldsymbol{\theta}\mid D;I) = \int P(\boldsymbol{\theta},x \mid D;I) dx;\]

  • expectation – the final objective of the analysis is often to obtain summary statistics of the form \[E(f(\boldsymbol{\theta})) = \int_{\boldsymbol{\Theta}} f(\boldsymbol{\theta}) P(\boldsymbol{\theta} \mid D;I) d\boldsymbol{\theta}\] for some function of interest (i.e., \(f(\boldsymbol{\theta}) =\boldsymbol{\theta}\) (mean), or \(f(\boldsymbol{\theta}) = (\boldsymbol{\theta}-E(\boldsymbol{\theta}))^{2}\) (variance)).

18.4.3 Metropolis-Hastings Algorithm

The Metropolis-Hastings (MH) algoirthm is a specific type of Monte Carlo process; it is likely among the ten algorithms that have had the greatest influence on the development and practice of science and engineering in recent times.

MH generates a random walk (that is, it generates a succession of posterior samples) in such a way that each step in the walk is completely independent of the preceding steps; the decision to reject or accept the proposed step is also independent of the walk’s history.

Any process for which the current step is independent (forgetful) of the previous states, namely \[P(X_{n+1}=x\mid X_1=x_1,\ldots,X_n=x_n;I)=P(X_{n+1}=x \mid X_n=x_n;I)\] for all \(n\), \(X_j\) and \(x_j\), \(j=1,\ldots, n\), is called a (first order) Markov process, and a succession of such steps is a (first order) Markov chain.

MH uses a candidate or proposal distribution for the posterior, say \(q(\cdot, \boldsymbol{\theta})\), where \(\boldsymbol{\theta}\) is a vector of parameters that is fixed by the user-called tuning parameters; MH then constructs a Markov Chain by proposing a value for \(\boldsymbol{\theta}\) from this candidate distribution, and then either accepting or rejecting this value (with a certain probability).

Theoretically the proposal distributions can be nearly any distribution, but in practice it is recommended that simple ones be selected: a normal if the parameter of interest can be any real number (e.g., \(\mu\)), or a log-normal if it has positive support (e.g., \(\sigma^{2}\)), say.

The Metropolis-Hastings (MH) algorithm simulates samples from a probability distribution by making use of the full joint density function and (independent) proposal distributions for each of the variables of interest.

The first step is to initialize the sample value for each random variable (often obtained by sampling from the variable’s prior distribution). The main loop of the algorithm consists of three components:

  1. generate a candidate sample \(x^*\) from the proposal distribution \(q(x^{(i)}| x^{(i-1)})\);

  2. compute the acceptance probability via the acceptance function \(\alpha(x^*|x^{(i-1)})\) based on the proposal distribution and the full joint density \(\pi(\cdot)\);

  3. accept the candidate sample with probability \(\alpha\), the acceptance probability, or reject it otherwise.

Example (modified from [366]): we use the MH algorithm to “learn” linear model parameters from a dataset. The test data for this example is generated using the following code. First, we establish the true model parameters.

set.seed(0)  # for replicability
t.A <- 10    # true slope
t.B <- 0     # true intercept
t.sd <- 20   # true noise
s.Size <- 50 # sample size

We will use equally spaced \(x\) values:

x <- (-(s.Size-1)/2):((s.Size-1)/2)

The corresponding \(y\) values are such that \(y\sim \mathcal{N}(ax + b,\sigma^2)\):

y <-  t.A * x + t.B + rnorm(n=s.Size,mean=0,sd=t.sd)

The \(x\) values are balanced around zero in order to “de-correlate” the slope and the intercept.

plot(x,y, main="Test Data")

Defining the statistical model. The next step is to specify the statistical model. We already know that the data was created with a linear relationship \(y = ax + b\) together with a normal error model \(\mathcal{N}(0,\sigma^2)\), so we might as well use the same model for the fit and see if we can retrieve our original parameter values. Note however that, in general, the generating model is unknown.

Deriving the likelihood function from the model. A linear model of the form \(y = ax + b + \mathcal{N}(0,\sigma^2)\) takes the parameters \((a, b, \sigma)\) as inputs. The output should be the probability of obtaining the test data under this model: in this case, we only need to calculate the difference between the predictions \(y = ax + b\) and the observed \(y\), and then look up the probability (using dnorm) for such deviations to occur.

likehd <- function(param){
  a = param[1]
  b = param[2]
  sd = param[3]
  pred = a*x + b
  singlelikelihoods = dnorm(y, mean = pred, sd = sd, log = T)
  sumll = sum(singlelikelihoods)
  return(sumll) 
}

For instance, we can find and plot the likelihood profile of the slope as follows:

s.values <- function(x){return(likehd(c(x, t.B, t.sd)))}
s.likehds <- lapply(seq(1/2*t.A, 3/2*t.A, by=.05), s.values )
plot (seq(1/2*t.A, 3/2*t.A, by=.05), s.likehds , type="l", 
      xlab = "values of slope parameter a", ylab = "Log likelihood")

Defining the priors. In Bayesian analysis, the next step is always required: we have to specify a prior distribution for each of the model parameters. To keep things simple, we will use a uniform distribution for the slope, and normal distributions for the noise and the intercept.342

prior <- function(param){
  a = param[1]
  b = param[2]
  sd = param[3]
  aprior = dunif(a, min=0, max=2*t.A, log = T)
  bprior = dnorm(b, mean=t.B, sd = 5, log = T)
  sdprior = dunif(sd, min=0, max=2*t.sd, log = T)
  return(aprior+bprior+sdprior)
}

The posterior. The product of prior by likelihood is the actual quantity that MCMC works with (it is not, strictly speaking, the posterior as it is not normalized).

posterior <- function(param){
  return (likehd(param) + prior(param))
}

Applying the MH algorithm. One of the most frequent applications of MH (as in this example) is sampling from the posterior density in Bayesian statistics.343 The aim of the algorithm is to jump around in parameter space, but in such a way as to have the probability to land at a point be proportional to the function we sample from (this is usually called the target function). In this case, the target function is the posterior that was defined previously.

This is achieved by

  1. starting with a random parameter vector;

  2. choosing a new parameter vector near the old value based on some probability density (the proposal function), and

  3. jumping to this new point with a probability \(\alpha=\min\{1,g(\text{new})/g(\text{old})\}\), where \(g\) is the target.

The distribution of the parameter vectors MH visits converges to the target distribution \(g\).

proposalfunction <- function(param){
  return(rnorm(3,mean = param, sd= c(0.1,0.5,0.3)))
}

run_metropolis_MCMC <- function(startvalue, iterations){
  chain = array(dim = c(iterations+1,3))
  chain[1,] = startvalue
  for (i in 1:iterations){
    proposal = proposalfunction(chain[i,])
    
    probab = exp(posterior(proposal) - posterior(chain[i,]))
    if (runif(1) < probab){
      chain[i+1,] = proposal
    }
    else{
      chain[i+1,] = chain[i,]
    }
  }
  return(chain)
}

startvalue = c(4,1,10) # random choice
chain = run_metropolis_MCMC(startvalue, 10000)

The first steps of the algorithm may be biased by the initialization process; they are usually discarded for the analysis (this is referred to as the burn-in time).

burnIn = 5000
acceptance = 1-mean(duplicated(chain[-(1:burnIn),]))

An interesting output to study is the acceptance rate: how often was a proposal rejected by the MH acceptance criterion? The acceptance rate can be influenced by the proposal function: generally, the nearer the proposal is to the latest value, the larger the acceptance rate.

Very high acceptance rates, however, are usually not beneficial, as this implies that the algorithms is “staying” in the same neighbourhood or point, which results in sub-optimal probing of the parameter space (there is very litte mixing). Acceptance rates between 20% and 30% are considered optimal for typical applications [368].

Finally, we can plot the results (the true parameter values are displayed in red).

par(mfrow = c(2,3))
hist(chain[-(1:burnIn),1],nclass=30, main="Posterior of a")
abline(v = mean(chain[-(1:burnIn),1]))
abline(v = t.A, col="red" )
hist(chain[-(1:burnIn),2],nclass=30, main="Posterior of b") 
abline(v = mean(chain[-(1:burnIn),2]))
abline(v = t.B, col="red" )
hist(chain[-(1:burnIn),3],nclass=30, main="Posterior of sd") 
abline(v = mean(chain[-(1:burnIn),3]) )
abline(v = t.sd, col="red" )
plot(chain[-(1:burnIn),1], type = "l", main = "Chain values of a")
abline(h = t.A, col="red" )
plot(chain[-(1:burnIn),2], type = "l", main = "Chain values of b") 
abline(h = t.B, col="red" )
plot(chain[-(1:burnIn),3], type = "l", main = "Chain values of sd")
abline(h = t.sd, col="red" )

The upper row shows posterior estimates for the slope \(a\), intercept \(b\), and standard deviation of the error \(\sigma\); the lower row shows the Markov Chain of parameter values. We retrieve (more or less) the original parameters that were used to create the data, and there is a certain area around the highest posterior values that also show some support by the data, which is the Bayesian equivalent of confidence intervals.

The posterior distributions plotted above are marginal distributions; the pairwise joint distributions are shown below (again, the true parameter values are shown in red):

plot(chain[-(1:burnIn),1:2], main="Scatter plot of a and b", 
     xlab="Estimates for a", ylab="Estimates for b")
abline(v = t.A, col="red" )
abline(h = t.B, col="red" )

plot(chain[-(1:burnIn),2:3], main="Scatter plot of b and sd", 
     xlab="Estimates for b", ylab="Estimates for sd")
abline(v = t.B, col="red" )
abline(h = t.sd, col="red" )

plot(chain[-(1:burnIn),c(1,3)], main="Scatter plot of a and sd", 
     xlab="Estimates for a", ylab="Estimates for sd")
abline(v = t.A, col="red" )
abline(h = t.sd, col="red" )

By way of comparison, a simple linear regression analysis would yield the following estimates:

summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.067 -12.201  -3.733  14.562  46.192 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.4786     2.6115   0.183    0.855    
x             9.9082     0.1810  54.751   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.47 on 48 degrees of freedom
Multiple R-squared:  0.9842,    Adjusted R-squared:  0.9839 
F-statistic:  2998 on 1 and 48 DF,  p-value: < 2.2e-16

Which method is best, in this context? What are the advantages and disadvantages of each?

References

[366]
J. K. Kruschke, Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan (2nd ed.). Academic Press, 2011.
[368]