Introduction to Bayesian statistics

Below we provide a non-technical introduction to Bayesian statistics. We first explain how hypotheses about population parameters (e.g., population proportions, means, or regression coefficients) can be tested with Bayes factors, and how Bayes factors can be used to obtain the posterior odds of hypotheses. Next, we explain how population parameters can be estimated using posterior distributions, point estimates, and credible intervals. Although a few formulas and computations are included, they are all explained in a non-technical way as well. The introduction is intended for readers without a technical background who would like to gain a basic understanding of Bayesian statistics.

Bayes factor

In statistics, we are often interested in how much evidence observed sample data contain for or against hypotheses about population parameters. For example, we could be interested in whether a population proportion is equal to or different from 0.5, or whether two population means are equal or not. In Bayesian statistics, evidence in data for or against hypotheses can be quantified through Bayes factors. A Bayes factor is a ratio of how likely the observed data are under one hypothesis, to how likely the observed data are under another, competing, hypothesis. The higher the likelihood of the data under the first hypothesis as compared to the likelihood under the second hypothesis, the stronger the evidence in the data for the first hypothesis over the second hypothesis. Likewise, the higher the likelihood of the data under the second hypothesis as compared to the likelihood under the first hypothesis, the stronger the evidence in the data for the second hypothesis over the first hypothesis. When the likelihood of the data is the same under both hypotheses, the data do not support one hypothesis over the other, and the Bayes factor is equal to 1. The amount of evidence the data contain is affected by several factors, including the sample size and the amount of random variation in the data.

We will illustrate the interpretation and computation of the Bayes factor with a simple fictitious example, where we are interested in the proportion of people in the population with a dominant personality. Specifically, suppose that for the U.S. society to be balanced, it is important that 20% of the people are dominant. Too many dominant people would lead to conflict, and too few dominant people would lead to lack of leadership. Therefore, we are interested in whether the proportion of dominant people in the U.S. population is $0.2$ - we will call this the null hypothesis - or different from $0.2$ - the alternative hypothesis. In this introduction, we will denote the population proportion of dominant people by $\pi$. Hence, we can formulate the null and alternative hypotheses as:

$$H_0\!: \pi = 0.2 \qquad \mbox{versus} \qquad H_1\!: \pi \neq 0.2.$$

Since we don't have data on the personality characteristics of all U.S. citizens, we collect a random sample of $10$ people and measure whether they have a dominant personality. In the sample, $X = 4$ out of the $10$ people are dominant - a sample proportion of $4/10 = 0.4.$ We now want to know how likely the observed sample count of $X = 4$ is under the null hypothesis, and how likely it is under the alternative hypothesis. The ratio of these two likelihoods is the Bayes factor and gives us the evidence in the data for the null hypothesis against the alternative hypothesis.

We will start with the likelihood under the null hypothesis. In order to compute the likelihood of the observed sample count under the null hypothesis, we first need to know the probability distribution of the sample count $X$ under the null hypothesis. Since our data are dichotomous (someone is either dominant or not dominant) and randomly drawn from the population, we can say that the sample count $X$ of dominant people follows a binomial($n$,$P$) distribution, where:

  - $n$ is the number of trials, which in our example is the sample size of $10$ people;
  - $P$ is the success probability, i.e. the probability that a randomly sampled person is dominant, which equals $0.2$ if the null hypothesis is true.

The probability distribution of the sample count $X$ under the null hypothesis - the binomial(10, 0.2) distribution - is shown graphically below.

[Interactive graph: distribution of $X$ given that the population proportion is 0.2]

The probability distribution of the sample count $X$ of dominant people under the null hypothesis tells us that, if the null hypothesis that $\pi = 0.2$ were true, the probability of observing $X = 4$ dominant people in our sample is $p(X = 4|\pi = 0.2) = 0.08808.$ You can click on the bar at $X = 4$ in the graph above to verify the probability of $0.08808$ under the null hypothesis.

For readers unfamiliar with the notation: the vertical line $|$ in the expression above means 'given that'. So $p(X = 4|\pi = 0.2)$ means 'the probability that $X = 4$, given that $\pi = 0.2$.'
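
If you would like to verify such probabilities outside the interactive graph, they can also be computed with a few lines of code. Below is a minimal sketch in Python using scipy's binomial distribution; the variable names are ours and not part of the example above.

```python
from scipy.stats import binom

n, x = 10, 4     # sample size and observed number of dominant people
pi_null = 0.2    # population proportion of dominant people under the null hypothesis

# Probability of observing exactly x dominant people out of n if pi = 0.2
likelihood_h0 = binom.pmf(x, n, pi_null)
print(likelihood_h0)  # approximately 0.08808
```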

Next, in order to compute the Bayes factor, we also need to compute the likelihood of the data under the alternative hypothesis. Remember that our alternative hypothesis stated that the population proportion of dominant people $\pi$ is different from $0.2.$ To keep things simple at first, we will, for now, reduce the alternative hypothesis to a single value, and compute a simple likelihood ratio instead of a more complex Bayes factor. So let's say that the alternative hypothesis specifically states that the population proportion of dominant people $\pi$ equals $0.5.$ With the alternative hypothesis defined in this way, the sample count $X$ of dominant people would follow a binomial($n$,$P$) distribution with number of trials $n = 10$ and success probability $P = 0.5$ if the alternative hypothesis were true. The probability distributions of the sample count $X$ of dominant people under both the null and the alternative hypothesis are shown graphically below (press the Reset button if you have already moved the sliders). The probability distribution of the sample count $X$ of dominant people under the alternative hypothesis tells us that, if the alternative hypothesis that $\pi = 0.5$ were true, the probability of observing $X = 4$ dominant people in our sample is $p(X = 4|\pi = 0.5) = 0.20508.$


[Interactive graph: probability distributions of $X$ under $H_0$ and $H_1$, with sliders for $n$, $X$, and the values of $\pi$ under $H_0$ and $H_1$]

Now that we have computed the likelihood of the data under the null hypothesis and the alternative hypothesis, we can compute the ratio of the two likelihoods, which is equal to $L_{01} = 0.08808/0.20508 = 0.43.$ This means that the data are $0.43$ times as likely to occur under the null hypothesis as under the alternative hypothesis. The subscript $01$ in $L_{01}$ indicates that the null hypothesis is in the numerator and the alternative hypothesis is in the denominator. If the likelihood ratio is less than 1, it can ease the interpretation to invert it, i.e. to put the alternative hypothesis in the numerator and the null hypothesis in the denominator. For the example data this results in $L_{10} = 0.20508/0.08808 = 2.33$ ($L_{10}$ can also be computed as $1/L_{01}$). The likelihood ratio $L_{10}$ tells us that the data are $2.33$ times more likely under the alternative hypothesis than under the null hypothesis. That is, the data support the alternative hypothesis by a factor of $2.33$. You can move the sliders in the above graph to see how the probability distributions and likelihood ratio change with different sample sizes $n$, different numbers of dominant people $X$ in the sample, and different values for the population proportion of dominant people $\pi$ under the null and alternative hypothesis.
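
As a quick check of the likelihood ratio computed above, the same steps can be sketched in Python (again using scipy's binomial distribution; the names mirror the $L_{01}$ and $L_{10}$ notation in the text):

```python
from scipy.stats import binom

n, x = 10, 4
pi_null, pi_alt = 0.2, 0.5  # single values for pi under H0 and H1

likelihood_h0 = binom.pmf(x, n, pi_null)  # approximately 0.08808
likelihood_h1 = binom.pmf(x, n, pi_alt)   # approximately 0.20508

L01 = likelihood_h0 / likelihood_h1  # approximately 0.43: evidence for H0 over H1
L10 = 1 / L01                        # approximately 2.33: evidence for H1 over H0
print(L01, L10)
```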

Thus far we have only computed a simple likelihood ratio, where both the null and the alternative hypothesis consisted of a single value for the population proportion of dominant people $\pi.$ We will now move one step further, and let the alternative hypothesis consist of three different values for $\pi.$ Specifically, let's say that the alternative hypothesis states that the population proportion of dominant people $\pi$ may either be equal to $\pi = 0.3,$ to $\pi = 0.4,$ or to $\pi = 0.5.$ The null hypothesis remains the same and states that the population proportion of dominant people $\pi$ is equal to $\pi = 0.2.$ In order to compute the new likelihood ratio, which quantifies the evidence in our observed sample data for the null hypothesis over the new composite alternative hypothesis, we again need to find the likelihood of our sample data under both hypotheses.

The likelihood of our sample data under the null hypothesis hasn't changed, and is still equal to $p(X = 4|\pi = 0.2) = 0.08808.$ The likelihood of our sample data under the composite alternative hypothesis can be computed in the following way:

  1. Compute the likelihood of the data under each value for $\pi$ that is part of the alternative hypothesis.
  2. Compute the weighted average of these likelihoods. The weighted average of likelihoods under the alternative hypothesis is called the marginal likelihood of the data under the alternative hypothesis.

The likelihood of our sample data under each value for $\pi$ that is part of the composite alternative hypothesis can be computed using the binomial($n$,$P$) distribution as before, with $n = 10$ (the sample size) and $P$ equal to $\pi = 0.3$, $\pi = 0.4$, and $\pi = 0.5$ respectively. The respective likelihoods are $$ \begin{aligned} p(X = 4 | \pi = 0.3) &= 0.20012,\\ p(X = 4 | \pi = 0.4) &= 0.25082,\\ p(X = 4 | \pi = 0.5) &= 0.20508. \end{aligned}$$ You can verify this by moving the slider for $H_1$ above to $\pi = 0.3,$ $\pi = 0.4,$ $\pi = 0.5,$ respectively, and clicking on the bar at $X = 4$ in the right hand graph.

The weights that we use to compute the weighted average of the likelihoods under the composite alternative hypothesis should reflect our beliefs - before observing the sample data - about the relative plausibility of the different values of the population proportion $\pi.$ The beliefs may be based on all kinds of considerations, like previous research, expert knowledge, scale boundaries, etc. For our example we will use the weights $0.5$, $0.3$, and $0.2$ for the hypothesized values $\pi = 0.3$, $\pi = 0.4$, and $\pi = 0.5$, respectively. These weighting values reflect the belief that population proportions of dominant people closer to 0.2 are more plausible than population proportions farther away from 0.2. Note that the weighting values add up to 1, and can therefore be thought of as the probabilities of the different hypothesized $\pi$ values being true.

Quantifying the relative plausibility of parameter values as probability is allowed in Bayesian statistics, but not in classical statistics. Classical statistics only allows probability statements about random events that have a long run frequency distribution, like the sample count of 'successes' $X.$ Bayesian statistics, on the other hand, allows probability statements about anything we are uncertain about, including the value of population parameters. This broader definition of probability makes computations possible that are not possible in classical statistics.

With the likelihoods and weighting values defined, we can compute the marginal likelihood of the data under the composite alternative hypothesis as the weighted average of the likelihoods. That is, we compute the sum of the likelihoods multiplied by their weighting values: $$\begin{aligned} p(X = 4|H_1) &= \sum p(X = 4|\pi_i)p(\pi_i)\\ &= (0.20012 \times 0.5) + (0.25082 \times 0.3) + (0.20508 \times 0.2)\\ &= 0.216322.\end{aligned}$$ Now that we have computed the likelihood of our sample data under the null hypothesis and the marginal likelihood of our sample data under the composite alternative hypothesis, we can compute the marginal likelihood ratio for the null versus the composite alternative hypothesis, which is equal to $0.08808/0.216322 = 0.40717.$ Inverting it tells us that the data are $1/0.40717 = 2.45$ times more likely under the composite alternative hypothesis than under the null hypothesis. That is, the data support the composite alternative hypothesis by a factor of $2.45.$
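
The same weighted-average computation can be written out in a short Python sketch; the three candidate values for $\pi$ and their weights are the ones chosen in the text:

```python
import numpy as np
from scipy.stats import binom

n, x = 10, 4
pi_null = 0.2
pi_alt = np.array([0.3, 0.4, 0.5])   # values of pi under the composite alternative hypothesis
weights = np.array([0.5, 0.3, 0.2])  # prior weights for these values; they sum to 1

likelihood_h0 = binom.pmf(x, n, pi_null)                 # approximately 0.08808
marginal_h1 = np.sum(weights * binom.pmf(x, n, pi_alt))  # approximately 0.216322

ratio_01 = likelihood_h0 / marginal_h1  # approximately 0.407
ratio_10 = 1 / ratio_01                 # approximately 2.45: support for the composite alternative
print(marginal_h1, ratio_01, ratio_10)
```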

The principle we used to compute the marginal likelihood of the data under an alternative hypothesis consisting of three different values for the population proportion $\pi,$ can be easily extended to an alternative hypothesis consisting of a continuous range of values. For our initial alternative hypothesis, which stated that the population proportion of dominant people $\pi$ is different from $0.2$, possible values for $\pi$ range from $\pi = 0$ to $\pi = 1$ (a proportion can't be smaller than 0 or larger than 1). The accompanying weighting values are contained in a continuous weighting distribution. This weighting distribution is also known as the prior distribution, because it reflects our beliefs about the relative plausibility of different values for $\pi$ before observing our sample data.

A frequently asked question is why the null value is included in the alternative hypothesis. Keep in mind that the weighting distribution is a continuous distribution, so the area under the weighting distribution at the single null value is zero; including or excluding that value therefore makes no difference. Only ranges of values can have a non-zero area under a continuous distribution.

To find the marginal likelihood of the sample data under our initial alternative hypothesis, we conceptually follow the same procedure as we did with the discrete alternative hypothesis consisting of only three values for $\pi.$ Specifically, we:

  1. Define a likelihood function that gives the likelihood of the data under each value for $\pi$ that is part of the alternative hypothesis.
  2. Compute the weighted average of the likelihoods, where the weights are contained in the weighting distribution. In the continuous case, the weighted average of the likelihoods is computed through a mathematical technique called integration.

A weighting distribution and the likelihood function for our sample data are shown graphically below (press the Reset button if you have already moved the sliders). By clicking at different places on the likelihood function you can see the likelihood of our sample data under different values for the population proportion $\pi.$ The function shows, for example, that $p(X = 4|\pi = 0.3) = 0.20012,$ $p(X = 4|\pi = 0.5) = 0.20508,$ and $p(X = 4|\pi = 0.7) = 0.036757.$ The same likelihoods can be obtained by moving the slider for $H_1$ in the previous graph to $\pi = 0.3,$ $\pi = 0.5,$ and $\pi = 0.7,$ respectively, and clicking on the bar at $X = 4$ in the right hand graph. The weighting distribution is the beta(2, 5) distribution and reflects a prior belief that population proportions $\pi$ closer to 0.2 are more plausible than population proportions farther away from 0.2. It causes the likelihood of the data under $\pi$ values closer to 0.2 to receive more weight, in the computation of the weighted average of the likelihoods (i.e. the marginal likelihood), than the likelihood of the data under $\pi$ values farther away from 0.2.

With the weighting distribution defined as in the graph below, the marginal likelihood of our sample data under the alternative hypothesis is equal to $p(X = 4|H_1) = \int p(X = 4|\pi)p(\pi) d\pi = \mbox{weighted average of the likelihoods} = 0.13112.$ The computation may look a bit intimidating, but is conceptually similar to how we computed the marginal likelihood under the discrete alternative hypothesis consisting of only three possible values for $\pi$.


[Interactive graph: weighting distribution/prior distribution and likelihood function, with sliders for $n$, $X$, the value of $\pi$ under $H_0$, and the parameters $a$ and $b$ of the beta($a$, $b$) weighting distribution/prior distribution]

As before, the likelihood of our sample data under the null hypothesis that $\pi = 0.2$ is equal to $p(X = 4|\pi = 0.2) = 0.08808.$ The marginal likelihood ratio for the null versus the alternative hypothesis is therefore equal to $0.08808/0.13112 = 0.67175.$ This marginal likelihood ratio is the Bayes factor, which we will denote by $B.$ Inverting it tells us that the data are $1/0.67175 = 1.49$ times more likely under the alternative hypothesis than under the null hypothesis. That is, the data support the alternative hypothesis over the null hypothesis by a factor of $1.49.$ You can move the sliders above to see how the weighting distribution, likelihood function, and Bayes factor change with different sample sizes $n$, different numbers of dominant people $X$ in the sample, different values for $\pi$ under the null hypothesis, and different weighting distributions for $\pi$ under the alternative hypothesis.
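
For readers who want to see the integral above made concrete: a minimal Python sketch, assuming the beta(2, 5) weighting distribution shown in the graph, approximates the weighted average of the likelihoods by numerical integration and then forms the Bayes factor.

```python
from scipy import integrate
from scipy.stats import binom, beta

n, x = 10, 4
pi_null = 0.2
a, b = 2, 5  # parameters of the beta weighting (prior) distribution under H1

# Integrand: likelihood of the data at pi, weighted by the prior density at pi
def weighted_likelihood(pi):
    return binom.pmf(x, n, pi) * beta.pdf(pi, a, b)

marginal_h1, _ = integrate.quad(weighted_likelihood, 0, 1)  # approximately 0.13112
likelihood_h0 = binom.pmf(x, n, pi_null)                    # approximately 0.08808

B_01 = likelihood_h0 / marginal_h1  # approximately 0.672
B_10 = 1 / B_01                     # approximately 1.49
print(marginal_h1, B_01, B_10)
```

Because the beta weighting distribution combines neatly with the binomial likelihood, this particular marginal likelihood could also be computed in closed form, but the numerical integral makes the 'weighted average of the likelihoods' interpretation explicit.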

As you may have noticed while changing the weighting distribution for $\pi$ under the alternative hypothesis with the sliders, different weighting distributions result in different marginal likelihoods, which in turn result in different Bayes factors. Hence it is important to choose a reasonable weighting distribution: one that has higher density at parameter values that are a priori more plausible, and lower density at parameter values that are a priori less plausible ('a priori' means before observing your sample data). A weighting distribution under the alternative hypothesis that puts too much (relative) weight on implausible parameter values will result in a Bayes factor that supports the null hypothesis over the alternative hypothesis.
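
To make this sensitivity concrete, the sketch below (a continuation of the previous snippet, with a few hypothetical choices of $a$ and $b$) recomputes the Bayes factor for different beta weighting distributions under the alternative hypothesis:

```python
from scipy import integrate
from scipy.stats import binom, beta

n, x, pi_null = 10, 4, 0.2
likelihood_h0 = binom.pmf(x, n, pi_null)

def bayes_factor_01(a, b):
    """Bayes factor for H0: pi = 0.2 versus H1 with a beta(a, b) weighting distribution."""
    marginal_h1, _ = integrate.quad(
        lambda pi: binom.pmf(x, n, pi) * beta.pdf(pi, a, b), 0, 1)
    return likelihood_h0 / marginal_h1

# Hypothetical weighting distributions: the beta(2, 5) from the text, a flat
# beta(1, 1), and a beta(20, 80) that is tightly concentrated around 0.2
for a, b in [(2, 5), (1, 1), (20, 80)]:
    print(a, b, round(bayes_factor_01(a, b), 3))
```

The first line of output reproduces the Bayes factor of approximately 0.67 from the text; the other lines show how the evidence changes when the weighting distribution is changed.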

When using existing software to compute a Bayes factor, you won't have the freedom to choose any type of weighting distribution you like, because the marginal likelihood can only be computed conveniently for certain families of weighting distributions. However, you will usually have the option to change the parameters of the pre-defined weighting distribution, just like you had in our example above, thereby still allowing you to adapt the shape of the weighting distribution.

Thus far, we have explained how the Bayes factor is computed when there is one unknown population parameter - in our example the population proportion of dominant people $\pi$. In many cases, however, there are one or more nuisance parameters in addition to the parameter of interest. For instance, we could be interested in a population mean $\mu$, in which case the unknown population standard deviation $\sigma$ would be a nuisance parameter. In Bayesian statistics, we place weighting distributions on nuisance parameters just like we do on our parameter of interest, which are then averaged out through integration under both the null and the alternative hypothesis. When using existing software to compute a Bayes factor for your data, nuisance parameters are usually taken care of by your computer program so you don't need to worry about them.

Interval null hypotheses

Sometimes researchers do not care so much about whether a parameter is exactly equal to a specific value, but are more interested in whether the parameter lies inside a certain range of values or outside that range. For instance, a researcher could be interested in whether a standardized mean difference $\delta$ is between $-0.2$ and $0.2$ (a range of effect sizes the researcher considers negligible - the interval null hypothesis) or outside that range (the alternative hypothesis). Although the option is not yet provided by all statistical software, Bayes factors are in principle not restricted to testing point null hypotheses, and can be used to test interval null hypotheses as well.
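
As an illustration of how such a test could work in our running proportion example (this is our own sketch, not the output of any particular software package): with a single beta prior for $\pi$, the Bayes factor for a hypothetical interval null hypothesis $H_0\!: 0.15 \le \pi \le 0.25$ versus $H_1\!: \pi$ outside that range compares the average likelihood of the data inside the interval with the average likelihood outside it.

```python
from scipy import integrate
from scipy.stats import binom, beta

n, x = 10, 4
a, b = 2, 5          # hypothetical beta prior for pi, spanning the whole (0, 1) range
lo, hi = 0.15, 0.25  # hypothetical interval null hypothesis for pi

def joint(pi):
    # Likelihood of the data at pi times the prior density at pi
    return binom.pmf(x, n, pi) * beta.pdf(pi, a, b)

# Average likelihood inside the interval (prior renormalized to the interval)
inside_integral, _ = integrate.quad(joint, lo, hi)
inside_mass = beta.cdf(hi, a, b) - beta.cdf(lo, a, b)

# Average likelihood outside the interval (prior renormalized to the complement)
below, _ = integrate.quad(joint, 0, lo)
above, _ = integrate.quad(joint, hi, 1)
outside_integral = below + above
outside_mass = 1 - inside_mass

bf_interval_01 = (inside_integral / inside_mass) / (outside_integral / outside_mass)
print(bf_interval_01)  # evidence for the interval null versus the alternative
```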

Posterior odds of hypotheses

The Bayes factor testing a null hypothesis against an alternative hypothesis should not be confused with the posterior odds of the null hypothesis being true over the alternative hypothesis being true. The posterior odds are the odds of the null hypothesis being true over the alternative hypothesis being true, after observing the sample data. In contrast, the Bayes factor is the ratio of the likelihood of the sample data under the null hypothesis to the likelihood of the sample data under the alternative hypothesis: $$ \begin{aligned} \mbox{Bayes factor} &= \frac{p(\mbox{data} | H_0)}{p(\mbox{data} | H_1)},\\[5mm] \mbox{posterior odds} &= \frac{p(H_0 | \mbox{data})}{p(H_1 | \mbox{data})}. \end{aligned} $$ We can compute the posterior odds from the Bayes factor, however, by using a formula derived from Bayes theorem. By Bayes theorem: $$ \begin{aligned} p(H_0 | \mbox{data}) &= p(\mbox{data} | H_0) \times p(H_0) / p(\mbox{data}),\\[2mm] p(H_1 | \mbox{data}) &= p(\mbox{data} | H_1) \times p(H_1) / p(\mbox{data}). \end{aligned} $$ From this we can derive that: $$ \begin{aligned} \frac{p(H_0 | \mbox{data})}{p(H_1 | \mbox{data})} &= \frac{p(\mbox{data} | H_0)}{p(\mbox{data} | H_1)} \times \frac{p(H_0)}{p(H_1)},\\[4mm] \mbox{posterior odds} &= \mbox{Bayes factor} \times \mbox{prior odds}. \end{aligned} $$ That is, the posterior odds are equal to the Bayes factor multiplied by the prior odds. The prior odds represent the relative belief in the two hypotheses before observing the sample data and may differ from person to person.

The expression above shows that the Bayes factor indicates by what amount a rational observer should change their prior odds to obtain their posterior odds based on the observed data. For example, suppose that before observing the sample data a person believed that the alternative hypothesis was twice as plausible as the null hypothesis, so that their prior odds $p(H_0)/p(H_1)$ were $1/2$. The sample data result in a Bayes factor of $B_{01} = 0.6/0.1 = 6$, indicating that the data support the null hypothesis over the alternative hypothesis by a factor of 6. The person's posterior odds should then be $\frac{0.6}{0.1} \times \frac{1}{2} = 3$: after seeing the data, the null hypothesis is three times as plausible as the alternative hypothesis. Only when the prior odds are 1 are the posterior odds numerically equal to the Bayes factor.
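
In code, this numerical example is little more than a multiplication (the prior odds of $1/2$ reflect the belief that the alternative hypothesis is twice as plausible as the null hypothesis):

```python
bayes_factor_01 = 0.6 / 0.1   # evidence for H0 over H1, as in the example: 6
prior_odds_01 = 1 / 2         # H1 believed twice as plausible as H0 before seeing the data
posterior_odds_01 = bayes_factor_01 * prior_odds_01
print(posterior_odds_01)      # 3.0: H0 is now three times as plausible as H1
```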

Interpreting the Bayes factor as the amount by which the prior odds should be changed to obtain the posterior odds also gives a useful context for deciding when a Bayes factor is 'large' or 'small'. Whether a Bayes factor implies strong, moderate, or weak evidence depends not only on its value, but also on the prior beliefs it is modifying. A Bayes factor of 0.05 for smoking being unhealthy versus smoking being healthy - that is, evidence favoring 'smoking is healthy' by a factor of 20 - is small if it is modifying prior odds of 1000000 to 1 in favor of smoking being unhealthy: the posterior odds would still be 50000 to 1 that smoking is unhealthy. However, the same Bayes factor of 0.05 is large if it is modifying prior odds of, say, 2: the posterior odds would then be 0.1, i.e. 10 to 1 in favor of smoking being healthy.

Posterior distributions, point estimates, and credible intervals

Thus far, we have only talked about Bayesian hypothesis testing. Often, however, we are also interested in estimating the size of population parameters like population proportions, means, or regression coefficients. In Bayesian statistics, the size of a population parameter is estimated by using both the information in the sample data - reflected in the likelihood function - and our beliefs about the plausibility of different parameter values before observing the data - reflected in the prior distribution of the parameter. The two pieces of information are combined using Bayes theorem, resulting in a distribution that indicates which parameter values are plausible after observing the sample data. This distribution is called the posterior distribution of the parameter. The posterior distribution can in turn be used to obtain a point and interval estimate of the parameter of interest.

We will illustrate the computation and interpretation of posterior distributions, point estimates, and interval estimates with the same fictitious example we used in our discussion of the Bayes factor. There, we were interested in the population proportion $\pi$ of people with a dominant personality. We had a random sample of $n = 10$ people from the population, of whom $X = 4$ were dominant - a sample proportion of $4 / 10 = 0.4$. We use Bayes theorem to combine the information in the sample data and our beliefs about the size of the population proportion before observing the data into the posterior distribution of the population proportion $\pi.$

By Bayes theorem: $$ p(\pi | \mbox{data}) = p(\mbox{data} | \pi) \times p(\pi) / p(\mbox{data}). $$ The term $p(\mbox{data})$ in the denominator is the marginal likelihood of the data and only serves as a normalizing constant. Hence we can simplify the above equation a bit as: $$ \begin{aligned} p(\pi| \mbox{data}) &\propto p(\mbox{data} | \pi) \times p(\pi),\\[2mm] \mbox{posterior probability of } \pi &\propto \mbox{likelihood} \times \mbox{prior probability of } \pi. \end{aligned} $$ The symbol $\propto$ stands for 'proportional to', which means equal up to a multiplicative constant. The above equation shows that the posterior distribution of $\pi$ is proportional to the product of the likelihood function and the prior distribution of $\pi$. Our prior distribution of the population proportion $\pi$, the likelihood function, and the resulting posterior distribution are shown graphically below (press the Reset button if you have already moved the sliders).

[Interactive graph: prior distribution, likelihood function, and posterior distribution, with sliders for $n$, $X$, and the parameters $a$ and $b$ of the beta($a$, $b$) prior distribution]

Note that the prior distribution we use for the computation of the posterior distribution is the same as the weighting distribution/prior distribution we used in our discussion of the Bayes factor, for the computation of the marginal likelihood of the data under the alternative hypothesis. The likelihood functions are also the same. This time, however, we are interested in the product of the likelihood function and the prior distribution, not in the weighted average of the likelihoods.
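
A short Python sketch makes this product explicit: evaluate the likelihood and the beta(2, 5) prior density on a grid of $\pi$ values, multiply them, and rescale so that the result integrates to 1. The grid approach is our own illustration; it simply mirrors the proportionality relation above.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import binom, beta

n, x = 10, 4
a, b = 2, 5  # parameters of the beta prior distribution used above

pi_grid = np.linspace(0, 1, 1001)      # grid of candidate values for pi
likelihood = binom.pmf(x, n, pi_grid)  # p(data | pi) at each grid point
prior = beta.pdf(pi_grid, a, b)        # p(pi) at each grid point

unnormalized = likelihood * prior                            # proportional to the posterior
posterior = unnormalized / trapezoid(unnormalized, pi_grid)  # normalize to total area 1
```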

The posterior distribution shows the relative plausibility of different population proportions $\pi$ after observing the sample data. The mean of the posterior distribution - the posterior mean - can be used as a point estimate of $\pi$. The posterior mean of $\pi$ is represented by a gray vertical line in the middle of the posterior distribution and is equal to $0.353$. The interval containing the middle 95% of the area under the posterior distribution, known as a 95% credible interval, is often used as an interval estimate. It indicates between which values the population parameter lies with 95% probability. The bounds of the 95% credible interval for $\pi$ are represented by red vertical lines in the tails of the posterior distribution, and are equal to $0.152$ (left bound) and $0.587$ (right bound). That is, with 95% probability the population proportion lies between $0.152$ and $0.587.$ You can move the sliders to change the data and prior distribution, and see how this affects the posterior distribution.
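
For this particular model, the numbers reported above can also be verified directly. With a beta(2, 5) prior and 4 'successes' out of 10 observations, the posterior is again a beta distribution, namely beta(2 + 4, 5 + 6) = beta(6, 11) - a standard conjugacy result for a beta prior combined with a binomial likelihood - so the posterior mean and the middle 95% can be read off with scipy:

```python
from scipy.stats import beta

a, b = 2, 5   # prior parameters
n, x = 10, 4  # data: 4 dominant people out of 10

posterior = beta(a + x, b + (n - x))  # beta(6, 11) posterior distribution for pi
print(posterior.mean())               # approximately 0.353, the posterior mean
print(posterior.ppf([0.025, 0.975]))  # approximately [0.152, 0.587], the 95% credible interval
```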

For readers familiar with classical statistics, the Bayesian credible interval may seem quite similar in interpretation to the classical confidence interval. The correct interpretation of the classical confidence interval is, however, quite different from the interpretation of the Bayesian credible interval. Morey et al. (2015) describe common misconceptions about confidence intervals and explain how they actually should be interpreted. We believe that generally, Bayesian credible intervals are more useful in science than classical confidence intervals.

As you may have noticed while changing the sliders, the more we 'flatten out' the prior distribution, the smaller its influence is on the posterior distribution as compared to the influence of the data (set $a$ and $b$ to 1 to make the prior completely flat). Flat or 'non-informative' priors can be used in Bayesian parameter estimation if there is no prior knowledge about the size of the parameter, and the influence of the prior is to be minimized. Flat priors should generally not be used, however, as a weighting function for the parameter of interest in Bayes factors, unless the researcher truly believes that all possible values of the parameter are equally plausible. A weighting distribution under the alternative hypothesis that puts too much weight on unrealistic parameter values will result in a small marginal likelihood, and hence in a Bayes factor that favors the null hypothesis.

In the text above we focused on Bayesian parameter estimation in the situation where there is a single unknown population parameter - the population proportion of dominant people $\pi.$ If there are one or more nuisance parameters in addition to the parameter of interest, we place prior distributions on the nuisance parameters just like we do on our parameter of interest, which are then averaged out through integration. This is equivalent to how we handle nuisance parameters when computing Bayes factors. As with the Bayes factor, when using existing software nuisance parameters will usually be taken care of by your computer program.

References

Morey, R., Hoekstra, R., Rouder, J., Lee, M., & Wagenmakers, E.-J. (2015). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 1–21. http://doi.org/10.3758/s13423-015-0947-8