A short introduction to the use of improper priors in Bayesian statistics.

While I’ve purchased improperprior.com as my personal website, it seems irresponsible not to write a post giving a short explanation of improper priors.

This post was originally published on 2020-03-28; it was updated on 2021-07-24 to improve formatting and clarity.

In the Bayesian framework we presume our data is generated by some distribution with a given set of parameters. Further, we assume that the parameters themselves are drawn from a distribution called a prior. Typically the prior is set before observing any data (unless you’re into Empirical Bayes), and then observed data is used to generate a new distribution, the posterior, which is said to be a compromise between the prior and the observed data.

This process is powered by Bayes’ theorem. For example, suppose we have data \(Y\) that follow a Binomial distribution, and we would like to estimate the \(\theta\) parameter. Using Bayes’ theorem we would compute:

\[\begin{equation} \tag{1} P(\theta|Y) = \frac{L(Y|\theta)P(\theta)}{P(Y)}. \end{equation}\]

Where:

- \(P(\theta|Y)\) is the posterior probability; this is what we’d like to estimate.
- \(L(Y|\theta)\) is the likelihood of observing the data, given \(\theta\).
- \(P(\theta)\) is a distribution representing a prior guess for \(\theta\) before observing data.
- \(P(Y)\) is the unconditional likelihood of observing the data, commonly called the normalizing constant. We will ignore this term: it is constant once the data have been observed, and it only acts to make sure that the numerator integrates to 1 (i.e. it makes sure the posterior is a *proper* distribution), which, perhaps surprisingly, is unnecessary for posterior estimation.

The canonical prior for data from a binomial distribution is the beta distribution. This is for good reason: the beta distribution has support on \([0, 1]\) (endpoints included or not, depending on the parameters) and is very versatile with respect to its potential shapes.

For this example we’ll use a flat prior, which gives equal weight to all possible values of \(\theta\). The \(\text{Beta}(1, 1)\) distribution does exactly this: it is the uniform distribution on \([0, 1]\).
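To see that \(\text{Beta}(1, 1)\) really is flat, here’s a quick sketch in plain Python (the density is written out by hand rather than pulled from a library):

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density: x^(a-1) (1-x)^(b-1) / B(a, b)."""
    beta_fn = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn

# With a = b = 1 both exponents vanish and B(1, 1) = 1, so the density is 1 everywhere.
print([beta_pdf(x, 1, 1) for x in (0.1, 0.5, 0.9)])  # [1.0, 1.0, 1.0]
```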

The likelihood is just the probability of observing the sampled data, and for binomial data it is itself a binomial distribution. Setting aside constant terms, the likelihood is \(\theta^{\sum_i y_i}(1-\theta)^{n-\sum_i y_i}\). We can complete the numerator of (1) by combining this likelihood with the \(\text{Beta}(1, 1)\) prior, which gives:

\[\begin{equation} \tag{2} P(\theta|Y) \propto \theta^{\sum_i y_i}(1-\theta)^{n-\sum_i y_i} \times \theta^{1-1}(1-\theta)^{1-1} \\ \propto \theta^{\sum_i y_i + 1 - 1}(1-\theta)^{n-\sum_i y_i + 1 - 1}. \end{equation}\]

From this we can see that the posterior follows a \(\text{Beta}(\sum_i y_i + 1, n - \sum_i y_i + 1)\) distribution. Intuitively, because we used a prior that carried little information (effectively saying only that \(\theta\) could be any value in \((0, 1)\)), our posterior estimate for \(\theta\) is almost entirely determined by the data.
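As a quick sanity check, here’s a minimal sketch of this conjugate update in plain Python; the counts below are made-up illustrative values:

```python
# Made-up data: n Bernoulli trials with `successes` observed successes.
n, successes = 20, 14

# Flat Beta(1, 1) prior.
a_prior, b_prior = 1, 1

# Conjugate update: posterior is Beta(successes + a, n - successes + b).
a_post = successes + a_prior
b_post = n - successes + b_prior

# Posterior mean a / (a + b) barely shrinks the sample proportion 14/20 = 0.7.
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(posterior_mean, 4))  # 15 7 0.6818
```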

Here’s a simple Shiny app to let you play with this model and build some intuition as to how the prior parameters and data interact.

So far this seems like a great framework, but the requirement that the prior be a proper distribution can be quite restrictive. Consider the normal distribution (for simplicity, assume known \(\sigma^2\)): \(\mu\) can take *any* real value, so how can we use a flat prior with equal probability for each possible value of \(\mu\)?

While powerful in specific cases, Bayesian modeling is rather limited if we can only use proper distributions as priors.

Naively, what would happen if we just set the prior probability of each \(\mu\) to 1? Well, it turns out that we can do exactly this: we can use any prior, even an *improper prior* (one that does not integrate to 1), as long as the posterior comes out to be a proper distribution.

Choosing an improper prior that generates a valid posterior can be a tricky affair, but Jeffreys’ prior is a good place to start. Continuing the normal example, we will just use a prior probability of 1 for every value of \(\mu\); this is actually proportional to the Jeffreys’ prior for this setup. As with the previous example, we will set aside all constant terms:

\[ P(\mu|Y) \propto \exp{\left[-\frac{1}{2} \left(\frac{y-\mu}{\sigma}\right)^2\right]} \times 1. \]

The right-hand side is, up to a constant, a normal density in \(\mu\), so the posterior is just a proper normal distribution!
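For several observations the same algebra gives \(\mu \mid y \sim \mathcal{N}(\bar{y},\, \sigma^2/n)\). A minimal sketch of that posterior, assuming a known \(\sigma\) and made-up data:

```python
import math

# Made-up observations; sigma is assumed known for this sketch.
y = [4.1, 5.3, 3.8, 4.9, 5.0]
sigma = 2.0
n = len(y)

# With a flat (improper) prior on mu, the posterior is proper:
# mu | y ~ Normal(mean(y), sigma^2 / n).
post_mean = sum(y) / n
post_sd = sigma / math.sqrt(n)
print(round(post_mean, 3), round(post_sd, 3))  # 4.62 0.894
```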

It is of course possible to go much deeper into improper priors, particularly into choosing a good one, but as far as the core concept goes, this is mostly all there is to it!

Craig Gidney has a nice blog post walking through a slightly more technical example of improper priors. Likewise, Andy Jones has a great podcast with a few additional examples. For a more general treatment, A First Course in Bayesian Statistical Methods by Hoff and Bayesian Data Analysis by Gelman et al. are the standard introductory Bayesian statistics textbooks.

As ever, Wikipedia has very detailed articles on priors, more suitable for reference than learning.

For attribution, please cite this work as

Ewing (2020, March 16). Improper Prior | Ben Ewing: What is an improper prior?. Retrieved from https://improperprior.com/posts/2020-03-16-what-is-an-improper-prior/

BibTeX citation

@misc{ewing2020what,
  author = {Ewing, Ben},
  title = {Improper Prior | Ben Ewing: What is an improper prior?},
  url = {https://improperprior.com/posts/2020-03-16-what-is-an-improper-prior/},
  year = {2020}
}