The Dirichlet distribution is the conjugate prior of the multinomial distribution

Say what? Let's walk through it step by step, starting with a brief recap of Bayesian inference.

Suppose you have a population of people, each with (unknown) probability θ of having a disease. You have some prior assumptions about the distribution of θ (for example, you might say "without any information, I think all values of θ are equally likely"). Now, you draw a sample of N people, out of which x are found to have the disease. From this data, how should you adjust your assumption about the distribution of θ?

You first calculate the probability of drawing this sample (x sick people from N), given θ. This is known as the likelihood, and in this example it's given by a binomial distribution:
P(sample | θ) = (N choose x) * θ^x * (1-θ)^(N-x)
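For concreteness, here's a minimal Python sketch of that likelihood using scipy; the numbers (N = 100 people, x = 10 sick) and the θ values are made up:

```python
from scipy.stats import binom

N, x = 100, 10  # made-up sample: 100 people drawn, 10 of them sick

for theta in [0.05, 0.10, 0.20]:
    # P(sample | theta) = (N choose x) * theta^x * (1 - theta)^(N - x)
    print(f"theta = {theta:.2f}  ->  likelihood = {binom.pmf(x, N, theta):.4f}")
```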
Next, you make explicit your prior assumption about the distribution of θ, simply called the prior. Suppose we think it is uniform:
P(θ) = 1
The uniform distribution also happens to be a special case of the beta distribution, which will be relevant in a moment. Beta is a distribution that takes two parameters, and in particular, Uniform = Beta(1, 1).
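A quick way to convince yourself of that, again using scipy (the θ values below are arbitrary interior points):

```python
import numpy as np
from scipy.stats import beta

thetas = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print(beta.pdf(thetas, 1, 1))  # [1. 1. 1. 1. 1.] -- Beta(1, 1) is flat on [0, 1]
```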

Using Bayes' rule, you can answer the question: how should I update my knowledge about θ, given that sample? That is, given this sample, what should the updated PDF of θ be?
P(θ | sample) = P(sample | θ) * P(θ) / P(sample)
This is called the posterior distribution. Intuitively, if x << N, our estimate of θ should go from being uniform to being skewed toward a low value.

How do we calculate P(sample)? Note that the posterior P(θ | sample) and likelihood P(sample | θ) are both functions of θ, but the former must be a PDF (and thus integrate to 1) while the latter need not be. Because the LHS of the above equation must integrate to 1 over θ, P(sample) is whatever constant makes that happen: the integral over θ of the numerator, P(sample | θ) * P(θ). In general, this is not an easy integral to compute, which is why it's convenient to know about conjugate priors.
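To make that concrete, here's a sketch that brute-forces the posterior on a grid of θ values (same made-up N and x as above; the grid resolution is arbitrary). P(sample) is just the normalizing constant:

```python
import numpy as np
from scipy.stats import binom

N, x = 100, 10                              # same made-up sample as above
thetas = np.linspace(0, 1, 1001)            # grid of candidate theta values
dtheta = thetas[1] - thetas[0]

likelihood = binom.pmf(x, N, thetas)        # P(sample | theta) at each grid point
prior = np.ones_like(thetas)                # uniform prior: P(theta) = 1

# P(sample) = integral over theta of P(sample | theta) * P(theta)
p_sample = np.sum(likelihood * prior) * dtheta

posterior = likelihood * prior / p_sample   # P(theta | sample)
print(np.sum(posterior) * dtheta)           # ~1.0: the posterior is a proper PDF
```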

Suppose we know that "beta is the conjugate prior of binomial." What this means is that if our prior is beta and our likelihood is binomial, then the posterior will still be some beta distribution. Moreover, we can look up the details and find that the parameters of our new beta are:
Beta(a + x, b + N - x)
Where (a, b) are the parameters from the prior beta. In our case those were (1, 1), so our posterior is given by:
P(θ | sample) = Beta(1 + x, 1 + N - x)
Tada! Much simpler than taking the nasty integral. This neat trick explains why we called our prior Beta(1, 1) instead of just Uniform, even though they're the same thing. Even if we didn't have a uniform prior, choosing something in the Beta family makes our lives easier.
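As a sanity check, here's a sketch comparing the conjugate-prior shortcut against the grid-based posterior from above (same made-up N and x):

```python
import numpy as np
from scipy.stats import beta, binom

N, x = 100, 10
a, b = 1, 1                                   # uniform prior = Beta(1, 1)

thetas = np.linspace(0, 1, 1001)
dtheta = thetas[1] - thetas[0]

# the "nasty integral" route: normalize likelihood * prior on a grid
unnormalized = binom.pmf(x, N, thetas)        # flat prior, so this is likelihood * 1
grid_posterior = unnormalized / (np.sum(unnormalized) * dtheta)

# the conjugate-prior shortcut: posterior is Beta(a + x, b + N - x)
conjugate_posterior = beta.pdf(thetas, a + x, b + N - x)

print(np.max(np.abs(grid_posterior - conjugate_posterior)))  # tiny -- they agree
```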

Now consider the multinomial distribution. In the binomial distribution, there are two outcomes (heads or tails; sick or not sick). The multinomial distribution is the generalization to many outcomes. Well, it turns out that the Dirichlet distribution is conjugate prior to multinomial. So in a problem where you know your likelihood is multinomial, it's useful to find a Dirichlet distribution that expresses your prior estimate of your parameters (e.g., the probability of each possible outcome), so that your posterior can be expressed in a simple (Dirichlet) form.


The above are heatmaps of various Dirichlet distributions of three parameters (for when there are three possible outcomes). The triangles are 2D simplexes, which live in 3D space, flattened onto the page:


The green surface is all points (x, y, z) where x + y + z = 1. This is important because if x, y, and z represent the probabilities of the three possible outcomes, their sum should be 1.

Note that Dirichlet(1, 1, 1) (in the upper left) is constant, meaning it's just the Uniform distribution -- where all possible triplets are equally likely. When all three parameters are equal (but not 1), the distribution is symmetrical but concentrated either toward the corners (when the shared parameter is less than 1) or toward the center, where all three coordinates are equal (when it's greater than 1).
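One way to see the corners-versus-center behavior is to draw samples with a small versus a large shared parameter. A minimal numpy sketch (the values 0.1, 1, and 10 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

for a in [0.1, 1.0, 10.0]:
    samples = rng.dirichlet([a, a, a], size=10_000)
    # max coordinate near 1 means the sample sits near a corner;
    # max coordinate near 1/3 means it sits near the center
    print(f"alpha = {a:>4}: mean max coordinate = {samples.max(axis=1).mean():.2f}")
```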

Of course, just because the Dirichlet helps you calculate your posterior when your likelihood is multinomial, this doesn't mean you have to (or even should) use it. It's just another tool in your belt.
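To close the loop on the multinomial case: the bookkeeping is exactly analogous to the Beta/binomial update, with the observed counts added to the prior's parameters. A sketch with a made-up three-outcome experiment:

```python
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])   # Dirichlet(1, 1, 1): uniform over the simplex
counts = np.array([12, 30, 8])            # made-up counts of each of the three outcomes

# Dirichlet is conjugate to multinomial: posterior parameters = prior parameters + counts
alpha_posterior = alpha_prior + counts
print(alpha_posterior)                    # [13. 31.  9.]

# posterior mean estimate of each outcome's probability
print(alpha_posterior / alpha_posterior.sum())
```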
