What is Maximum Likelihood Estimation (MLE)? It's simple, but there are some gotchas.
First, let's recall what likelihood is.
Suppose you have a coin with unknown bias b, which is the probability of it landing heads. Suppose we flip it four times and get some outcome x. We can define a function p(x, b) which is a joint probability distribution: it tells us the probability (density) of getting outcome x if the bias is b. We can do a couple of things with it (both are sketched in a short code example after this list):
1) Hold b constant and treat x as a variable. For example, given that the coin is fair (b=0.5), what are the probabilities for the various outcomes (x=HHHH, x=HHHT, ..., x=TTTT)?
2) Hold x constant and treat b as a variable. For example, given that we flipped HHHT, how likely are the various biases? Intuitively, it seems very unlikely b=0.1 (i.e., that the coin is heavily biased toward tails).
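To make the two readings concrete, here is a minimal Python sketch. This is my own illustration, not part of the original post; the helper name p_outcome and the particular numbers are just assumptions for demonstration.

```python
# Minimal sketch of the two ways to read p(x, b); `p_outcome` is a hypothetical helper.
from itertools import product
from math import prod

def p_outcome(x, b):
    """Probability of one specific sequence x (e.g. 'HHHT') given heads-probability b."""
    return prod(b if flip == "H" else 1 - b for flip in x)

# (1) Hold b constant, vary x: the probabilities of all 16 length-4 outcomes sum to 1.
outcomes = ["".join(seq) for seq in product("HT", repeat=4)]
print(sum(p_outcome(x, 0.5) for x in outcomes))   # 1.0

# (2) Hold x constant, vary b: the likelihood of each candidate bias given x = 'HHHT'.
for b in (0.1, 0.5, 0.75):
    print(b, p_outcome("HHHT", b))                # b = 0.75 scores highest of these
```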
We call the first a probability and the second a likelihood. This isn't just stats lingo: only the first is a genuine probability distribution, since the second, viewed as a function of b with x held fixed, does not generally integrate to one (see the Notes below). The terminology is made more confusing by Bayes' formula:
p(b | x) = p(x | b) * p(b) / p(x)

Why do we call p(x | b) the "likelihood" when it is read "p of x given b" (suggesting that b is given, as in (1) above)? Because b is not actually given. The way we normally use Bayes' rule, we have some fixed outcome x, and treat p(x | b) as a function of b. We are asking "what is the likelihood of x, given b, for all values of b?"
The Maximum Likelihood Estimate is the value of b that was most likely to produce x. In our above example with HHHT, it is intuitive that b=3/4 is the bias most likely to produce that outcome, but how can we prove this?
The probability of flipping m heads amongst n tosses with bias b is:
b^m * (1-b)^(n-m) * nCm

Using our particular outcome of 3 heads amongst 4 flips, we get:
f(b) = b^3 * (1-b)^1 * 4C3
f(b) = (b^3 - b^4) * 4

Feel free to work out the (simple) calculus yourself and see that this is maximized when b = 3/4. Thus it is our maximum likelihood estimate.
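If you'd rather not do the derivative by hand, here is a small sympy check. It is my own sketch, not part of the original post, and simply verifies that the only critical point of f in (0, 1) is b = 3/4.

```python
# Sketch: symbolically maximize f(b) = 4 * (b^3 - b^4) with sympy.
import sympy as sp

b = sp.symbols("b")
f = 4 * (b**3 - b**4)               # likelihood of HHHT as a function of the bias b

critical_points = sp.solve(sp.diff(f, b), b)          # solve f'(b) = 12b^2 - 16b^3 = 0
best = max(critical_points, key=lambda point: f.subs(b, point))
print(best, f.subs(b, best))        # 3/4, 27/64
```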
Notes
- p(x, b) is a function of both b and x, and is a probability density function (it integrates to one).
- p(x | b) is a function of b and x, but (by design) is only a PDF when b is fixed. That is, when we fix a bias, we want the possible outcomes given that bias to integrate to one (a quick numeric check follows after these notes).
- Similarly, p(b | x) is a function of b and x, but is only a PDF when x is fixed.
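Here is a quick numeric illustration of the second note. This is my own sketch, not from the post; the `likelihood` helper and the grid of biases are just assumptions for demonstration. For a fixed bias the outcome probabilities sum to one, but for a fixed outcome the same expression does not integrate to one over b.

```python
# Sketch: p(x | b) normalizes over outcomes x, but not over biases b.
import numpy as np
from math import comb

def likelihood(b, heads, flips=4):
    """p(x | b) for one specific sequence with `heads` heads in `flips` tosses."""
    return b**heads * (1 - b)**(flips - heads)

# Fix b: summing over all 2^4 sequences (grouped by number of heads) gives 1.
b = 0.3
print(sum(comb(4, h) * likelihood(b, h) for h in range(5)))   # 1.0

# Fix x = HHHT: integrating over b in [0, 1] gives ~0.05, so this is not a PDF in b.
bs = np.linspace(0, 1, 100_001)
print(likelihood(bs, 3).mean())     # crude estimate of the integral of b^3 * (1-b) db = 1/20
```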
Because the posterior p(b | x) is "the probability of b given x," it may seem natural to maximize it to get the most likely value of b. But that is a different estimator from the MLE: maximizing the posterior gives the maximum a posteriori (MAP) estimate.
On the other hand, if p(b) = 1 (i.e., our prior is uniform), notice that we can simplify Bayes' rule:
p(b | x) = p(x | b) * p(b) / p(x)

And since p(x) is constant for fixed x, we have that:

p(b | x) = p(x | b) / p(x)
p(b | x) = p(x | b) * k

where k = 1/p(x) does not depend on b. In other words, the posterior is a scaled version of the likelihood. Therefore, their maxima (w.r.t. b) coincide. This means that if you don't assume anything about the parameter (i.e., the prior is uniform) and want to know what its MLE is, you can just as well take its MAP.
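As a sanity check, here is the same argument carried out numerically on a grid of biases. This is my own sketch, not from the post; the grid size and the use of HHHT are assumptions for demonstration.

```python
# Sketch: with a uniform prior, the MAP and MLE coincide.
import numpy as np

bs = np.linspace(0, 1, 10_001)
likelihood = bs**3 * (1 - bs)            # p(x | b) for x = HHHT, up to the constant 4C3
prior = np.ones_like(bs)                 # uniform p(b) = 1 on [0, 1]

unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.mean()   # divide by p(x), approximated by the grid average

print(bs[np.argmax(likelihood)])   # MLE ~= 0.75
print(bs[np.argmax(posterior)])    # MAP ~= 0.75 -- same point
```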