I'm just starting to learn Bayesian methods for machine learning (from this book and this class). The whole thing has been opaque so far. I want to share my first "aha!" moment.
Suppose we have some data $(x_i, y_i)$ and we want to do a least squares regression where our line (or hyperplane) passes through the origin (for simplicity). That is, we're looking for a function parameterized by some weights, $f_w(x) = w^Tx$, that minimizes the squared error:
$\arg\min_{w} {\Vert y - w^TX \Vert}^2$
Sometimes we also add a regularization term that "penalizes" larger values of $w$; when we have a large number of dimensions, this can help prevent overfitting. Let's use L2 regularization:
$\arg\min_{w} {\Vert y - w^TX \Vert}^2 + \lambda{\Vert w \Vert}^2$
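As a minimal NumPy sketch (not the only way to do this), here is that regularized objective and its closed-form minimizer. The shapes follow the notation above: $X$ is a $d \times n$ matrix with one input per column, $y$ is a length-$n$ vector of labels, and the data are made up purely for illustration.

```python
import numpy as np

# Shapes follow the notation above: X is (d, n) with one example per column,
# y has length n, and w has length d.

def ridge_objective(w, X, y, lam):
    """||y - w^T X||^2 + lam * ||w||^2 -- the regularized objective above."""
    residual = y - w @ X
    return residual @ residual + lam * (w @ w)

def ridge_solution(X, y, lam):
    """Closed-form minimizer: (X X^T + lam I)^{-1} X y."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# Made-up data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))
true_w = np.array([1.0, -2.0, 0.5])
y = true_w @ X + 0.1 * rng.normal(size=50)

w_hat = ridge_solution(X, y, lam=1.0)
print(w_hat, ridge_objective(w_hat, X, y, lam=1.0))
```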
We can also come at this problem from a totally different direction. Namely, let's assume that our y values have actually been drawn from a multivariate normal distribution:
$P(y | w, X) = \mathcal{N}(y | w^TX, \sigma^2I)$
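To make the assumption concrete, here is a minimal sketch of the generative story this describes; the particular $w$, $X$, and noise level $\sigma$ are made up for illustration.

```python
import numpy as np

# Sketch of the generative assumption: each label is the mean w^T x_i
# plus independent Gaussian noise with variance sigma^2.
rng = np.random.default_rng(1)

d, n = 3, 20
sigma = 0.5                      # assumed noise standard deviation
w = np.array([1.0, -2.0, 0.5])   # assumed "true" weights
X = rng.normal(size=(d, n))      # inputs, one column per example

# y ~ N(w^T X, sigma^2 I)
y = w @ X + sigma * rng.normal(size=n)
```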
In words: for given weights $w$ and inputs $X$, $y$ is drawn from a normal distribution with mean $w^TX$ and covariance $\sigma^2I$. Now suppose we want to find the $w$ that is most likely to have produced the inputs and labels we see. That is, we want to find:
$\arg\max_{w} P(w | y, X)$
This is called the "maximum a posteriori (MAP) estimate," because the above distribution serves as the posterior in a Bayesian inference we're going to do. By Bayes' rule, we can write this as:
$\arg\max_{w} \frac{P(y|w,X)P(w|X)}{P(y|X)}$
The denominator does not depend on $w$, so we can drop it:
$\arg\max_{w} P(y|w,X)P(w|X)$
We already worked out $P(y | w, X)$ above. But what is this $P(w|X)$ term? Well, it's reminding us that how likely each weight is given the data depends on how likely we thought each weight was to begin with. Often we just gloss over this by pretending that each choice of $w$ is equally likely a priori.
Although there's no such thing as a uniform distribution over $(-\infty, \infty)$ (because it could not have a finite and nonzero integral), we can kind of pretend there is. Then, multiplying by it would be the same as multiplying by a constant. This would leave the $\arg\max_w$ untouched, so we'd only need to find:
$\arg\max_{w} P(y|w,X)$
This is just the maximum likelihood estimate: the best value of $w$ is precisely the one that would have best generated our labels $y$. We didn't need to do any Bayesian inference for that! But things get a little more interesting if we constrain our $w$ a bit.
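As a quick numerical sanity check of that claim (made-up data, an assumed noise level $\sigma$), ranking a few candidate weight vectors by Gaussian log-likelihood and ranking them by squared error picks out the same winner:

```python
import numpy as np

# The Gaussian log-likelihood is const - ||y - w^T X||^2 / (2 sigma^2),
# so maximizing it over w is the same as minimizing the squared error.

def log_likelihood(w, X, y, sigma):
    n = len(y)
    residual = y - w @ X
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (residual @ residual) / (2 * sigma**2)

def squared_error(w, X, y):
    residual = y - w @ X
    return residual @ residual

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 30))
y = np.array([1.0, -1.0]) @ X + 0.3 * rng.normal(size=30)

candidates = [np.array([1.0, -1.0]), np.array([0.5, -0.5]), np.array([2.0, 0.0])]
best_by_likelihood = max(candidates, key=lambda w: log_likelihood(w, X, y, sigma=0.3))
best_by_error = min(candidates, key=lambda w: squared_error(w, X, y))
print(np.allclose(best_by_likelihood, best_by_error))  # True: same w wins both ways
```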
Now let us choose a non-uniform prior on the weights. In other words, we will assume that the weights we're trying to learn are more likely to have some values than others, regardless of the data. In particular, we will choose a Gaussian prior centered at 0 and with variance $\gamma^2$:
$P(w) = \mathcal{N}(w|0, \gamma^2I)$
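As a small numerical aside (with arbitrary example values for $w$ and $\gamma$, and assuming `scipy` is available), the log-density of this prior is just a constant minus $\frac{1}{2\gamma^2}{\Vert w \Vert}^2$, which is the term that will eventually become the L2 penalty:

```python
import numpy as np
from scipy.stats import multivariate_normal

gamma = 2.0                       # assumed prior standard deviation
w = np.array([1.0, -2.0, 0.5])    # arbitrary example weights
d = len(w)

# log N(w | 0, gamma^2 I) computed by scipy and written out by hand.
log_prior = multivariate_normal.logpdf(w, mean=np.zeros(d), cov=gamma**2 * np.eye(d))
by_hand = -0.5 * d * np.log(2 * np.pi * gamma**2) - (w @ w) / (2 * gamma**2)
print(np.isclose(log_prior, by_hand))  # True
```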
Notice that $P(w)$ does not depend on $X$ (i.e., $P(w|X) = P(w)$), so we can write our argmax above as:
$\arg\max_{w} P(y|w,X)P(w)$
A common trick is to notice that taking $\log(\cdot)$, a monotonically increasing function, doesn't change the argmax, so we have:
$\arg\max_{w} \log (P(y|w,X)P(w))$
$= \arg\max_{w} \log P(y|w,X) + \log P(w)$
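In code, that sum is easy to write down directly. A minimal sketch, assuming `scipy` is available and treating the noise scale `sigma` and prior scale `gamma` as known hyperparameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_posterior_unnormalized(w, X, y, sigma, gamma):
    """log P(y | w, X) + log P(w): Gaussian likelihood plus Gaussian prior."""
    n, d = len(y), len(w)
    log_lik = multivariate_normal.logpdf(y, mean=w @ X, cov=sigma**2 * np.eye(n))
    log_prior = multivariate_normal.logpdf(w, mean=np.zeros(d), cov=gamma**2 * np.eye(d))
    return log_lik + log_prior
```

Only differences in this quantity across values of $w$ matter for the argmax, which is exactly what lets us throw away constants in the next steps.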
Now we plug in the formulas for those Gaussian densities, writing their normalizing constants (which we glossed over before) as $c_1$ and $c_2$:
$= \arg\max_{w} \left[ \log \left( c_1 e^{-\frac{1}{2} (y - w^TX)^T [\sigma^2I]^{-1} (y - w^TX)} \right) + \log \left( c_2 e^{-\frac{1}{2} w^T [\gamma^2I]^{-1} w} \right) \right]$
The constants don't depend on $w$, so we can discard them. We can also simplify the log of the exponentials:
$= \arg\max_{w} \left[ -\frac{1}{2\sigma^2} (y - w^TX)^T (y - w^TX) - \frac{1}{2\gamma^2} w^T w \right]$
Notice that $w^T w$ is just another way of writing ${\Vert w \Vert}^2$ (the squared norm of $w$), and likewise $(y - w^TX)^T (y - w^TX) = {\Vert y - w^TX \Vert}^2$. Also, let us multiply through by $-2\sigma^2$ (flipping the maximization into a minimization in the process):
$= \arg\min_{w} {\Vert y - w^TX \Vert}^2 + \lambda{\Vert w \Vert}^2$
where $\lambda = \frac{\sigma^2}{\gamma^2}$.
And now we see the punchline: the L2 regularization term comes from (or is at least equivalent to) placing a zero-mean Gaussian prior on our weights.
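To see this numerically, here is a rough check under the same assumptions (made-up data, known $\sigma$ and $\gamma$): maximizing the log posterior with a generic optimizer should land on essentially the same $w$ as the ridge closed-form solution with $\lambda = \sigma^2/\gamma^2$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
d, n = 3, 100
sigma, gamma = 0.5, 2.0           # assumed noise and prior standard deviations
X = rng.normal(size=(d, n))
y = np.array([1.0, -2.0, 0.5]) @ X + sigma * rng.normal(size=n)

def neg_log_posterior(w):
    # Negative log posterior up to constants:
    # ||y - w^T X||^2 / (2 sigma^2) + ||w||^2 / (2 gamma^2)
    residual = y - w @ X
    return (residual @ residual) / (2 * sigma**2) + (w @ w) / (2 * gamma**2)

# MAP estimate via generic numerical optimization.
w_map = minimize(neg_log_posterior, np.zeros(d)).x

# Ridge closed form with lambda = sigma^2 / gamma^2.
lam = sigma**2 / gamma**2
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

print(np.allclose(w_map, w_ridge, atol=1e-4))  # True (up to optimizer tolerance)
```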
It may seem like we went through a lot of work just to get a technique we already knew, but it can be an epiphany to realize that some of what seems ad-hoc in ML is actually justified on theoretical grounds. For a more advanced example, you can read about how variational autoencoders can be grounded in a well-known probabilistic framework here.