Maximum Likelihood Estimation for dummies

What is Maximum Likelihood Estimation (MLE)? It's simple, but there are some gotchas.

First, let's recall what likelihood is.

Suppose you have a coin with unknown bias b, which is the probability of it landing heads. Suppose we flip it four times and get some outcome x. We can define a function p(x, b) which is a joint probability distribution: it tells us the probability (density) of getting outcome x if the bias is b. We can do a couple of things with it:

1) Hold b constant and treat x as a variable. For example, given that the coin is fair (b=0.5), what are the probabilities for the various outcomes (x=HHHH, x=HHHT, ..., x=TTTT)?

2) Hold x constant and treat b as a variable. For example, given that we flipped HHHT, how likely are the various biases? Intuitively, it seems very unlikely b=0.1 (i.e., that the coin is heavily biased toward tails).

We call the first a probability and the second a likelihood. (Careful: despite the name, the likelihood is not a probability distribution over b -- as a function of b it needn't integrate to one; see the notes below.) Beyond that, I'm not sure there's any particular reason for the naming -- it seems like just stats lingo. This is made more confusing by Bayes' formula:
p(b | x) = p(x | b) * p(b) / p(x)
Why do we call p(x | b) the "likelihood" when it is read "p of x given b" (suggesting that b is given, as in (1) above)? Because b is not actually given. The way we normally use Bayes' rule, we have some fixed outcome x, and treat p(x | b) as a function of b. We are asking "what is the likelihood of x, given b, for all values of b?"

The Maximum Likelihood Estimate is the value of b that was most likely to produce x. In our above example with HHHT, it is intuitive that b=3/4 is the bias most likely to produce that outcome, but how can we prove this?

The probability of flipping m heads amongst n tosses with bias b is:
b^m * (1-b)^(n-m) * nCm
Using our particular outcome of 3 heads amongst 4 flips, we get:
f(b) = b^3 * (1-b)^1 * 4C3 
f(b) = (b^3 - b^4) * 4
Feel free to work out the (simple) calculus yourself and see that this is maximized when b = 3/4. Thus it is our maximum likelihood estimate.
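For the record, here is the derivative worked out:
f'(b) = 4 * (3b^2 - 4b^3) = 4b^2 * (3 - 4b)
which is zero at b = 0 and b = 3/4. Since f(0) = f(1) = 0 and f(3/4) > 0, the maximum is indeed at b = 3/4.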

Notes

  • p(x, b) is a function of both b and x, and is a probability density function (it integrates to one).
  • p(x | b) is a function of b and x, but (by design) is only a PDF when b is fixed. That is, when we fix a bias, we want the possible outcomes given that bias to integrate to one.
  • Similarly, p(b | x) is a function of b and x, but is only a PDF when x is fixed.

Because the posterior p(b | x) is "the probability of b given x," it may seem natural to maximize it and call the result the MLE. But maximizing the posterior is a different thing: it gives the maximum a posteriori (MAP) estimate.

On the other hand, if p(b) = 1 (i.e., our prior is uniform), notice that we can simplify Bayes' rule:
p(b | x) = p(x | b) * p(b) / p(x)
p(b | x) = p(x | b) / p(x)
And since p(x) is constant for fixed x, we have that:
p(b | x) = p(x | b) * k
In other words, the posterior is a scaled version of the likelihood. Therefore, their maxima (w.r.t. b) coincide. This means that if you don't assume anything about the parameter (i.e., the prior is uniform) and want to know what its MLE is, you can just as well take its MAP.
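If you want to see this numerically, here's a tiny sketch (assuming numpy) for our HHHT example. It grids over candidate biases and confirms that the likelihood and the flat-prior posterior peak at the same place:
 import numpy as np

 b = np.linspace(0.001, 0.999, 999)   # candidate biases
 likelihood = 4 * b**3 * (1 - b)      # p(HHHT | b) = 4C3 * b^3 * (1-b)^1
 prior = np.ones_like(b)              # "uniform" prior, p(b) = 1
 posterior = likelihood * prior       # proportional to p(b | HHHT)

 print(b[likelihood.argmax()])        # ~0.75 -- the MLE
 print(b[posterior.argmax()])         # ~0.75 -- the MAP under a flat prior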


Why the sample variance is unbiased

Have you ever wondered why the sample variance has an $n-1$ in the denominator?

$\hat{\sigma^2} = \frac{1}{n-1} \sum_{i=1}^{n}(X_i - \bar{X})^2$

This video works it out. I'll write out the math here.

First, remember that by linearity of expectation:
$E(\Sigma X) = \Sigma E(X)$
$E(cX) = cE(X)$

And also that the variance of a random variable can be written in terms of expectation like so:
$Var(X) = E(X^2) - E(X)^2$

Rewriting this, we get:
$E(X^2) = Var(X) + E(X)^2$
$E(X^2) = \sigma^2 + \mu^2$

Also recall that for independent samples $Var(\bar{X}) = \frac{\sigma^2}{n}$, and so:
$Var(\bar{X}) = E(\bar{X}^2) - E(\bar{X})^2$
$E(\bar{X}^2) = Var(\bar{X}) + E(\bar{X})^2$
$E(\bar{X}^2) = \frac{\sigma^2}{n} + \mu^2$

Now let's work out the expectation of the sample variance.
$E[\hat{\sigma^2}] = E[\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}]$

We are going to confirm that it's equal to the true variance. Let's just consider the numerator for now. We'll also leave off the summation bounds for ease of reading (and typing...):
$E[\Sigma(X_i - \bar{X})^2]$
$E[\Sigma(X_i^2 - 2X_i\bar{X} + \bar{X}^2)]$
$E[\Sigma(X_i^2) - \Sigma(2X_i\bar{X}) + \Sigma(\bar{X}^2)]$

Because $\bar{X}$ is constant w.r.t. the sum, we can pull it out of the second and third summations:
$E[\Sigma(X_i^2) - 2\bar{X}\Sigma(X_i) + n\bar{X}^2]$

Now since the sample mean $\bar{X} = \frac{\Sigma X_i}{n}$, we can replace $\Sigma X_i = n\bar{X}$:
$E[\Sigma(X_i^2) - 2\bar{X}n\bar{X} + n\bar{X}^2]$
$E[\Sigma(X_i^2) - 2n\bar{X}^2 + n\bar{X}^2]$
$E[\Sigma(X_i^2) - n\bar{X}^2]$
$\Sigma E(X_i^2) - n E(\bar{X}^2)$

And now using the expectations we worked out earlier:
$\Sigma(\sigma^2 + \mu^2) - n (\frac{\sigma^2}{n} + \mu^2)$
$n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2$
$n\sigma^2 - \sigma^2$
$(n-1)\sigma^2$

So that works out the numerator. When we divide by $(n-1)$, we are indeed left with the variance, $\sigma^2$, making the sample variance $\hat{\sigma^2}$ (as defined above) an unbiased estimator.
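If you'd like a numerical check, here's a small simulation sketch (assuming numpy; exact values will vary with the seed). Dividing by n systematically underestimates the variance, while dividing by n-1 does not:
 import numpy as np

 rng = np.random.default_rng(0)
 mu, sigma, n, trials = 3.0, 2.0, 5, 200000

 samples = rng.normal(mu, sigma, size=(trials, n))
 var_n  = samples.var(axis=1, ddof=0)    # divide by n
 var_n1 = samples.var(axis=1, ddof=1)    # divide by n - 1

 print(var_n.mean())    # roughly (n-1)/n * sigma^2 = 3.2 -- biased low
 print(var_n1.mean())   # roughly sigma^2 = 4.0 -- unbiased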



Bayesian approach to regularization: an introduction

I'm just starting to learn Bayesian methods for machine learning (from this book and this class). The whole thing has been opaque so far. I want to share my first "aha!" moment.

Suppose we have some data $(x_i, y_i)$ and we want to do a least squares regression where our line (or hyperplane) passes through the origin (for simplicity). That is, we're looking for a function parameterized by some weights, $f_w(x) = w^Tx$, that minimizes the squared error:

$\arg\min_{w} {\Vert y - w^TX \Vert}^2$

Sometimes we also add a regularization term. We "penalize" bigger w values. When we have a large number of dimensions, this can prevent overfitting. Let's use L2 regularization:

$\arg\min_{w} {\Vert y - w^TX \Vert}^2 + \lambda{\Vert w \Vert}^2$

We can also come at this problem from a totally different direction. Namely, let's assume that our y values have actually been drawn from a multivariate normal distribution:

$P(y | w, X) = \mathcal{N}(y | w^TX, \sigma^2I)$

That is, for given weights w and input x, y is drawn from the normal with mean $w^Tx$ and covariance $\sigma^2I$. Now suppose we want to find the most probable w given the inputs and labels we see. That is, we want to find:

$\arg\max_{w} P(w | y, X)$

This is called the "maximum a posteriori (MAP) estimate," because the above distribution serves as the posterior in a Bayesian inference we're going to do. By Bayes' rule, we can write this as:

$\arg\max_{w} \frac{P(y|w,X)P(w|X)}{P(y|X)}$

The denominator does not depend on $w$, so we can drop it:

$\arg\max_{w} P(y|w,X)P(w|X)$

We already worked out $P(y | w, X)$ above. But what is this $P(w|X)$ term? Well, it's reminding us that how likely each weight is given the data depends on how likely we thought each weight was to begin with. Often we just gloss over this by pretending that each choice of w is equally likely a priori.

Although there's no such thing as a uniform distribution over $(-\infty, \infty)$ (because it could not have a finite and nonzero integral), we can kind of pretend there is. Then, multiplying by it would be the same as multiplying by a constant. This would leave the argmax over $w$ untouched, so we'd only need to find:

$\arg\max_{w} P(y|w,X)$

This is just saying that the best value of $w$ is precisely the one that would have best generated our labels $y$. We didn't need to do any Bayesian inference for that! But things get a little more interesting if we constrain our $w$ a bit.

Now let us choose a non-uniform prior on the weights. In other words, we will assume that the weights we're trying to learn are more likely to have some values than others, regardless of the data. In particular, we will choose a Gaussian prior centered at 0 and with variance $\gamma^2$:

$P(w) = \mathcal{N}(w|0, \gamma^2I)$

Notice that $P(w)$ does not depend on $X$ (i.e., $P(w|X) = P(w)$), so we can write our argmax above as:

$\arg\max_{w} P(y|w,X)P(w)$

A common trick is to notice that taking $\log(\cdot)$ doesn't change the argmax of a function (since $\log$ is monotonically increasing), so we have:

$\arg\max_{w} \log (P(y|w,X)P(w))$
$= \arg\max_{w} \log P(y|w,X) + \log P(w)$

Now we plug in the formulas for those distributions (with their normalizing constants that we left off before):

$= \arg\max_{w} \left[ \log \left( c_1 e^{-\frac{1}{2} (y - w^TX)^T [\sigma^2I]^{-1} (y - w^TX)} \right) + \log \left( c_2 e^{-\frac{1}{2} w^T [\gamma^2I]^{-1} w} \right) \right]$

The constants don't depend on w, so we can discard them. We can also simplify the log of the exponential:

$= \arg\max_{w} [-\frac{1}{2\sigma^2} \cdot (y - w^TX)^T (y - w^TX)] +  [-\frac{1}{2\gamma^2} \cdot w^T w]$

Notice that $w^T w$ is just another way of writing ${\Vert w \Vert}^2$ (the norm of w squared). Also, let us multiply through by $-2\sigma^2$ (flipping it into a minimization problem in the process):

$= \arg\min_{w} {\Vert y - w^TX \Vert}^2 + \lambda{\Vert w \Vert}^2$

Where $\lambda = \frac{\sigma^2}{\gamma^2}$.

And now we see the punchline: the L2 regularization term comes from (or is at least equivalent to) enforcing a normal prior on our weights.
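As a sanity check, here's a small numerical sketch (assuming numpy and scipy, and using the y ≈ Xw convention with X of shape (n, d) rather than the $w^TX$ notation above). The closed-form ridge solution with $\lambda = \sigma^2/\gamma^2$ matches the MAP weights found by directly minimizing the negative log posterior:
 import numpy as np
 from scipy.optimize import minimize

 rng = np.random.default_rng(0)
 n, d = 50, 3
 sigma, gamma = 0.5, 1.0
 lam = sigma**2 / gamma**2

 X = rng.normal(size=(n, d))
 w_true = np.array([1.0, -2.0, 0.5])
 y = X @ w_true + rng.normal(scale=sigma, size=n)

 # Ridge / regularized least squares, closed form: (X^T X + lam*I)^-1 X^T y
 w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

 # Negative log posterior, dropping constants:
 #   ||y - Xw||^2 / (2 sigma^2) + ||w||^2 / (2 gamma^2)
 def neg_log_posterior(w):
     return (np.sum((y - X @ w)**2) / (2 * sigma**2)
             + np.sum(w**2) / (2 * gamma**2))

 w_map = minimize(neg_log_posterior, np.zeros(d)).x
 print(np.max(np.abs(w_ridge - w_map)))   # tiny -- the two estimates coincide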

It may seem like we went through a lot of work just to get a technique we already knew, but it can be an epiphany to realize that some of what seems ad-hoc in ML is actually justified on theoretical grounds. For a more advanced example, you can read about how variational autoencoders can be grounded in a well-known probabilistic framework here.

Fast Gaussian blurs and picking random unit vectors

Suppose you have an n-dimensional image that you want to Gaussian blur. This can be achieved by convolving each pixel with a Gaussian kernel. If the kernel has side-length k, then each pixel requires k^n multiplications and additions to convolve (it's just a weighted sum). Can we do it faster?

Now consider another problem: how do you pick a unit vector uniformly randomly from the surface of the n-dimensional unit sphere? That is, how do you pick a random direction in n-d space? Give that a few moments' thought.

It turns out that both problems can be solved by noticing that an n-d Gaussian is the product of n 1-d Gaussians.

For the first problem, consider what happens if instead of doing an n-d convolution, we do n 1-d convolutions sequentially (once along each dimension). Let's say I'm the pixel at (0, 0, ..., 0). Let's compute how much the pixel p at (x1, x2, ..., xn) contributes to me this way.

After the first convolution, it will have contributed to pixel (0, x2, ..., xn), in the amount of f(x1) (where f is the Gaussian). On the second convolution, that pixel will contribute to (0, 0, x3, ..., xn) in the amount of f(x2). The amount of contribution from p is now f(x1) * f(x2). After doing all n dimensions, the contribution will be f(x1) * f(x2) * ... * f(xn). Since $f(x) = e^{-x^2}$:

$$\begin{eqnarray}
f(x_1) \cdot f(x_2) \cdot \ldots \cdot f(x_n) \\
= e^{-x_1^2} \cdot e^{-x_2^2} \cdot \ldots \cdot e^{-x_n^2} \\
= e^{-(x_1^2 + x_2^2 + \ldots + x_n^2)}
\end{eqnarray}$$

What about if we had done the whole n-d convolution at once? Then the contribution from p would have been given by the n-dimensional Gaussian, defined by $f(x_1, x_2, ..., x_n) = e^{-(x_1^2 + x_2^2 + \ldots + x_n^2)}$. This matches what we got using the first technique, because an n-d Gaussian is equivalent to the product of n 1-d Gaussians. The same thing should work for any function $f(x)$ such that $f(x) \cdot f(y) = f(\sqrt{x^2 + y^2})$.
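Here's a quick numerical check of the equivalence (a sketch assuming scipy, whose gaussian_filter is, as far as I know, already implemented this way internally):
 import numpy as np
 from scipy.ndimage import gaussian_filter, gaussian_filter1d

 img = np.random.rand(32, 32, 32)              # a small 3-d "image"

 blurred_nd = gaussian_filter(img, sigma=2.0)  # blur all dimensions at once

 blurred_sep = img
 for axis in range(img.ndim):                  # n separate 1-d blurs, one per axis
     blurred_sep = gaussian_filter1d(blurred_sep, sigma=2.0, axis=axis)

 print(np.allclose(blurred_nd, blurred_sep))   # True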

Now, how do we pick a vector uniformly at random from the surface of the n-d unit sphere? We pick each coordinate independently from the Gaussian and normalize. Why does this work? Consider the probability of picking the point $(x_1, x_2, ..., x_n)$. The probability (density) of picking the value $x_k$ for coordinate $k$ is proportional to $e^{-x_k^2}$. Because we picked the coordinates independently, the overall density is the product:

$$\begin{eqnarray}
e^{-x_1^2} \cdot e^{-x_2^2} \cdot \ldots \cdot e^{-x_n^2} \\
= e^{-(x_1^2 + x_2^2 + \ldots + x_n^2)}
\end{eqnarray}$$

That should look familiar. Notice that this is a function only of the length of the vector $(x_1, x_2, ..., x_n)$. Because the density depends only on the length, it is the same in every direction, so after normalizing the vector to unit length we get a direction that is uniform on the sphere.
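In code (a sketch, assuming numpy), the whole recipe is:
 import numpy as np

 def random_unit_vector(n, rng=None):
     if rng is None:
         rng = np.random.default_rng()
     v = rng.normal(size=n)        # each coordinate from a 1-d Gaussian
     return v / np.linalg.norm(v)  # normalize onto the unit sphere

 print(random_unit_vector(3))      # a uniformly random direction in 3-d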

Depthwise Separable Convolutions

Recently(ish) there has been a development called "depthwise separable convolutions." Unfortunately, it's hard to find a resource that explains clearly what they are. You could just go off of the formulae, of course, but those look like a pain. Here's a description in words.

Let's say our input is 128x128x16.

Regular convolution: We can think of a 3x3 convolution (with "same" padding) as follows. A single kernel has size 3x3x16 (matching the depth of the input). We walk it over our input as usual (over each of the 128x128 positions). At each step, each element in the kernel is multiplied by the corresponding element in the input, and all 3*3*16=144 elements are summed to produce one output. After walking over the whole input, we have a feature map of size 128x128x1. If we do this 32 times (that is, use 32 kernels), we stack them to get an output of 128x128x32.

Pointwise convolution: This is just a regular convolution, but our kernels are always 1x1(x16).

Depthwise convolution: We use a single kernel. Say it has size 3x3x16. We walk as usual, but instead of each time summing all 3*3*16 elements, we sum the 3*3 elements in each layer, and leave them in their layer. Thus we get a 128x128x16 output even though we have only one kernel.

Depthwise separable convolution: This is just a depthwise convolution followed by a pointwise convolution.
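If it's easier to see in code, here's a sketch in Keras (DepthwiseConv2D and Conv2D are how tf.keras exposes the two halves; SeparableConv2D bundles them into a single layer):
 import tensorflow as tf

 inp = tf.keras.Input(shape=(128, 128, 16))

 # Depthwise: one 3x3 filter per input channel; channels stay separate -> 128x128x16
 x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")(inp)

 # Pointwise: a regular 1x1 convolution mixing the 16 channels into 32 -> 128x128x32
 out = tf.keras.layers.Conv2D(filters=32, kernel_size=1)(x)

 model = tf.keras.Model(inp, out)
 model.summary()   # far fewer parameters than a plain 3x3 Conv2D with 32 filters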

A keras layer that fetches data by index

Suppose you have a set of training images in a numpy array with shape (num_imgs, height, width, channels), and you want your model to take as input not a batch of images, but their indices. Your model will fetch the images using those indices. What might that code look like?

First, your input (the index) is a scalar, but Keras doesn't let you use scalar inputs. Easiest might be to make it have shape (1,):
 inp = Input(shape=(1,))  
Next, you might try to use a Lambda layer to extract the image from the input:
 def fetch_img(x):  
   return x_train[x.flatten()]   
   
 fetched_imgs = Lambda(fetch_img)(inp)  
If x = [[1], [2], [3]] (i.e., a bunch of arrays of shape (1,)), then we want to turn it into [1, 2, 3], so we flatten it. Recall that x_train[[1, 2, 3]] is the same as x_train[[1, 2, 3], : , : , : ], which selects images 1, 2, and 3.

Next, we don't want to hardcode the use of x_train:
 def fetch_img(x_train):    
   def _fetch_img(x):      
     return x_train[x.flatten()]     
   return _fetch_img    
   
 fetched_imgs = Lambda(fetch_img(x_train))(inp)       
But we still have a problem: x is of type Tensor, and the function given to Lambda must also return a Tensor. We're trying to operate on a numpy array (x_train).

Often the function you see passed to Lambda will appear to be a mathematical operation, but is really an overloaded TF op (e.g., "x + y" is tf.Tensor.__add__(x, y)). If you want to run arbitrary Python code, you have to invoke tf.py_func:
 def fetch_img(x_train):  
   def _fetch_img(x):  
     return tf.py_func(lambda x: x_train[x.flatten()],  
                       [x], tf.float32)    
   return _fetch_img   

Seems like it works! The following assert passes:
  def fetch_img(x_train):   
    def _fetch_img(x):   
      return tf.py_func(lambda x: x_train[x.flatten()],    
                        [x], tf.float32)     
    return _fetch_img   
     
  inp = Input(shape=(1,), dtype=tf.uint16)   
  fetched_imgs = Lambda(fetch_img(x_train))(inp)   
  model = Model(inp, fetched_imgs)   
     
  x_in = np.array([1, 5, 10])   
  res = model.predict(x_in)   
  exp = x_train[x_in]   
  assert np.array_equal(res, exp)
Still one problem though. What happens if we try to evaluate fetched_imgs.shape? It's unknown, meaning that later stages will crash if they try to make use of it. You have to explicitly set the shape:
 def fetch_img(x_train):  
   def _fetch_img(x):  
     res = tf.py_func(lambda x: x_train[x.flatten()], [x], tf.float32)  
     res.set_shape((x.shape[0],) + x_train[0].shape)  
     return res  
   return _fetch_img  
Whew, that should do it.

QM and confusing terminology, redux


I made a post earlier explaining why QM terminology can be so confusing as to prevent you (well, me) from learning it. It was probably too long, so I'm going to make it simpler.

Phase shift

Recall that light is an electromagnetic wave. This means it has (or is) an electric field (E) and a magnetic field (B) that are at right angles to each other:


If you put a measuring device anywhere along the x axis, you'd find that there's an electric field pointing up (or down) with some strength, and a magnetic field pointing left (or right) with proportional strength. They're proportional because they're "in phase." (The values are also oscillating in time, but forget about that for now.)
The electric and magnetic fields in EMR waves are always in phase and at 90 degrees to each other. -- Wikipedia
Now I want you to forget about the magnetic field and focus on the electric field. As a function of x, it is oscillating in one dimension (+z, -z, +z, ...). This is called linear polarization. You could rotate the field so that it's still linearly polarized, but instead of going up, down, up, down it's going upper-right, lower-left, upper-right, lower-left. (Apologies, the axes have been renamed here so that x is now called z.)

Since the x-y plane is two-dimensional, we can break that red wave down into two components: blue for x and green for y. Looked at in this way, the blue and green waves are "in phase": when we're maximally to the right, we're maximally up; when we're all the way down, we're all the way to the left; when one is zero, so is the other. Note that the electric field components being in phase with each other has nothing to do with the magnetic field being in phase with the electric field.

We can also push these two "out of phase", so that when x is maximized y is zero, and vice versa. If we do this, we'll actually get a circularly polarized wave:
We say that the green and blue (or x and y) components are "90 degrees out of phase," or that we "phase shifted" the components of the wave with respect to each other. This is called a "relative" phase shift.
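In symbols (still ignoring the time dependence, and taking unit amplitudes), the 90-degree shift gives:

$$\begin{eqnarray}
E_x(z) = \cos(kz), \quad E_y(z) = \cos(kz - 90^\circ) = \sin(kz) \\
E_x^2 + E_y^2 = 1
\end{eqnarray}$$

So the field has the same strength at every z but its direction rotates as you move along the axis -- the spiral of a circularly polarized wave.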

You can also "phase shift" the entire wave, by leaving the relative phase of the components the same, and shifting the entire spiral down the z axis. We might call this "absolute" phase shift.

When no context is given, it's almost always assumed the author means a relative phase shift, since that's the only one measurable by experiment.

Now for some gotchas.


Here's a claim from someone who "worked as a physicist at the Fermi National Accelerator Laboratory and the Superconducting Super Collider Laboratory," who explains that what's changing is the relative phase of the $E$ and $B$ fields:
The circularly polarized wave can be expressed as two linearly polarized waves, shifted by 90° in phase and rotated by 90° in polarization. If you pick some direction to measure the fields along, the components of E and B along that direction have a 90° phase shift with respect to each other. A phase shift of 90° means that as E peaks, B becomes zero, and as B peaks, E becomes zero.
As far as I can tell, this is just wrong.


Here's another: from Wikipedia, among many other sources:
"Light waves change phase by 180° when they reflect [off a mirror]."
If the relative phase of light were to change by 180°, then it would change from upper-right, lower-left... to upper-left, lower-right.... This would be easy to detect with a polarizing filter, but it doesn't happen. That's because they're talking about absolute phase. Good luck finding anyone who will explain that.



Finally, recall the two-slit experiment.

"When the two waves are in phase... the summed intensity is maximum, and when they are in anti-phase... then the two waves cancel and the summed intensity is zero. This effect is known as interference."
Which phase are they talking about here? Neither, of course. This is referring to the quantum wave function, where the "wave" is a complex-valued function representing the "probability amplitude" that helps you figure out where the photon might be found.

Which brings us to....

Amplitude

Recall that a single complex number $c$ can be written as: $c = re^{i\theta}$.

Now, amplitude can refer to three different things here:
  1. The key innovation of QM is that we use "probability amplitudes." These refer to the complex number ($c$) itself.
  2. But complex numbers themselves have "amplitudes", usually referring to $r$ above.
  3. Some authors even use it to refer to $\theta$ ("The argument is sometimes also known as the phase or, more rarely and more confusingly, the amplitude")
To recap, a complex number $c = re^{i\theta}$ is an "amplitude," but it also has an amplitude, which can refer to either $r$ or $\theta$. One wonders if some day they'll use it to refer to $i$ and $e$, too.

Physicists can't decide whether they like (1) or (2). You commonly run into both. For (1):
"In quantum mechanics, a probability amplitude is a complex number used in describing the behaviour of systems." -- Wikipedia
And for (2):
"The probability of getting any particular eigenvalue is equal to the square of the amplitude for that eigenvalue." -- Quantum Physicist Sean Carroll.
You first read (1) and when you get to (2) you think "wait, you can't just square a complex number and hope to get a real number (probability)." Amplitude must mean length. So you look up the relevant equation for "square of the amplitude":
$P = {\vert c \vert}^2$
Those bars are called the "norm," or "length," or the "amplitude." So now whenever you detect usage (2) you mentally replace it with "norm," a concept from vector spaces. This begins to reinforce that terrible intuition they teach you in high school, of complex numbers being real vectors.
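To see why it's the norm, and not the complex number itself, that gets squared, write it out for $c = re^{i\theta}$:

${\vert c \vert}^2 = c^*c = re^{-i\theta} \cdot re^{i\theta} = r^2$

which is a real number, whereas $c^2 = r^2e^{2i\theta}$ generally is not.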

How did you get that "vector" again? By taking the inner product of two complex vectors. That's funny, I thought the inner product was supposed to yield a scalar. No matter, let's just internalize this rule: the inner product of two complex vectors is another vector....

From the other side, maybe you have a hard time visualizing a vector in C^2 (i.e., a pair of complex numbers). So you mentally visualize it as a real vector. What is the inner product of two real vectors? It can be thought of as the length of the projection of one onto the other. So now you reinforce the intuition of the inner product as a (real) length.

So now you can't remember whether the inner product should yield a scalar (a + bi), a vector (a, b), or the length of that vector.

This is not a recipe for success.

Conclusion

Keep your concepts straight, or else you'll end up in an abyss.
