Why does AlphaZero use Dirichlet noise?

In the AlphaZero paper, they add Dirichlet noise to the prior probabilities P(s, a) for the root node. Specifically:
"Dirichlet noise Dir(α) was added to the prior probabilities in the root node; this was scaled in inverse proportion to the approximate number of legal moves in a typical position, to a value of α = {0.3, 0.15, 0.03} for chess, shogi and Go respectively."
As I understand it, this is a symmetric Dirichlet: if the P(s, a) vector has n components, then the Dirichlet's parameter vector is also n-dimensional, with every entry equal to α.
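
As a concrete sketch of what that means in code (NumPy is my choice here, not anything specified by the paper; the move count is made up):

```python
import numpy as np

n_moves = 30                  # hypothetical number of legal moves (chess-like)
alpha = 0.3                   # AlphaZero's value for chess

# A "symmetric" Dirichlet: the parameter vector is just alpha repeated n times.
noise = np.random.dirichlet([alpha] * n_moves)

print(noise.shape)            # (30,) -- one noise weight per move
print(noise.sum())            # 1.0 (up to float error): a point on the simplex
```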

What does Dir(0.03) look like? This blog gives some hints. Dir(0.999) looks like this:

[figure: Dir(0.999) density heatmap over the 2-simplex]
Low density is blue; high density shades toward red. So Dir(0.999) concentrates at the corners, i.e., near the points (0, 0, 1), (0, 1, 0), and (1, 0, 0). Unfortunately the plot is misleading: the color scale makes it look very tightly concentrated, when α = 0.999 is in fact only mildly corner-favoring. It is smaller values of α that truly concentrate the mass at the corners.

Now look what happens when α > 1. Dir(5):

[figure: Dir(5) density heatmap over the 2-simplex]
This is getting warmer near (1/3, 1/3, 1/3). (Recall that this triangle represents the set of points where x + y + z = 1 and x, y, z ≥ 0.)

As α increases above 1, the distribution clusters more and more tightly around the center. Dir(50):

[figure: Dir(50) density heatmap over the 2-simplex]
From this, we can infer that Dir(0.3) is more tightly concentrated near the corners than Dir(0.999), and Dir(0.03) more so still. Note also that Dir(1.0) is the uniform distribution over the simplex, favoring no particular point.
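
You can check this concentration story numerically instead of eyeballing heatmaps. Here is a small sketch (again NumPy, my assumption): the mean of a sample's largest component is near 1 when samples hug the corners, and approaches 1/n as they cluster at the center.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3  # matches the 3-dimensional simplex in the plots
for alpha in [0.03, 0.3, 0.999, 1.0, 5.0, 50.0]:
    samples = rng.dirichlet([alpha] * n, size=100_000)
    largest = samples.max(axis=1).mean()
    print(f"alpha = {alpha:<6}  mean largest component = {largest:.3f}")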

So the greater the number of available actions, and hence the smaller the α, the more strongly the noise biases the search toward one of those actions, chosen at random.

Still, why Dirichlet and not something else? I'm not entirely sure. It avoids favoring any particular move, but any distribution would do that if you drew each action's noise independently. Another benefit is that the noise always sums to 1 (or really, after scaling by the constant ε = 0.25, it contributes exactly ε of the probability mass), but that too could be accomplished by normalizing draws from any set of distributions. Dirichlet may simply be the most straightforward distribution that favors the standard basis vectors (i.e., the corners).
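
For completeness, the paper mixes the noise into the root priors as P(s, a) = (1 - ε) p_a + ε η_a, where η ~ Dir(α) and ε = 0.25. A minimal sketch of that mixing step (NumPy again; `priors` stands in for a hypothetical policy-network output):

```python
import numpy as np

def add_root_noise(priors, alpha=0.3, epsilon=0.25, rng=None):
    """Mix Dirichlet noise into root priors: P = (1 - eps) * p + eps * eta."""
    rng = rng or np.random.default_rng()
    eta = rng.dirichlet([alpha] * len(priors))
    return (1 - epsilon) * np.asarray(priors) + epsilon * eta

priors = np.array([0.5, 0.3, 0.2])   # hypothetical priors over 3 legal moves
print(add_root_noise(priors))        # still sums to 1: a convex combination
```

Because it is a convex combination of two points on the simplex, the perturbed vector is still a valid probability distribution, with no renormalization needed.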

