I realized I was a little confused about the details of AlphaZero's tree search, and finally figured out what was confusing me.
Let's briefly revisit regular MCTS (a code sketch follows the list):
- Starting at the root, recursively invoke UCB on the children (picking the child with the highest score) until you hit a leaf node.
- On the node's first visit, perform a rollout (i.e., randomly simulated moves until the end of the game) to estimate the node's value v.
- On the second visit, expand it (i.e., create its children), and visit+rollout one of them.
- There is no third visit, since it's no longer a leaf node.
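To make the two-visit rule concrete, here's a minimal sketch in Python. The `game` interface (`legal_actions`, `next_state`, `is_terminal`, `outcome`) is my own assumption for illustration, not anything from the paper.

```python
import math
import random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = []    # empty until the node is expanded
        self.visits = 0
        self.total_value = 0.0

def ucb1(parent, child, c=1.414):
    # Unvisited children score infinity, so each gets rolled out once.
    if child.visits == 0:
        return float("inf")
    exploit = child.total_value / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def simulate(root, game):
    # Selection: recursively pick the highest-UCB child until we
    # reach a leaf (a node whose children haven't been created yet).
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ucb1(node, ch))
        path.append(node)

    # First visit: just roll out. Second visit: expand (create the
    # children), then visit+rollout one of them.
    if node.visits > 0 and not game.is_terminal(node.state):
        node.children = [Node(game.next_state(node.state, a))
                         for a in game.legal_actions(node.state)]
        node = random.choice(node.children)
        path.append(node)

    # Rollout: random moves to the end of the game.
    state = node.state
    while not game.is_terminal(state):
        state = game.next_state(state, random.choice(game.legal_actions(state)))
    value = game.outcome(state)

    # Backpropagation. (Sign flipping for alternating players is
    # omitted to keep the sketch short.)
    for n in path:
        n.visits += 1
        n.total_value += value
```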
In AlphaZero there is no rollout, and a modified UCB (called PUCT) is used; a sketch of the selection rule follows the list:
- Invoke PUCT on children until you hit a leaf node (s).
- On the node's first visit, invoke the NN to estimate v(s) (the node's value) and p(s) (prior probabilities for each next action, used in PUCT). Also expand it (i.e., create, but do not visit, its children).
- There is no second visit, since it's no longer a leaf node.
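For reference, the PUCT rule amounts to the following. The per-edge statistics N, Q, P and the constant c_puct follow the paper's notation; the node structure is just a stand-in of mine:

```python
import math

def puct_score(node, a, c_puct=1.0):
    # Q(s, a) + U(s, a), where
    # U(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a))
    u = c_puct * node.P[a] * math.sqrt(sum(node.N.values())) / (1 + node.N[a])
    return node.Q[a] + u

# Selection then picks the action with the highest score:
# best = max(node.N, key=lambda a: puct_score(node, a))
```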
From this, it seems like "expand s" means "create s's children." This agrees with the text in the paper:
"Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound Q(s, a)+U(s, a) until a leaf node s′ is encountered. This leaf position is expanded and evaluated only once by the network to generate both prior probabilities and evaluation, (P(s′, ·),V(s′))= fθ(s′)."So the leaf node is one that already exists, and expanding it creates its children. But now look at Figure 2:
"MCTS in AlphaGo Zero.
a (Select): Each simulation traverses the tree by selecting the edge with maximum action value Q, plus an upper confidence bound U that depends on a stored prior probability P and visit count N for that edge (which is incremented once traversed).
b (Expand and evaluate): The leaf node is expanded and the associated position s is evaluated by the neural network (P(s, ·),V(s))=fθ(s); the vector of P values are stored in the outgoing edges from s."
The circled node is the "leaf node" described in step b. But notice that it doesn't yet exist during the tree-walking phase (a): it is created during step b, so "expanding" s here must mean creating s itself (not its child nodes).
So it seems the paper describes the algorithm in two slightly different ways, using the word "expand" differently:
- Walk nodes until a leaf node is encountered. Evaluate it and expand it (create its children).
- Walk edges until the edge has no node attached to it (does this even meet the definition of a graph?). Create ("expand") that node and then evaluate it.
A single simulation proceeds as follows. We compute the action a that maximises the upper confidence bound Q(s, a) + U(s, a). If the next state s′ (obtained by playing action a on state s) exists in our tree, we recursively call the search on s′. If it does not exist, we add the new state s′ to our tree and initialise P(s′, ·) and the value v(s′) from the neural network, and initialise Q(s′, a) and N(s′, a) to 0 for all a. Instead of performing a rollout, we then propagate v(s′) up along the path seen in the current simulation and update all Q(s, a) values.
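That formulation maps almost line-for-line onto a recursive implementation. Here's a sketch under the same assumed `game` interface as above, plus a hypothetical `net.evaluate(s)` returning (p, v); the tree is a plain dict keyed by (hashable) state.

```python
import math

def search(s, game, net, tree, c_puct=1.0):
    """One simulation; returns v so callers can update Q on the way up."""
    if game.is_terminal(s):
        return game.outcome(s)

    if s not in tree:
        # "Expand": add s to the tree, initialise P(s, .) and v(s)
        # from the network, and zero Q(s, a) and N(s, a) for all a.
        p, v = net.evaluate(s)
        actions = game.legal_actions(s)
        tree[s] = {"P": p,
                   "Q": {a: 0.0 for a in actions},
                   "N": {a: 0 for a in actions}}
        return v

    node = tree[s]
    total_n = sum(node["N"].values())
    # Pick the action maximising Q(s, a) + U(s, a).
    a = max(node["Q"], key=lambda b: node["Q"][b] +
            c_puct * node["P"][b] * math.sqrt(total_n) / (1 + node["N"][b]))

    v = search(game.next_state(s, a), game, net, tree, c_puct)

    # Propagate v up the path: running-mean update of Q, bump N.
    # (A two-player game would negate v between plies; omitted here.)
    node["N"][a] += 1
    node["Q"][a] += (v - node["Q"][a]) / node["N"][a]
    return v
```

Note that in this sketch "expand" and "evaluate" happen together, at the moment the node is first created, which matches the figure's description rather than the body text's.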
Maybe it's obvious while reading the paper that they're the same thing, but it sure confused me.