Monte-Carlo search methods have proven successful in automating high-performance play in many two-player games, including Poker, Scrabble, and the ancient game of Go, which we have used as a test platform for our learning algorithms. In Monte-Carlo search, many games of self-play are simulated, using a simulation policy to select moves for both players. A record of wins and losses in the simulated games is kept for each possible move, and move selections are gradually changed to favor the moves that lead to the most wins. As you might imagine, the quality of the moves eventually found in this way is highly dependent on the simulation policy. However, as we have shown in past work, the relationship is not a direct one: simulation policies that are stronger in terms of winning more games are not necessarily the best to use in self-play. More important than the absolute strength of the simulation policy is that the outcomes of its simulations be representative of the outcomes under strong play. A good simulation policy can make many errors as long as they are balanced, favoring neither side in most positions. Thus, in learning a good simulation policy, one could take as an objective not strong play, but balanced play in simulation, or simulation balancing.
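The basic Monte-Carlo loop described above can be sketched as follows. This is a minimal illustration, not the actual system: the `simulate` function, the game representation, and all names here are hypothetical stand-ins for a real engine's simulation policy and rules.

```python
import random

def monte_carlo_move(state, legal_moves, simulate, n_sims=500):
    """Pick the move whose simulated games of self-play win most often.

    `simulate(state)` plays one game to completion, using the simulation
    policy to select moves for both players, and returns 1 for a win and
    0 for a loss from the searching player's point of view.
    """
    wins = {m: 0 for m in legal_moves}   # record of wins per move
    plays = {m: 0 for m in legal_moves}  # games simulated per move
    for _ in range(n_sims):
        m = random.choice(legal_moves)       # sample a candidate move...
        wins[m] += simulate(state + [m])     # ...and record the outcome
        plays[m] += 1
    # favor the move with the best empirical win rate
    return max(legal_moves, key=lambda m: wins[m] / max(plays[m], 1))
```

In a full search the sampling would itself shift toward promising moves rather than choosing uniformly, but the win/loss bookkeeping is the same.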
This year we have come to better understand the phenomenon of simulation balancing, and explored new algorithms for learning a balanced simulation policy. One algorithm uses policy-gradient reinforcement learning methods and a linear function approximator with weights for one hundred simple patterns. Such relatively simple approximators are preferred in the simulation policy so that many games can be simulated quickly in self-play, thus obtaining better Monte-Carlo estimates. Using a simulation policy learned with the objective of optimizing balance, together with a simple Monte-Carlo search, we were able to obtain significantly improved simulation policies and overall performance similar to that of a sophisticated Go engine.
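A policy-gradient update of this flavor can be sketched in a few lines. The sketch below is an illustrative simplification under stated assumptions, not the algorithm as implemented: the simulation policy is a softmax over linear pattern features, and one update step nudges the weights so that the mean simulated outcome moves toward a target value `v_star` for the start position (standing in for a strong or expert evaluation), using the REINFORCE score function. All function and variable names are hypothetical.

```python
import math

def softmax_probs(theta, feats):
    """Simulation policy: pi(a) proportional to exp(theta . phi(a))."""
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def balance_update(theta, episodes, v_star, alpha=0.1):
    """One policy-gradient step toward balanced simulation.

    Each episode is a pair (steps, z): `steps` is a list of decisions,
    each a (feats, chosen_index) pair giving the legal moves' feature
    vectors and the move taken; `z` in {0, 1} is the game's outcome.
    The update reduces the gap between the mean simulated outcome and
    the target value v_star of the start position.
    """
    n = len(episodes)
    mean_z = sum(z for _, z in episodes) / n
    # score-function estimate g ~ E[z * sum_t grad log pi(a_t | s_t)]
    g = [0.0] * len(theta)
    for steps, z in episodes:
        for feats, k in steps:
            probs = softmax_probs(theta, feats)
            for i in range(len(theta)):
                expect = sum(p * phi[i] for p, phi in zip(probs, feats))
                g[i] += z * (feats[k][i] - expect)
    # move weights in the direction that shrinks (v_star - mean_z)
    return [t + alpha * (v_star - mean_z) * gi / n
            for t, gi in zip(theta, g)]
```

The key contrast with ordinary policy-gradient strength training is the error term: the weights are driven by the mismatch between the simulated value and the target value of the position, not by the raw wins themselves, so systematic errors that cancel between the two players are left alone.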