Game playing has dominated the Artificial Intelligence world as a problem domain ever since the field was born. Two-player games do not fit into the established reinforcement-learning framework, since the optimality criterion for games is not one of maximizing reward in the face of a fixed environment, but one of maximizing reward against an optimal adversary (minimax). Nonetheless, reinforcement-learning algorithms can be adapted to work for a very general class of games [63], and many researchers have used reinforcement learning in these environments. One application, spectacularly far ahead of its time, was Samuel's checkers playing system [99]. It learned a value function represented by a linear function approximator, and employed a training scheme similar to the updates used in value iteration, temporal differences, and Q-learning.
More recently, Tesauro [118, 119, 120] applied the temporal difference algorithm to backgammon. Backgammon has approximately $10^{20}$ states, making table-based reinforcement learning impossible. Instead, Tesauro used a backpropagation-based three-layer neural network as a function approximator for the value function, mapping a board position to an estimate of the probability of victory for the current player.
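As a rough illustration of the core update, the sketch below shows a TD(lambda) weight update for a small feed-forward value network with one hidden layer. The network sizes, learning rate, and trace-decay parameter are illustrative assumptions, not the settings used in TD-Gammon.

```python
# A minimal sketch of a TD(lambda) update for a neural-network value function,
# assuming a small numpy network standing in for the three-layer value network.
import numpy as np

class TDValueNet:
    def __init__(self, n_inputs, n_hidden, alpha=0.1, lam=0.7):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        self.W2 = rng.normal(scale=0.1, size=(1, n_hidden))
        self.alpha, self.lam = alpha, lam
        self.reset_traces()

    def reset_traces(self):
        # One eligibility trace per weight, cleared at the start of each game.
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)

    def value(self, x):
        # Hidden layer plus a sigmoid output interpreted as a win probability.
        h = np.tanh(self.W1 @ x)
        v = 1.0 / (1.0 + np.exp(-(self.W2 @ h)[0]))
        return v, h

    def td_update(self, x, v, h, v_next):
        # Gradient of the sigmoid output with respect to each weight matrix.
        dv = v * (1.0 - v)
        g2 = dv * h[np.newaxis, :]
        g1 = dv * (self.W2.T * (1.0 - h[:, np.newaxis] ** 2)) * x[np.newaxis, :]
        # Accumulate eligibility traces and apply the TD(lambda) update.
        self.e1 = self.lam * self.e1 + g1
        self.e2 = self.lam * self.e2 + g2
        delta = v_next - v  # TD error (gamma = 1 for episodic backgammon)
        self.W1 += self.alpha * delta * self.e1
        self.W2 += self.alpha * delta * self.e2
```

During self-play training, the estimate for the successor position plays the role of `v_next`, with the actual game outcome used at the end of each game.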
Two versions of the learning algorithm were used. The first, which we will call Basic TD-Gammon, used very little predefined knowledge of the game, and the representation of a board position was virtually a raw encoding, sufficiently powerful only to permit the neural network to distinguish between conceptually different positions. The second, TD-Gammon, was provided with the same raw state information supplemented by a number of hand-crafted features of backgammon board positions. Providing hand-crafted features in this manner is a good example of how inductive biases from human knowledge of the task can be supplied to a learning algorithm.
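To make the distinction concrete, the sketch below contrasts a near-raw encoding of checker counts with the same vector augmented by hand-crafted features. Both the unary count encoding and the example features (blots and made points) are illustrative assumptions rather than Tesauro's exact representation.

```python
# Illustrative input representations: a near-raw encoding versus the same
# vector extended with hand-crafted backgammon features.
import numpy as np

def raw_encoding(board, on_bar, borne_off, player_to_move):
    """board: length-24 list of signed checker counts (+ for White, - for Black)."""
    units = []
    for count in board:
        for side in (+1, -1):
            n = count * side if count * side > 0 else 0
            # Several units per point per side: thresholds at 1, 2, 3 checkers,
            # plus a scaled count for anything beyond 3.
            units.extend([float(n >= 1), float(n >= 2), float(n >= 3), max(n - 3, 0) / 2.0])
    units.extend(on_bar)          # checkers on the bar, one entry per side
    units.extend(borne_off)       # checkers borne off, one entry per side
    units.extend(player_to_move)  # e.g. [1, 0] if White is to move
    return np.array(units)

def with_handcrafted_features(board, raw):
    # Hypothetical extra features a backgammon expert might supply, e.g. the
    # number of blots (single exposed checkers) and made points for each side.
    white_blots = sum(1 for c in board if c == 1)
    black_blots = sum(1 for c in board if c == -1)
    white_points = sum(1 for c in board if c >= 2)
    black_points = sum(1 for c in board if c <= -2)
    extras = np.array([white_blots, black_blots, white_points, black_points], dtype=float)
    return np.concatenate([raw, extras])
```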
The training of both learning systems required several months of computer time and was achieved by constant self-play. No exploration strategy was used: the system always greedily chose the move with the largest expected probability of victory. This naive exploration strategy proved entirely adequate for this environment, which is perhaps surprising given the considerable body of work in the reinforcement-learning literature producing counter-examples in which greedy exploration leads to poor learning performance. Backgammon, however, has two important properties. First, whatever policy is followed, every game is guaranteed to end in finite time, meaning that useful reward information is obtained fairly frequently. Second, the state transitions are sufficiently stochastic that, independent of the policy, all states will occasionally be visited; a wrong initial value function therefore carries little danger of preventing visits to a critical part of the state space from which important information could be obtained.
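The greedy move selection just described can be sketched as a one-ply search over afterstates, where each legal move's resulting position is scored by the value network and the highest-scoring move is always taken. The helpers `legal_afterstates` and `encode` are hypothetical, and `net` is assumed to provide the `value()` method from the earlier sketch.

```python
def choose_move(net, position, dice, legal_afterstates, encode):
    # Score every legal move's resulting position and always pick the best:
    # purely greedy selection, with no explicit exploration during self-play.
    best_move, best_value = None, -1.0
    for move, afterstate in legal_afterstates(position, dice):
        v, _ = net.value(encode(afterstate))  # estimated win probability for the mover
        if v > best_value:
            best_move, best_value = move, v
    return best_move
```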
The results of TD-Gammon (Table 2) are impressive. It has competed at the very top level of international human play. Basic TD-Gammon played respectably, but not at a professional standard.
| Version | Training Games | Hidden Units | Results |
| Basic   | --        | --  | Poor                            |
| TD 1.0  | 300,000   | 80  | Lost by 13 points in 51 games   |
| TD 2.0  | 800,000   | 40  | Lost by 7 points in 38 games    |
| TD 2.1  | 1,500,000 | 80  | Lost by 1 point in 40 games     |

Table 2: TD-Gammon's performance against top human players.
Although experiments with other games have in some cases produced interesting learning behavior, no success close to that of TD-Gammon has been achieved. Other games that have been studied include Go [104] and Chess [122]. It remains an open question whether, and how, the success of TD-Gammon can be repeated in other domains.