Any review of the literature on reinforcement learning and evaluation functions must begin with the pioneering work of Arthur Samuel on the game of checkers [Samuel1959, Samuel1967]. Samuel recognized the worth of the value function, saying that
    we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board position of the chain of moves which most probably will occur during actual play. Of course, if one could develop a perfect system of this sort it would be the equivalent of always looking ahead to the end of the game. [Samuel1959, p. 219]

Samuel's program incrementally changed the coefficients of an evaluation polynomial so as to make each visited state's value closer to the value obtained from lookahead search.
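To make this update rule concrete, the following is a minimal sketch of the idea (not Samuel's actual procedure); the linear feature vector, step size, and lookahead routine are illustrative assumptions:

import numpy as np

def samuel_style_update(weights, features, lookahead_value, step_size=0.01):
    """Nudge the coefficients of a linear evaluation polynomial so that the
    value of the current position moves toward the value backed up from
    lookahead search on that position.

    weights         -- coefficients of the evaluation polynomial
    features        -- feature vector describing the current board position
    lookahead_value -- value returned by lookahead (e.g., minimax) search
    """
    current_value = np.dot(weights, features)
    error = lookahead_value - current_value
    # Gradient step on the squared difference between the current evaluation
    # and the backed-up evaluation.
    return weights + step_size * error * features

Repeated over the positions visited during play, updates of this form drive the current evaluation toward the evaluation obtained by lookahead.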
After Samuel, work in evaluation function learning was sporadic until the 1980s. Christensen [Christensen1986] tried replacing Samuel's coefficient-tweaking procedure with least-squares regression, and was able to learn reasonable weights for the chess material-advantage function. Lee and Mahajan [Lee and Mahajan1988] trained a nonlinear evaluation function on expertly-played games, producing high-quality Othello play. But the biggest advance in the state of the art came more recently, when the reinforcement-learning community elaborated the connection between AI search and the Bellman equations [Barto et al. 1989, Watkins1989, Sutton1990]. This connection had gone unexplored despite the publication of an AI textbook by Richard Bellman himself [Bellman1978]!
Reinforcement learning's most celebrated success has also been in a game domain, the game of backgammon [Tesauro1992, Tesauro1994, Boyan1992]. Tesauro modified Sutton's TD($\lambda$) algorithm [Sutton1988], which is normally thought of as a model-free algorithm for learning to predict, into a model-based algorithm for learning to control. An implementation is given in Appendix A.1. When $\lambda = 0$, this algorithm reduces to the RTDP algorithm [Barto et al. 1995], which is closely related to VI; the key difference is that its backups are done along sample trajectories through the MDP, rather than along sweeps of the entire state space. Applying TD($\lambda$) with a standard multi-layer perceptron function approximator, Tesauro's program learned an evaluation function which produced expert-level backgammon play. Boyan [Boyan1992] replicated Tesauro's results and demonstrated improvements using modular neural networks.
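As a concrete illustration of this trajectory-based scheme, here is a minimal sketch of model-based TD($\lambda$) control (it is not the Appendix A.1 implementation; the environment interface and the linear approximator standing in for Tesauro's multi-layer perceptron are simplifying assumptions):

import numpy as np

def td_lambda_control(env, weights, lam=0.7, step_size=0.1, episodes=100):
    """Sketch of TD(lambda) used as a model-based control method: act greedily
    with respect to the learned evaluation function and back up values along
    the resulting sample trajectory."""
    for _ in range(episodes):
        state = env.initial_state()
        trace = np.zeros_like(weights)                  # eligibility trace
        while not env.is_terminal(state):
            value = np.dot(weights, env.features(state))

            # Model-based control: evaluate every successor position with the
            # current approximator and move greedily.
            successors = env.successors(state)
            scores = [np.dot(weights, env.features(s)) for s in successors]
            next_state = successors[int(np.argmax(scores))]
            next_value = (env.terminal_value(next_state)
                          if env.is_terminal(next_state)
                          else np.dot(weights, env.features(next_state)))

            # Backup along the trajectory: decay the eligibility trace, add the
            # gradient of the current prediction (the feature vector, since the
            # approximator is linear), and step by the one-step TD error.
            trace = lam * trace + env.features(state)
            weights = weights + step_size * (next_value - value) * trace

            state = next_state
    return weights

With $\lambda = 0$ the trace reduces to the current feature vector, so each update is a pure one-step backup along the trajectory, which is the RTDP-like special case mentioned above.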
Tesauro's combination of TD($\lambda$) and neural networks has been applied to other domains, including elevator control [Crites and Barto1996] and job-shop scheduling [Zhang and Dietterich1995]. (I will discuss the scheduling application in detail in Section 3.2.) Nevertheless, it is important to note that when function approximators are used, TD($\lambda$) provides no guarantees of optimality. In the case of uncontrolled Markov chains and linear function approximators, online TD($\lambda$) does converge [Sutton1988, Dayan1992, Tsitsiklis and Roy1996]--but even then, if $\lambda < 1$, convergence is not necessarily to a good approximation of the true value function [Bertsekas1995]. Moreover, in the general case of arbitrary function approximators and Markov Decision Problems, repeatedly applying one-step backups may propagate and enlarge approximation errors, leading to instability [Thrun and Schwartz1993, Boyan and Moore1995, Gordon1995].
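For reference, the convergent setting mentioned above (prediction on an uncontrolled Markov chain with a linear approximator) can be sketched as follows; the chain, features, and parameters here are illustrative assumptions, not an example from the cited papers:

import numpy as np

rng = np.random.default_rng(0)

n_states, n_features = 5, 3
P = rng.dirichlet(np.ones(n_states), size=n_states)   # row-stochastic transition matrix
rewards = rng.normal(size=n_states)                    # reward for leaving each state
features = rng.normal(size=(n_states, n_features))     # fixed linear features

gamma, lam, step_size = 0.9, 0.5, 0.01
weights = np.zeros(n_features)
trace = np.zeros(n_features)
state = 0

for _ in range(50000):
    next_state = rng.choice(n_states, p=P[state])
    # One-step TD error under the current linear value estimates.
    delta = (rewards[state]
             + gamma * features[next_state] @ weights
             - features[state] @ weights)
    # Accumulating eligibility trace and online weight update.
    trace = gamma * lam * trace + features[state]
    weights += step_size * delta * trace
    state = next_state

print("learned weights:", weights)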