This implementation of TD(λ) is trajectory-based. For a version of TD(λ) that performs updates after each move, refer to [Sutton1987].
TD(λ, start states, fitter F):
    /* Assumes a known world-model MDP; F is parametrized by weight vector w. */
    repeat steps 1 and 2 forever:
        1. Using the model and the current evaluation function F, generate a mostly-greedy
           trajectory from a start state to a terminal state:  s_0, s_1, ..., s_T.
           Also record the rewards r_0, r_1, ..., r_{T-1} received at each step.
        2. Update the fitter from the trajectory as follows:
            for i := T downto 0, do:
                compute the λ-return target  R_i := r_i + γ [ (1-λ) F(s_{i+1}) + λ R_{i+1} ]
                    for i < T, with R_T := 0 at the terminal state (γ is the discount factor);
                update F's weights by the delta rule:  w := w + α (R_i - F(s_i)) ∇_w F(s_i),
                    where α is a step-size parameter;
            end
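
For concreteness, the following is a minimal sketch of the backward pass in step 2 for the common special case of a linear fitter F(s) = w . phi(s). The function name td_lambda_backward_pass, the feature map phi, and the parameters lam (λ), alpha (step size), and gamma (discount factor) are illustrative assumptions, not part of the pseudocode above, and the mostly-greedy trajectory generation of step 1 is taken as given.

    # Sketch only: trajectory-based TD(lambda) backward pass, assuming a linear fitter.
    import numpy as np

    def td_lambda_backward_pass(w, phi, states, rewards, lam=0.8, alpha=0.1, gamma=1.0):
        """One backward pass of trajectory-based TD(lambda) for a linear fitter.

        w        -- float weight vector of the fitter F(s) = w . phi(s)
        phi      -- feature map: state -> numpy array with the same length as w
        states   -- trajectory s_0, ..., s_T (s_T terminal)
        rewards  -- r_0, ..., r_{T-1}, the reward received on each transition
        """
        w = w.copy()
        T = len(states) - 1
        R = 0.0                                   # lambda-return target at the terminal state
        for i in range(T, -1, -1):
            features = phi(states[i])
            value = float(w @ features)           # current estimate F(s_i)
            if i < T:
                next_value = float(w @ phi(states[i + 1]))
                # recursive lambda-return: R_i = r_i + gamma*[(1-lam)*F(s_{i+1}) + lam*R_{i+1}]
                R = rewards[i] + gamma * ((1.0 - lam) * next_value + lam * R)
            else:
                R = 0.0                           # train F to predict 0 at the terminal state
            # delta rule: for a linear fitter, grad_w F(s_i) is just phi(s_i)
            w += alpha * (R - value) * features
        return w

Because the weights are updated inside the backward loop, the value F(s_{i+1}) used at step i already reflects the update made at step i+1; snapshotting all predictions before the pass is an equally reasonable variant.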