This implementation of TD(λ) is trajectory-based. For a version of TD(λ) that performs updates after each move, refer to [Sutton1987].
TD(λ, start states, fitter F):
    /* Assumes a known world-model MDP; F is parametrized by weight vector w. */
    repeat steps 1 and 2 forever:
        1. Using the model and the current evaluation function F, generate a mostly-greedy
           trajectory from a start state to a terminal state:  s_0, s_1, ..., s_T.
           Also record the rewards r_0, r_1, ..., r_{T-1} received at each step.
        2. Update the fitter from the trajectory as follows:
            for i := T downto 0, do:
                compute the λ-return target  R_i := r_i + γ [ (1-λ) F(s_{i+1}) + λ R_{i+1} ]
                    for i < T, with R_T := 0 at the terminal state (γ is the discount factor);
                update F's weights by the delta rule:  w := w + α (R_i - F(s_i)) ∇_w F(s_i),
                    where α is a step-size parameter;
            end
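
For concreteness, the following is a minimal sketch of the backward pass in step 2 for the common special case of a linear fitter F(s) = w . phi(s). The function name td_lambda_backward_pass, the feature map phi, and the parameters lam (λ), alpha (step size), and gamma (discount factor) are illustrative assumptions, not part of the pseudocode above, and the mostly-greedy trajectory generation of step 1 is taken as given.

    # Sketch only: trajectory-based TD(lambda) backward pass, assuming a linear fitter.
    import numpy as np

    def td_lambda_backward_pass(w, phi, states, rewards, lam=0.8, alpha=0.1, gamma=1.0):
        """One backward pass of trajectory-based TD(lambda) for a linear fitter.

        w        -- float weight vector of the fitter F(s) = w . phi(s)
        phi      -- feature map: state -> numpy array with the same length as w
        states   -- trajectory s_0, ..., s_T (s_T terminal)
        rewards  -- r_0, ..., r_{T-1}, the reward received on each transition
        """
        w = w.copy()
        T = len(states) - 1
        R = 0.0                                   # lambda-return target at the terminal state
        for i in range(T, -1, -1):
            features = phi(states[i])
            value = float(w @ features)           # current estimate F(s_i)
            if i < T:
                next_value = float(w @ phi(states[i + 1]))
                # recursive lambda-return: R_i = r_i + gamma*[(1-lam)*F(s_{i+1}) + lam*R_{i+1}]
                R = rewards[i] + gamma * ((1.0 - lam) * next_value + lam * R)
            else:
                R = 0.0                           # train F to predict 0 at the terminal state
            # delta rule: for a linear fitter, grad_w F(s_i) is just phi(s_i)
            w += alpha * (R - value) * features
        return w

Because the weights are updated inside the backward loop, the value F(s_{i+1}) used at step i already reflects the update made at step i+1; snapshotting all predictions before the pass is an equally reasonable variant.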