This implementation of TD($\lambda$) is trajectory-based. For a version of TD($\lambda$) that performs updates after each move, refer to [Sutton1987].
TD($\lambda$)

/* Assumes a known world-model MDP; F is parametrized by the weight vector w. */

repeat steps 1 and 2 forever:

1. Using the model and the current evaluation function F, generate a mostly-greedy
   trajectory from a start state to a terminal state: $s_0, s_1, \ldots, s_T$.
   Also record the rewards $r_0, r_1, \ldots, r_T$ received along the way.
2. Update the fitter from the trajectory as follows:

   for $i := T$ downto $0$, do:
       $R^{\lambda}_i := r_i + \gamma \left[ (1-\lambda)\, F(s_{i+1}) + \lambda\, R^{\lambda}_{i+1} \right]$
           (using the conventions $F(s_{T+1}) \equiv 0$ and $R^{\lambda}_{T+1} \equiv 0$, so that $R^{\lambda}_T = r_T$)
       update F's weights by the delta rule: $w := w + \alpha \left( R^{\lambda}_i - F(s_i) \right) \nabla_w F(s_i)$
   end
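
The backward pass above maps directly onto code. The following is a minimal sketch in Python under stated assumptions: a linear evaluation function $F(s) = w^{\top} \phi(s)$ (so $\nabla_w F(s_i) = \phi(s_i)$), a hypothetical feature map `phi`, and a fixed toy trajectory standing in for the model-based trajectory generation of step 1. The pseudocode also leaves open whether $F(s_{i+1})$ is evaluated with the weights from before the pass or with the running weights; this sketch uses the running weights.

```python
import numpy as np

def td_lambda_trajectory_update(w, phi, states, rewards,
                                alpha=0.1, gamma=1.0, lam=0.8):
    """One backward pass of trajectory-based TD(lambda) over s_0..s_T, r_0..r_T.

    Implements R_i = r_i + gamma * ((1 - lam) * F(s_{i+1}) + lam * R_{i+1}),
    with the conventions F(s_{T+1}) = 0 and R_{T+1} = 0, followed by the
    delta-rule weight update at each state.
    """
    T = len(states) - 1
    R = 0.0  # lambda-return R_{i+1}; starts as R_{T+1} = 0
    for i in range(T, -1, -1):  # i := T downto 0
        # Value of the successor state; zero past the end of the trajectory.
        F_next = w @ phi(states[i + 1]) if i < T else 0.0
        R = rewards[i] + gamma * ((1.0 - lam) * F_next + lam * R)
        feats = phi(states[i])  # for linear F, grad_w F(s_i) = phi(s_i)
        w = w + alpha * (R - w @ feats) * feats  # delta rule
    return w

# Usage on a hypothetical 3-state chain with one-hot features. The fixed
# trajectory below stands in for step 1's model-based trajectory generation.
phi = lambda s: np.eye(3)[s]
w = np.zeros(3)
for _ in range(100):  # stands in for "repeat forever"
    w = td_lambda_trajectory_update(w, phi, states=[0, 1, 2],
                                    rewards=[0.0, 0.0, 1.0])
```

Evaluating $F(s_{i+1})$ inside the loop means each target sees the freshest estimate of its successor, since $s_{i+1}$ was itself updated one iteration earlier; caching all the pre-pass values up front would instead give a strictly offline update. Either reading is consistent with the pseudocode above.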