One way, then, to find an optimal policy is to find the optimal value
function. It can be determined by a simple iterative algorithm
called value iteration that can be shown to converge to the correct
values [10, 13].
initialize V(s) arbitrarily
loop until policy good enough
    loop for s in S
        loop for a in A
            Q(s,a) := R(s,a) + γ Σ_{s' in S} T(s,a,s') V(s')
        end loop
        V(s) := max_a Q(s,a)
    end loop
end loop
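The algorithm above can be sketched in Python; the small random MDP below (its states, actions, T, R, discount factor, and stopping threshold) is an illustrative assumption, not part of the text.

```python
# A minimal sketch of value iteration on a small, hypothetical MDP.
import random

random.seed(0)
states = list(range(3))
actions = list(range(2))
gamma = 0.9

# T[s][a]: list of (next_state, probability); R[s][a]: expected reward.
T = {s: {a: [(s2, 1.0 / len(states)) for s2 in states] for a in actions}
     for s in states}
R = {s: {a: random.random() for a in actions} for s in states}

V = {s: 0.0 for s in states}              # initialize V(s) arbitrarily
epsilon = 1e-6
while True:                               # loop until policy good enough
    delta = 0.0
    for s in states:                      # loop for s in S
        # Q(s,a) := R(s,a) + gamma * sum_{s'} T(s,a,s') V(s')
        Q = {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
             for a in actions}
        new_v = max(Q.values())           # V(s) := max_a Q(s,a)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < epsilon:                   # maximum change below threshold
        break

# Greedy policy: in each state, pick the action maximizing the estimate.
greedy = {s: max(actions, key=lambda a, s=s: R[s][a] + gamma *
                 sum(p * V[s2] for s2, p in T[s][a]))
          for s in states}
```

The loop halts when the largest change to any V(s) in a sweep falls below the threshold, which connects to the stopping-criterion discussion that follows.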
It is not obvious when to stop the value iteration algorithm. One
important result bounds the performance of the current greedy policy
as a function of the Bellman residual of the current value
function [134]. It says that if the maximum difference
between two successive value functions is less than ε, then
the value of the greedy policy (the policy obtained by choosing, in
every state, the action that maximizes the estimated discounted
reward, using the current estimate of the value function) differs from
the value function of the optimal policy by no more than
2εγ/(1−γ) at any state. This provides an effective stopping
criterion for the algorithm. Puterman [90] discusses
another stopping criterion, based on the span semi-norm, which
may result in earlier termination. Another important result is that
the greedy policy is guaranteed to be optimal in some finite number of
steps even though the value function may not have
converged [13]. And in practice, the greedy policy is
often optimal long before the value function has converged.
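As a worked instance of the stopping bound, with a discount factor and residual threshold chosen purely for illustration:

```python
# Worked instance of the stopping bound: if successive value functions
# differ by at most eps everywhere, the greedy policy's value is within
# 2 * eps * gamma / (1 - gamma) of optimal at every state.
gamma = 0.9    # illustrative discount factor
eps = 0.01     # illustrative threshold on the maximum difference
bound = 2 * eps * gamma / (1 - gamma)   # about 0.18 with these numbers
```

Note how the guarantee degrades as the discount factor approaches 1: the denominator 1 − γ shrinks, so a much smaller residual is needed for the same policy-quality guarantee.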
Value iteration is very flexible. The assignments to V need not be done in strict order as shown above, but instead can occur asynchronously in parallel provided that the value of every state gets updated infinitely often on an infinite run. These issues are treated extensively by Bertsekas [16], who also proves convergence results.
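A sketch of this asynchronous variant, backing up randomly chosen states in place; the deterministic toy MDP and the fixed number of updates are illustrative assumptions.

```python
# Asynchronous value iteration: update randomly chosen states in place;
# convergence requires that every state keeps being updated.
import random

random.seed(1)
states = list(range(4))
actions = list(range(2))
gamma = 0.9
# Deterministic toy MDP: action a moves s to (s + a + 1) % 4;
# reward 1 is received only in state 0.
T = {s: {a: [((s + a + 1) % 4, 1.0)] for a in actions} for s in states}
R = {s: {a: (1.0 if s == 0 else 0.0) for a in actions} for s in states}

V = {s: 0.0 for s in states}
for _ in range(5000):                  # each state is hit many times
    s = random.choice(states)          # arbitrary order, not a sweep
    V[s] = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
               for a in actions)
```

Because each update uses the latest values of the other states, no synchronized sweep structure is needed; only sufficient coverage of every state matters.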
Updates based on Equation 1 are known as full backups since they make use of information from all possible successor states. It can be shown that updates of the form

    Q(s,a) := (1−α) Q(s,a) + α (r + γ max_{a'} Q(s',a'))

can also be used as long as each pairing of a and s is updated
infinitely often, s' is sampled from the distribution T(s,a,s'),
r is sampled with mean R(s,a) and bounded variance, and the
learning rate α is decreased slowly. This type of sample
backup [111] is critical to the operation of the model-free
methods discussed in the next section.
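A sketch of the sample backup under these conditions, on a hypothetical deterministic two-state MDP; the MDP itself, the 1/n learning-rate schedule, and the iteration count are illustrative assumptions.

```python
# Sample backup: one sampled transition (s, a, r, s') updates Q, with a
# learning rate that decays as 1/n separately for each (s, a) pair.
import random

random.seed(2)
gamma = 0.5
states, actions = [0, 1], [0, 1]
Q = {(s, a): 0.0 for s in states for a in actions}
n = {(s, a): 0 for s in states for a in actions}

def sample(s, a):
    # Toy model: action 0 stays put, action 1 flips the state;
    # reward 1 is received only in state 1.
    return (1.0 if s == 1 else 0.0), (s if a == 0 else 1 - s)

for _ in range(50000):
    s, a = random.choice(states), random.choice(actions)  # every (s,a) often
    r, s2 = sample(s, a)
    n[(s, a)] += 1
    alpha = 1.0 / n[(s, a)]            # slowly decreasing learning rate
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```

Only one sampled successor state enters each update, rather than a sum over all of them; the averaging done by the decaying learning rate takes the place of the explicit expectation in the full backup.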
The computational complexity of the value-iteration algorithm with
full backups, per iteration, is quadratic in the number of states and
linear in the number of actions. Commonly, the transition
probabilities T(s,a,s') are sparse. If there are on average a
constant number of next states with non-zero probability then the cost
per iteration is linear in the number of states and linear in the
number of actions. The number of iterations required to reach the
optimal value function is polynomial in the number of states and the
magnitude of the largest reward if the discount factor is held
constant. However, in the worst case the number of iterations grows
polynomially in 1/(1−γ), so the convergence rate slows
considerably as the discount factor approaches 1 [66].
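The per-iteration counts can be made concrete; the sizes below are illustrative assumptions.

```python
# Operation counts for one full-backup sweep. Dense transitions cost
# |S| * |A| * |S| backup terms; with a constant number k of non-zero
# successors per (s, a) pair, the cost drops to |S| * |A| * k.
S, A, k = 1000, 10, 3        # illustrative sizes
dense_ops = S * A * S        # quadratic in the number of states
sparse_ops = S * A * k       # linear in the number of states
```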