Value iteration works by producing successive approximations of the
optimal value function. Each iteration can be performed in
O(|A||S|^2) steps, or faster if there is sparsity in the transition
function. However, the number of iterations required can grow
exponentially in the discount factor [27]; as the discount
factor approaches 1, the decisions must be based on results that
happen farther and farther into the future.
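To make the procedure concrete, the following is a minimal sketch of tabular value iteration in Python/NumPy. The array names P (transition probabilities indexed by action, state, next state) and R (expected immediate rewards indexed by action and state) are illustrative conventions rather than anything fixed by the text above; each sweep is the O(|A||S|^2) backup just described.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6, max_iter=10_000):
    """Successive-approximation value iteration for a finite MDP.

    P: transition probabilities, shape (|A|, |S|, |S|), P[a, s, s'] = Pr(s' | s, a).
    R: expected immediate rewards, shape (|A|, |S|).
    Returns an approximate optimal value function and a greedy policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # One backup costs O(|A||S|^2): a |S| x |S| matrix-vector product per action.
        Q = R + gamma * np.einsum("ast,t->as", P, V)   # Q[a, s]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=0)
```

For a fixed tolerance, the number of sweeps needed grows as the discount factor approaches 1, which is the dependence noted above.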
In practice, policy iteration converges in fewer iterations than value iteration, although
the per-iteration cost of O(|A||S|^2 + |S|^3)
can be prohibitive.
There is no known tight worst-case bound available for policy
iteration [66]. Modified policy
iteration [91] seeks a trade-off between cheap and
effective iterations and is preferred by some
practitioners [96].
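A corresponding sketch of policy iteration, under the same illustrative P and R conventions, is given below. Solving the linear system in the evaluation step is the O(|S|^3) component of the per-iteration cost; passing a small number of evaluation sweeps instead gives the truncated evaluation used by modified policy iteration.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95, eval_sweeps=None, max_iter=1_000):
    """Policy iteration for a finite MDP; same P, R conventions as above.

    eval_sweeps=None evaluates the current policy exactly by solving the
    |S| x |S| linear system (the O(|S|^3) step); a small integer instead
    performs that many backup sweeps, i.e. modified policy iteration.
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Policy evaluation for the current policy.
        P_pi = P[policy, np.arange(n_states)]          # (|S|, |S|)
        R_pi = R[policy, np.arange(n_states)]          # (|S|,)
        if eval_sweeps is None:
            V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        else:
            for _ in range(eval_sweeps):
                V = R_pi + gamma * P_pi @ V
        # Policy improvement: act greedily with respect to the evaluated V.
        Q = R + gamma * np.einsum("ast,t->as", P, V)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return V, policy
```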
Linear programming [105] is an extremely general optimization framework, and MDPs can be solved by general-purpose linear-programming packages [35, 34, 46]. An advantage of this approach is that commercial-quality linear-programming packages are available, although the time and space requirements can still be quite high. From a theoretical perspective, linear programming yields the only known algorithms that solve MDPs in polynomial time, although these theoretically efficient algorithms have not been shown to be efficient in practice.
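As an illustration of the linear-programming route, the sketch below feeds the standard primal LP for a discounted MDP (minimize the sum of state values subject to the Bellman inequalities) to SciPy's general-purpose linprog solver; the P and R conventions are the same illustrative ones used above.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma=0.95):
    """Solve a finite MDP with a general-purpose LP solver.

    Primal LP: minimize sum_s V(s) subject to
        V(s) >= R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s')   for all s, a.
    Same P (|A|, |S|, |S|) and R (|A|, |S|) conventions as above.
    """
    n_actions, n_states, _ = P.shape
    c = np.ones(n_states)                                  # minimize the sum of values
    # One inequality per (action, state) pair, rewritten as A_ub @ V <= b_ub.
    A_ub = (gamma * P - np.eye(n_states)[None, :, :]).reshape(-1, n_states)
    b_ub = -R.reshape(-1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    V = res.x
    # Recover a greedy policy from the optimal value function.
    Q = R + gamma * np.einsum("ast,t->as", P, V)
    return V, Q.argmax(axis=0)
```

The LP has |S| variables and |S||A| constraints, which is what makes the time and space requirements of general-purpose solvers high for large state spaces.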