Convergence of Reinforcement Learning

                  -------------- Not Residual --------------  ---------------- Residual ----------------
                  Fixed         Fixed         Usually-       Fixed         Fixed         Usually-
                  distribution  distribution  greedy         distribution  distribution  greedy
                  (on-policy)                 distribution   (on-policy)                 distribution

Markov Chain
  Lookup table    Y             Y                            Y             Y
  Averager        Y             Y                            Y             Y
  Linear          Y             N                            Y             Y
  Nonlinear       N             N                            Y             Y

MDP
  Lookup table    Y             Y             Y              Y             Y             Y
  Averager        Y             Y             N              Y             Y             Y
  Linear          N             N             N              Y             Y             Y
  Nonlinear       N             N             N              Y             Y             Y

POMDP
  Lookup table    N             N             N              Y             Y             Y
  Averager        N             N             N              Y             Y             Y
  Linear          N             N             N              Y             Y             Y
  Nonlinear       N             N             N              Y             Y             Y

This table gives convergence results for incremental RL algorithms such as TD(lambda), Q-learning, Advantage Learning, incremental value iteration, and SARSA. A "Y" means the algorithm is guaranteed to converge, in the same sense that Backprop converges. An "N" means that counterexamples are known where it fails to converge and learns nothing useful, even when given infinite training time and slowly decreasing learning rates. Blank entries are nonsensical combinations, where the training distribution would have to follow a changing policy even though there is no policy (a Markov chain has no actions to choose).
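
As a concrete illustration of the kind of incremental algorithm the table covers, here is a minimal sketch (not taken from the original page) of tabular TD(0) on a small Markov chain, corresponding to the "Lookup table" row trained on-policy with a fixed distribution. The chain, rewards, and learning-rate schedule are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not code from the source):
# tabular TD(0) on a toy 5-state Markov chain -- the "Lookup table" row
# of the table, trained on-policy with a slowly decreasing learning rate.

n_states = 5                      # states 0..4; state 4 is terminal
gamma = 0.9                       # discount factor
V = [0.0] * n_states              # lookup-table value function

def step(s):
    """Toy transition: always move right; reward 1 on reaching the terminal state."""
    s_next = s + 1
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

for episode in range(1, 2001):
    alpha = 1.0 / episode         # slowly decreasing learning rate
    s = 0
    while s != n_states - 1:
        r, s_next = step(s)
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error  # incremental TD(0) update
        s = s_next

print([round(v, 3) for v in V])   # converges toward [0.729, 0.81, 0.9, 1.0, 0.0]
```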

This table suggests that residual algorithms might be useful in cases where more general function approximators are needed. It is also interesting that, for most cases in the table, linear and nonlinear function approximators have the same properties, as do on-policy and fixed training distributions.
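
To make the "Not Residual" versus "Residual" distinction concrete, the sketch below (an illustrative assumption, not code from this page) shows the two weight updates for TD(0) with a linear function approximator V(s) = w . phi(s). The direct update pushes V(s) toward a target that is treated as a constant, while the residual-gradient update is true gradient descent on the squared Bellman residual.

```python
# Minimal sketch (illustrative, not from the source): the update-rule
# difference behind the "Not Residual" and "Residual" columns, shown for
# TD(0) with a linear function approximator V(s) = w . phi(s).
import numpy as np

def td0_updates(w, phi_s, phi_s_next, r, gamma, alpha):
    """Return (direct, residual-gradient) weight vectors after one update."""
    # Bellman residual for this transition
    delta = r + gamma * (w @ phi_s_next) - (w @ phi_s)

    # Direct (non-residual) TD(0): move V(s) toward the target
    # r + gamma*V(s'), treating the target as a constant.  This is the
    # variant that can diverge with linear or nonlinear approximators.
    w_direct = w + alpha * delta * phi_s

    # Residual gradient: gradient descent on 0.5 * delta**2, so both
    # V(s) and V(s') are differentiated.  This is the update that
    # converges in the same sense as Backprop.
    w_residual = w - alpha * delta * (gamma * phi_s_next - phi_s)

    return w_direct, w_residual

# Example transition: features for s and s', reward 0, discount 0.99
w = np.zeros(2)
print(td0_updates(w, np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.0, 0.99, 0.1))
```

Residual algorithms interpolate between these two updates, trading the speed of the direct method against the guaranteed convergence of the residual-gradient method.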

The categories have the following definitions:
