Convergence of Reinforcement Learning
| Problem | Function approximator | Not residual: fixed dist. (on-policy) | Not residual: fixed dist. | Not residual: usually-greedy dist. | Residual: fixed dist. (on-policy) | Residual: fixed dist. | Residual: usually-greedy dist. |
|---|---|---|---|---|---|---|---|
| Markov chain | Lookup table | Y | Y |   | Y | Y |   |
| Markov chain | Averager     | Y | Y |   | Y | Y |   |
| Markov chain | Linear       | Y | N |   | Y | Y |   |
| Markov chain | Nonlinear    | N | N |   | Y | Y |   |
| MDP          | Lookup table | Y | Y | Y | Y | Y | Y |
| MDP          | Averager     | Y | Y | N | Y | Y | Y |
| MDP          | Linear       | N | N | N | Y | Y | Y |
| MDP          | Nonlinear    | N | N | N | Y | Y | Y |
| POMDP        | Lookup table | N | N | N | Y | Y | Y |
| POMDP        | Averager     | N | N | N | Y | Y | Y |
| POMDP        | Linear       | N | N | N | Y | Y | Y |
| POMDP        | Nonlinear    | N | N | N | Y | Y | Y |
This table gives convergence results for incremental RL algorithms such as
TD(lambda), Q-learning, Advantage Learning, incremental value iteration,
and SARSA. A "Y" means the algorithm is guaranteed to converge, in the same
sense that backpropagation converges. An "N" means counterexamples are known
where it fails to converge and learns nothing useful, even when given
infinite training time and slowly decreasing learning rates. Blank entries
mark nonsensical combinations: those columns assume a changing policy, but
for a Markov chain there is no policy to change.
This table suggests that residual algorithms might be useful in cases
where more general function approximators are needed. It is also
interesting that for most cases in the table, linear and nonlinear
function approximators have the same properties, and on-policy and
fixed distributions have the same properties.
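To make the distinction between the "Not residual" and "Residual" columns concrete, here is a minimal sketch (illustrative Python; names such as phi_s, alpha, and gamma are not from this page) contrasting the direct TD(0) update with the residual-gradient TD(0) update for a value function that is linear in the weights. A residual algorithm proper blends the two updates; the sketch shows only the two extremes.

```python
import numpy as np

def direct_td0_update(w, phi_s, phi_next, r, alpha=0.1, gamma=0.9, terminal=False):
    """Direct (non-residual) TD(0) for V(s) = w . phi(s).
    The target r + gamma*V(s') is treated as a constant, so only V(s)
    is differentiated.  This is the kind of update described by the
    "Not residual" columns, and it can diverge off-policy."""
    v_next = 0.0 if terminal else w @ phi_next
    delta = r + gamma * v_next - w @ phi_s          # TD error (Bellman residual)
    return w + alpha * delta * phi_s

def residual_gradient_td0_update(w, phi_s, phi_next, r, alpha=0.1, gamma=0.9, terminal=False):
    """Residual-gradient TD(0): true gradient descent on the mean squared
    Bellman residual, differentiating through both V(s) and V(s').
    This is the kind of update described by the "Residual" columns."""
    g_next = np.zeros_like(phi_s) if terminal else phi_next
    v_next = 0.0 if terminal else w @ phi_next
    delta = r + gamma * v_next - w @ phi_s
    return w - alpha * delta * (gamma * g_next - phi_s)

# A residual algorithm (as opposed to the pure residual-gradient form above)
# mixes the two weight changes with a fixed weight in [0, 1], trading the
# speed of the direct update for the stability of the residual-gradient one.
```

For example, with one-hot features a single transition changes only the visited state's weight under the direct rule, but changes both the visited state's and the successor state's weights under the residual-gradient rule.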
The categories have the following definitions:
- Residual - Using either residual gradient algorithms or residual algorithms
(the residual-gradient form is sketched above) to guarantee convergence.
- Fixed distribution - Each state-action pair is updated with a fixed
frequency. This frequency may be arbitrary, not corresponding to any
particular exploration policy.
- Fixed distribution (on-policy) -
Each state-action pair is updated with a frequency proportional
to how often it is visited while following a fixed exploration policy.
- Usually-greedy distribution -
Each state-action pair is updated with a frequency proportional
to how often it is visited while following the current exploration policy.
The exploration policy changes over time, so that the actions with the
highest learned values are performed most often. For most algorithms it is
epsilon-greedy or Boltzmann exploration (a short sketch of both appears
after this list). Since a greedy policy doesn't make sense for a Markov
chain, which has no policy, there are no table entries for that combination.
- Lookup table -
A separate parameter is stored for each state-action pair.
- Averager -
A class of function approximators, defined by Geoff Gordon, that includes
k-nearest-neighbor and other memory-based systems (a small k-nearest-neighbor
sketch also appears after this list).
- Linear -
A function approximator which is linear in the weights.
- Nonlinear -
An arbitrary, smooth, parameterized function approximator.
- Markov chain -
This is the problem of just learning a value function for
a given, fixed, stochastic policy.
- MDP -
Markov Decision Process. The problem of learning a policy
(mapping from state to action), assuming the state has the
Markov property (memory doesn't improve the policy).
- POMDP -
Partially Observable Markov Decision Process. This is the
general reinforcement learning problem, where the state is
not directly observable. The optimal policy (a mapping from
observations and memory to actions) must sometimes be stochastic.
These rows refer to applying Q-learning and similar algorithms
directly to a POMDP, rather than first reducing it to an MDP
through belief states or tapped delay lines.
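As referenced in the "Usually-greedy distribution" entry, the following sketch (illustrative Python; the names w, phi_s, epsilon, and temperature are not from this page) shows epsilon-greedy and Boltzmann action selection on top of a Q-function that is linear in the weights. Because the chosen action depends on the current weights, the distribution of updated state-action pairs drifts as learning proceeds.

```python
import numpy as np

def q_values(w, phi_s):
    """Q(s, a) for all actions; w holds one weight vector per action,
    so this approximator is linear in the weights."""
    return w @ phi_s                              # shape: (num_actions,)

def epsilon_greedy_action(w, phi_s, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, otherwise the action
    with the highest learned value.  Sampling state-action pairs this way
    gives a "usually-greedy" distribution that changes as the weights change."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(w.shape[0]))      # exploratory action
    return int(np.argmax(q_values(w, phi_s)))

def boltzmann_action(w, phi_s, temperature=1.0, rng=None):
    """Boltzmann (softmax) exploration: actions with higher learned value
    are chosen exponentially more often."""
    rng = rng or np.random.default_rng()
    q = q_values(w, phi_s) / temperature
    p = np.exp(q - q.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```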
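For the "Averager" entry, here is an equally rough sketch of one member of that class, a k-nearest-neighbor averager. The anchor states and the value of k are assumptions for illustration only; the point is that the learned parameters are values stored at fixed anchor states, and every prediction is a fixed average of nearby stored values rather than an extrapolation.

```python
import numpy as np

class KnnAverager:
    """k-nearest-neighbor averager: the adjustable parameters are values
    attached to fixed anchor states, and every prediction is an average
    of k of those stored values (an "averager" in Gordon's sense)."""

    def __init__(self, anchor_states, k=3):
        self.anchors = np.asarray(anchor_states)    # shape: (n, state_dim)
        self.values = np.zeros(len(self.anchors))   # one parameter per anchor
        self.k = k

    def _neighbors(self, state):
        dists = np.linalg.norm(self.anchors - np.asarray(state), axis=1)
        return np.argsort(dists)[: self.k]

    def predict(self, state):
        """V(state): plain average of the k nearest stored values."""
        return float(self.values[self._neighbors(state)].mean())

    def update(self, state, target, alpha=0.1):
        """Move the k nearest stored values toward a training target
        (for TD learning the target would be r + gamma * V(s'))."""
        idx = self._neighbors(state)
        self.values[idx] += alpha * (target - self.values[idx])
```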