I plan to begin my presentation by suggesting that the adaptive
heuristic critic (AHC) idea implemented using neural network value and
action function approximators seems a plausible explanation of the
learned, skillful coping behavior of animals. I will note that the
output of the function approximators should depend not only on the
current stimuli (state), but also on some internal state of the
approximator that in turn
depends on the prior trajectory of the process. (An Elman net is a
particular realization of this requirement.) This automatically
introduces sensitivity of value and action to whatever in the past is
relevant to the dynamics and reinforcement. Having touched on this
intriguing conceptual matter, I will quickly turn to an unrelated,
mundane, but practical concern.
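Before turning to that concern, here is a minimal sketch of the
Elman-style recurrence mentioned above; the layer sizes, names, and
initialization are illustrative assumptions, not part of the proposal
itself.

    # A minimal Elman-style recurrent network: the hidden ("context") state
    # is fed back on the next step, so value/action outputs depend on the
    # whole prior trajectory of stimuli, not just the current one.
    import numpy as np

    class ElmanNet:
        def __init__(self, n_in, n_hidden, n_out, seed=0):
            rng = np.random.default_rng(seed)
            self.W_xh = rng.normal(0, 0.1, (n_hidden, n_in))      # input -> hidden
            self.W_hh = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden
            self.W_hy = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output
            self.context = np.zeros(n_hidden)                     # internal state

        def step(self, x):
            # Hidden activation mixes the current stimulus with the stored context.
            h = np.tanh(self.W_xh @ x + self.W_hh @ self.context)
            self.context = h            # carry trajectory information forward
            return self.W_hy @ h        # e.g., value and/or action preferences

    # Usage: feed a sequence of stimuli; outputs differ with history.
    net = ElmanNet(n_in=3, n_hidden=8, n_out=1)
    for x in [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]:
        print(net.step(x))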
Suppose that, in an AHC algorithm, the value and action mappings are
approximated via a parametrized functional form (e.g., a neural
network). Parameters are adjusted to fit learned values and actions
using TD(0) for the value adjustment. Let the squared error be
(value of the current step + value of finishing given the next state
- value of finishing given the current state)^2. Changing the mapping parameters
affects the second and third term of the squared error. On a small
toy problem, which is a special case of a Markovian decision problem,
if the dynamics are deterministic and parameters are adjusted in the
negative gradient direction based on the above observation about
parameter effects and the step-size slowly goes to zero, the right
answer appears to result. Interestingly, if one neglects the effect
of parameters on the second term in the squared error above when
computing the "gradient", all still goes well, perhaps even more
rapidly. Now, and this is the point of my presentation, if the
dynamics are stochastic, using the wrong gradient (where the effect of
parameter changes on the second term in the squared error is
neglected) seems to yield almost, but not quite, correct expected
values, while using the right gradient including the effect on the
second term yields very wrong results. While I have some intuition
about why the right gradient gives the wrong result for a stochastic
problem, I have none concerning why the wrong gradient appears to give
the right result for the deterministic case and almost, but not quite,
correct results for the stochastic case. The listener is encouraged
to use the wrong gradient in his/her AHC computations for stochastic
problems, and hopefully to figure out why it almost works. More
crucially, one wonders, since both gradients are probably provably
wrong, whether any procedure using a parametrized function approximator
is provably correct, or whether "approximately optimal" is the best one
can hope for concerning Markovian decision models.
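For concreteness, the sketch below (illustrative linear features, step
size, and transition, not tied to the toy problem above) contrasts the
"right" gradient of the squared TD(0) error, which includes the effect
of the parameters on the next-state value term, with the "wrong"
gradient that neglects it.

    # Minimal sketch (illustrative assumptions throughout): a linear value
    # approximator V(s) = w . phi(s) trained with TD(0). The "right" gradient
    # of 0.5*delta^2 includes the parameters' effect on the next-state value
    # (the residual-gradient form); the "wrong" gradient ignores that effect
    # (the usual TD(0) semi-gradient update).
    import numpy as np

    def residual_gradient_update(w, phi_s, phi_next, r, alpha, gamma=1.0):
        # delta = r + gamma*V(s') - V(s); descend the full gradient of 0.5*delta^2.
        delta = r + gamma * (w @ phi_next) - w @ phi_s
        return w + alpha * delta * (phi_s - gamma * phi_next)

    def semi_gradient_update(w, phi_s, phi_next, r, alpha, gamma=1.0):
        # Same delta, but the dependence of V(s') on w is ignored ("wrong" gradient).
        delta = r + gamma * (w @ phi_next) - w @ phi_s
        return w + alpha * delta * phi_s

    # Usage on a single hypothetical transition s -> s' with reward r = 1.
    phi_s, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    w = np.zeros(2)
    w = semi_gradient_update(w, phi_s, phi_next, r=1.0, alpha=0.1)
    print(w)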
Online Fitted Reinforcement Learning
Geoff Gordon (CMU)
My paper in the main portion of the conference deals with fitted value
iteration or Q-learning for offline problems, i.e., those where we have
a model of the environment so that we can examine arbitrary
transitions in arbitrary order. The same techniques also allow us to
do Q-learning for an online problem, i.e., one where we have no model
but must instead perform experiments inside the MDP to gather data. I
will describe how.
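Purely as a generic illustration of fitted Q-learning on experience
gathered online (not the specific averager-based method to be described
in the talk; the toy MDP, exploration policy, and per-cell averaging fit
are all assumptions), a sketch:

    # Fitted Q-iteration on transitions gathered by acting in the MDP,
    # rather than by querying a model for arbitrary transitions.
    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 3, 2, 0.9

    def env_step(s, a):
        # Hypothetical toy MDP used only to generate experience.
        s_next = (s + 1) % n_states if a == 0 else rng.integers(n_states)
        r = 1.0 if s_next == 0 else 0.0
        return r, s_next

    # 1. Gather a batch of experience with an exploratory (random) policy.
    batch, s = [], 0
    for _ in range(500):
        a = rng.integers(n_actions)
        r, s_next = env_step(s, a)
        batch.append((s, a, r, s_next))
        s = s_next

    # 2. Fitted Q-iteration: repeatedly regress Q onto one-step backed-up
    #    targets, here with a deliberately simple per-(s,a) averager as the fit.
    Q = np.zeros((n_states, n_actions))
    for _ in range(50):
        targets, counts = np.zeros_like(Q), np.zeros_like(Q)
        for (s, a, r, s_next) in batch:
            targets[s, a] += r + gamma * Q[s_next].max()
            counts[s, a] += 1
        Q = np.where(counts > 0, targets / np.maximum(counts, 1), Q)

    print(Q)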
Solving Partially Observable Markov Decision Processes Via VFA
Michael L. Littman (Brown)
Partially observable Markov decision processes (POMDPs) generalize
MDPs to the case where the decision maker does not necessarily have
perfect information as to its current state. It is well-known that,
given a complete description of a POMDP environment, such problems can
be formulated as continuous-space MDPs where the states are
probability distributions over the discrete states of the environment.
POMDP solution methods can be viewed as one of the early, tentatively
successful applications of VFA. Using some knowledge of the special
structure of POMDP value functions, we have used a simple VFA scheme to
generate high-quality solutions to larger problems than are possible using
traditional analytic techniques. In my talk, I will summarize the
work in this area by Parr and Russell at Berkeley, and that of
Littman, Cassandra, and Kaelbling at Brown.
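As a reminder of the belief-state construction mentioned above, the
sketch below shows the Bayes update that turns a POMDP into a
continuous-space MDP over probability distributions; the tiny transition
and observation models are illustrative, not one of the test problems.

    # The POMDP's transition model T[a][s, s'] and observation model
    # O[a][s', o] are assumed known; the "state" of the derived
    # continuous-space MDP is the belief b, a distribution over the
    # POMDP's discrete states.
    import numpy as np

    def belief_update(b, a, o, T, O):
        # b'(s') is proportional to O[a][s', o] * sum_s T[a][s, s'] * b(s).
        predicted = b @ T[a]              # predict next-state distribution
        unnorm = O[a][:, o] * predicted   # weight by observation likelihood
        return unnorm / unnorm.sum()

    # Hypothetical 2-state, 1-action, 2-observation POMDP.
    T = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}
    O = {0: np.array([[0.8, 0.2], [0.3, 0.7]])}
    b = np.array([0.5, 0.5])
    print(belief_update(b, a=0, o=1, T=T, O=O))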
[Special note: I am very interested in seeing how well various
approaches to VFA handle POMDP domains. If you have a general
approach to finding good value functions for continuous space MDPs and
would be willing to let us try your algorithm on our problem, please
contact me: mlittman@cs.brown.edu.]
Learning Smooth Control by Fitting Value Function Derivatives
Satinder Singh (MIT)
The Hamilton-Jacobi-Bellman (HJB) equation for continuous optimal
control problems is expressed solely in terms of derivatives of the
value function*. Discretizing such problems leads to the familiar
Bellman equation that forms the basis for the "backups" in both
dynamic programming and reinforcement learning algorithms. We propose
a new method for solving continuous control problems that attempts to
train a function approximator to directly represent the derivatives of
the value function, without discretizing the state space. The goal of
the training is to satisfy, as well as possible, the constraints imposed
by the HJB equation (and the curl condition). We also use the crucial
fact that it is possible to do policy iteration based on these
derivatives alone. Note that there is nothing analogous to the usual
backup in this method. We have tested our method by approximating the
derivatives of the value function as a linearly weighted combination
of local conservative vector fields, derived as gradients of standard
radial basis functions. So far the method has been applied only to
simple problems, and it may have several limitations (including
difficulty in dealing with discontinuities in the value function).
* (Note that derivatives of the value function are closely related
to the difference between Q (or state-action) values and V (or
state) values, called "advantages" by Baird. However, there does not
seem to be any way of learning the relative quantities
(advantages) without also learning absolute quantities in the
general discrete MDP. We show that at least in certain continuous
problems, it is possible to learn the relative quantities
(derivatives of value functions) alone.)
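As a concrete illustration of the approach described above, here is a
minimal sketch under stated assumptions (1-D state, simple-integrator
dynamics xdot = u, fixed policy u = -x, quadratic reward, undiscounted
HJB; all illustrative, not the experiments referred to in the abstract)
of fitting the value-function derivative as a weighted sum of RBF
gradients by driving the HJB residual to zero in a least-squares sense.

    # Fit V'(x) as a weighted sum of gradients of Gaussian RBFs so that
    #   r(x, pi(x)) + V'(x) * f(x, pi(x))  -->  0   (the HJB residual).
    # No backup appears anywhere.
    import numpy as np

    centers = np.linspace(-2.0, 2.0, 15)   # RBF centers (assumed)
    sigma = 0.4

    def dphi(x):
        # Gradients of Gaussian RBFs; any weighted sum is a conservative field.
        return -(x - centers) / sigma**2 * np.exp(-(x - centers)**2 / (2 * sigma**2))

    f = lambda x: -x                 # dynamics under the fixed policy u = -x
    r = lambda x: -2.0 * x**2        # reward r(x, u) = -(x^2 + u^2) with u = -x

    xs = np.linspace(-2.0, 2.0, 200)                 # sample states
    A = np.array([f(x) * dphi(x) for x in xs])       # residual is linear in w
    b = np.array([-r(x) for x in xs])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)

    print(w @ dphi(1.0))   # compare with the exact V'(1) = -2 for this problem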
On The Virtues of Linear Learning and Online Training
Rich Sutton (U. Mass.)
In contrast to recent work by Boyan and Moore, I have obtained
excellent results applying conventional VFA methods to control tasks
such as "Mountain Car" and "Acrobot". The difference in our results
is due to the form of function approximator and the nature of the
training. I used a local function approximator known as a
CMAC---essentially a linear learner using a sparse, coarse-coded
representation of the state. Although not a complete solution, I
argue that this approach solves the "curse of dimensionality" as well
as one can expect, and enables VFA systems to generalize as well as
other artificial learning systems. Also, I trained online, using
actual, experienced state trajectories and a form of the TD(lambda)
algorithm. A theorem by Dayan assures stability of TD(lambda) when
trained online. Gordon and Tsitsiklis and Van Roy, however, have
recently shown that TD(lambda) can be unstable when trained offline as
by Boyan and Moore. I conclude by discussing possibilities for
extending Dayan's theorem to prove tight bounds on the accuracy of
approximation under conditions of online training.
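A minimal sketch of the kind of approximator and training described
above (the tiling counts, widths, and step sizes are illustrative
assumptions, not the configuration used in the experiments):

    # A CMAC-style coarse coding over a 1-D state gives a sparse binary
    # feature vector; the linear value estimate is trained online with
    # semi-gradient TD(lambda) using accumulating eligibility traces.
    import numpy as np

    n_tilings, tiles_per_tiling = 8, 10
    n_features = n_tilings * tiles_per_tiling

    def features(x):
        # Indices of the active tile in each offset tiling, for x in [0, 1).
        idx = []
        for t in range(n_tilings):
            offset = t / (n_tilings * tiles_per_tiling)
            tile = int((x + offset) * tiles_per_tiling) % tiles_per_tiling
            idx.append(t * tiles_per_tiling + tile)
        return idx

    def value(w, x):
        return sum(w[i] for i in features(x))

    def td_lambda_step(w, e, x, r, x_next, alpha=0.1 / n_tilings, gamma=1.0, lam=0.9):
        delta = r + gamma * value(w, x_next) - value(w, x)
        e *= gamma * lam                      # decay all traces
        for i in features(x):                 # accumulate traces of active features
            e[i] += 1.0
        w += alpha * delta * e                # semi-gradient TD(lambda) update
        return w, e

    # Usage on an arbitrary experienced transition.
    w, e = np.zeros(n_features), np.zeros(n_features)
    w, e = td_lambda_step(w, e, x=0.3, r=1.0, x_next=0.35)
    print(value(w, 0.3))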
Counterintuitive aspects of TD-Gammon
Gerry Tesauro (IBM)
As most attendees at this workshop probably know, a number of
surprising and counterintuitive results have been obtained in applying
TD learning and a neural network function approximator to the domain of
backgammon, results that by and large have not yet been obtained in
other applications. This talk will focus on some of these counterintuitive
findings, with particular emphasis on new data that has been obtained
in the domain of "Hypergammon," i.e., 3-checker backgammon.
Hypergammon is a much simpler yet still non-trivial variant of
backgammon, and it has the advantage that TD-trained neural nets can
be compared to the exact solution, which is commercially available as
a database on CD-ROM. Specific topics to be covered include: (a) why
relative error (i.e. the quality of the policy) is substantially
better than absolute error (i.e. the quality of the value function);
(b) scaling and parameter tuning, and the apparent insensitivity of
training to the particular value of lambda; (c) why quality of play is
surprisingly strong even with a "raw" input description; (d) why
early-game play is significantly better than end-game play.
[Effects of Overestimation in Q-learning with Function Approximation]
Sebastian Thrun (Bonn)
[abstract]
Input Representation Engineering
Paul Utgoff (U. Mass.)
The effectiveness of value-function approximation depends
in large part on the ease with which a good approximation
can be represented over the input variables. I will illustrate
some of the input representation engineering that was needed
to enable a TD-lambda learner to find a high-quality evaluation
function for Othello.
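Purely to illustrate what such input engineering can look like, here
are some generic, hypothetical Othello features (not necessarily those
used in this work): instead of the raw board, the learner sees a small
vector of hand-crafted quantities.

    # Hypothetical illustration of input-representation engineering for Othello.
    import numpy as np

    CORNERS = [(0, 0), (0, 7), (7, 0), (7, 7)]

    def engineered_features(board, player):
        # board: 8x8 array with +1 (player), -1 (opponent), 0 (empty).
        mine, theirs = (board == player).sum(), (board == -player).sum()
        corner_diff = sum(board[r, c] for r, c in CORNERS) * player
        edge_mask = np.zeros((8, 8), dtype=bool)
        edge_mask[[0, 7], :] = True
        edge_mask[:, [0, 7]] = True
        edge_diff = (board[edge_mask] * player).sum()
        return np.array([
            (mine - theirs) / 64.0,    # disc differential
            corner_diff / 4.0,         # corner occupancy
            edge_diff / 28.0,          # edge occupancy
            (board == 0).sum() / 64.0  # game phase (fraction of empty squares)
        ])

    print(engineered_features(np.zeros((8, 8), dtype=int), player=1))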
PANEL DISCUSSION
Baird-Boyan-Moore-Singh-Sutton and audience
Some questions for discussion include:
- Are there function approximators especially suited to VFA?
- What does online learning (backups on real trajectories) buy you?
- Are there alternatives to TD/backup-based methods?
- Do we need value functions at all (what about Parti-game, value-less
  approaches, or optimization approaches)?
- What are the roles of models? Note that the success stories described
  by Tesauro, Crites, and Dietterich have all applied to problems in
  which the model is known a priori.
- Have these problems already been attacked in other fields? Games?
  Control? OR?
- How important are the consequences if VFA-based RL becomes a
  well-used, reliable method?