Formally, let our search space be represented as a Markov decision problem (MDP), defined by a set of states $X$, a set of actions $A$, transition probabilities $P(x' \mid x, a)$, and a reward function $R(x,a)$. Specifying a fixed deterministic policy $\pi : X \rightarrow A$ reduces an MDP to a Markov chain. The policy value function $V^\pi$ is the expected long-term reward accumulated over trajectories through the chain:
\begin{equation}
V^\pi(x) = E\!\left[ \sum_{t=0}^{\infty} \gamma^t R(x_t, \pi(x_t)) \;\middle|\; x_0 = x \right].
\end{equation}
Here, $\gamma \in [0,1]$ is a discount factor which determines how important long-term rewards are; my work in this thesis predominantly uses the undiscounted case of $\gamma = 1$.
The form of $V^\pi$ above is convenient because it satisfies this linear system of Bellman equations for prediction:
\begin{equation}
V^\pi(x) = R(x, \pi(x)) + \gamma \sum_{x' \in X} P(x' \mid x, \pi(x))\, V^\pi(x') \qquad \forall x \in X.
\end{equation}
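For a fixed policy these equations are linear in the unknowns $V^\pi(x)$, so for a small chain they can be solved directly. The following minimal Python/NumPy sketch does exactly that; the three-state chain, its transition matrix, and its rewards are hypothetical illustrations, not an example from this thesis.
\begin{verbatim}
import numpy as np

# Hypothetical 3-state chain induced by a fixed policy: P[i, j] is the
# probability of moving from state i to state j under the policy, and
# R[i] is the expected one-step reward at state i.
P = np.array([[0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([-1.0, -1.0, 0.0])
gamma = 0.95                      # discounted, so (I - gamma*P) is invertible

# Bellman prediction equations in matrix form: V = R + gamma * P V,
# i.e. (I - gamma*P) V = R, a linear system solved directly.
V_pi = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V_pi)
\end{verbatim}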
The optimal solution to an MDP is a policy $\pi^*$ which maximizes $V^\pi$. The policy value function of $\pi^*$ is the optimal value function $V^* = V^{\pi^*}$. It satisfies the Bellman equations for control:
\begin{equation}
V^*(x) = \max_{a \in A} \left[ R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, V^*(x') \right] \qquad \forall x \in X.
\end{equation}
From the value function $V^*$, it is easy to infer the optimal policy: at any state $x$, any action which instantiates the max in Equation 4 is an optimal choice [Bellman1957]. This formalizes the notion that $V^*$ is an ideal evaluation function.
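In the notation above, this greedy rule can be written as
\begin{equation}
\pi^*(x) \in \arg\max_{a \in A} \left[ R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, V^*(x') \right],
\end{equation}
with ties broken arbitrarily.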
For small problems of up to $10^6$ or so discrete states, the value
function can be stored in a lookup table and computed by any of a
variety of algorithms, including linear programming
[Denardo1982], policy iteration [Howard1960], and
value iteration [Bellman1957]. Value iteration (VI) has been
the basis for the most important simulation-based algorithms for
learning $V^*$, so I focus on it here.
VI computes $V^*$ by repeatedly sweeping over the state space, applying Equation 4 as an assignment statement (this is called a ``one-step backup'') at each state in parallel. If the lookup table is initialized with all 0's, then after the $i^{\textrm{th}}$ sweep of VI, the table will represent the maximum expected return of a path of length $i$ from each state. For certain goal-oriented domains, this corresponds to the intuition that VI works by propagating correct $V^*$ values backwards, by one step per iteration, from the terminal states.
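As a concrete sketch of this sweep, a minimal Python illustration follows; the dictionary-based MDP encoding and the small three-state chain are hypothetical, not code or an example from this thesis.
\begin{verbatim}
def value_iteration(mdp, gamma=1.0, sweeps=100):
    # mdp[x] is assumed to map each action to (reward, {successor: prob});
    # terminal states map to an empty dict and keep the value 0.
    V = {x: 0.0 for x in mdp}                    # lookup table, all 0's
    for _ in range(sweeps):
        V = {x: max((r + gamma * sum(p * V[y] for y, p in succ.items())
                     for r, succ in mdp[x].values()),
                    default=0.0)                 # terminal states stay 0
             for x in mdp}                       # back up every state in parallel
    return V

# Hypothetical 3-state chain: step right toward terminal state 2, cost 1 per step.
mdp = {0: {'right': (-1.0, {1: 1.0})},
       1: {'right': (-1.0, {2: 1.0})},
       2: {}}
print(value_iteration(mdp))   # converges to {0: -2.0, 1: -1.0, 2: 0.0}
\end{verbatim}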
More precisely, there are two main classes of MDPs for which correct $V^*$ values can be assigned by working strictly backwards from terminal states:
\begin{itemize}
\item acyclic MDPs, whose states can be processed in reverse topological order by the shortest-paths algorithm for directed acyclic graphs (DAG-SP); and
\item MDPs with deterministic transitions and non-positive rewards (ordinary shortest-path problems with non-negative step costs), for which Dijkstra's algorithm assigns final values to states in order of decreasing $V^*$.
\end{itemize}
Another difference between VI and working backwards is that VI
repeatedly re-estimates the values at every state, using old
predictions to generate new training values. By contrast, Dijkstra
and DAG-SP are always explicitly aware of which states' values are
already known, and can hold those values fixed. This
will be important when we introduce generalization and the possibility
of approximation error.
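For contrast, here is a similarly hedged sketch of the backward-working style on an acyclic MDP, reusing the hypothetical dictionary encoding and example chain from the VI sketch and assuming a topological ordering of the states is given: each value is computed exactly once, from successor values that are already final.
\begin{verbatim}
def dag_sp_values(mdp, topo_order, gamma=1.0):
    # topo_order lists states so that every successor comes later than its
    # predecessors; sweeping it in reverse means each backup reads only
    # values that are already known and held fixed.
    V = {}
    for x in reversed(topo_order):
        V[x] = max((r + gamma * sum(p * V[y] for y, p in succ.items())
                    for r, succ in mdp[x].values()),
                   default=0.0)                  # terminal states get 0
    return V

print(dag_sp_values(mdp, topo_order=[0, 1, 2]))  # same answer, one pass
\end{verbatim}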
The VI, Dijkstra and DAG-SP algorithms all apply exclusively to MDPs
for which the state space can be exhaustively enumerated and the value
function represented as a lookup table. For the quantized
high-dimensional state spaces characteristic of real-world control
tasks, the curse of dimensionality makes such enumeration
intractable. Computing $V^*$ requires generalization: one natural technique is to encode the states as real-valued feature vectors and to use a function approximator to fit $V^*$ over this feature space.
This technique goes by the names Neuro-Dynamic Programming [Bertsekas and Tsitsiklis 1996] and Value Function Approximation (VFA) [Boyan et al. 1995].
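As a rough illustration of the idea rather than of any particular algorithm from these references, the sketch below assumes a hypothetical feature map \texttt{phi} and fits a linear approximation of $V^*$ over the feature space by least squares, given training values such as sampled one-step backups or observed returns.
\begin{verbatim}
import numpy as np

def fit_value_function(phi, states, targets):
    # phi(x) returns a real-valued feature vector for state x; targets are
    # training values for V at the sampled states (e.g. one-step backups).
    Phi = np.array([phi(x) for x in states])          # design matrix
    w, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return lambda x: float(np.dot(w, phi(x)))         # approximate V(x)
\end{verbatim}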