Formally, let our search space be represented as a Markov decision problem (MDP), defined by a tuple of states, actions, transition probabilities, and rewards:
\begin{equation}
M = \langle X, A, P, R \rangle,
\end{equation}
where $X$ is a set of states; $A(x)$ is the set of actions available at state $x$; $P(x' \mid x, a)$ is the probability of transitioning to state $x'$ when action $a$ is taken at state $x$; and $R(x,a)$ is the immediate reward received for taking action $a$ at state $x$.
Specifying a fixed deterministic policy $\pi : X \to A$ reduces an MDP to a Markov chain. The policy value function $V^\pi$ is the expected long-term reward accumulated over trajectories through the chain:
\begin{equation}
V^\pi(x) \;=\; E\!\left[\, \sum_{t=0}^{\infty} \gamma^t\, R(x_t, \pi(x_t)) \;\middle|\; x_0 = x \right].
\end{equation}
Here, $\gamma \in [0,1]$ is a discount factor which determines how important long-term rewards are; my work in this thesis predominantly uses the undiscounted case of $\gamma = 1$. The form of $V^\pi$ above is convenient because it satisfies this linear system of Bellman equations for prediction:
\begin{equation}
V^\pi(x) \;=\; R(x, \pi(x)) + \gamma \sum_{x' \in X} P(x' \mid x, \pi(x))\, V^\pi(x') \qquad \forall x \in X.
\end{equation}
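Because these equations are linear in the unknowns $V^\pi(x)$, the tabular case can be solved in closed form. Writing $P^\pi$ for the transition matrix of the chain induced by $\pi$ and $R^\pi$ for the corresponding vector of one-step rewards (matrix notation introduced here only for illustration), the system is solved by
\[
V^\pi \;=\; (I - \gamma P^\pi)^{-1} R^\pi ,
\]
whenever $\gamma < 1$ or the chain is absorbing, so that the inverse exists.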
The optimal solution to an MDP is a policy $\pi^*$ which maximizes $V^\pi$ at every state. The policy value function of $\pi^*$ is the optimal value function $V^*$. It satisfies the Bellman equations for control:
\begin{equation}
V^*(x) \;=\; \max_{a \in A(x)} \left[\, R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, V^*(x') \right] \qquad \forall x \in X.
\end{equation}
From the value function $V^*$, it is easy to infer the optimal policy: at any state $x$, any action which instantiates the max in Equation 4 is an optimal choice [Bellman1957]. This formalizes the notion that $V^*$ is an ideal evaluation function.
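To make this one-step lookahead concrete, here is a minimal sketch (not code from this thesis), assuming a hypothetical tabular representation in which \verb|P[x][a]| is a list of (next-state, probability) pairs and \verb|R[x][a]| is the immediate reward:
\begin{verbatim}
def greedy_policy(V, P, R, gamma=1.0):
    """For each state, return an action achieving the max in the
    Bellman equation for control, using the value table V."""
    policy = {}
    for x in P:
        best_action, best_q = None, float('-inf')
        for a in P[x]:                # P[x][a] = [(x_next, prob), ...]
            q = R[x][a] + gamma * sum(prob * V[x_next]
                                      for x_next, prob in P[x][a])
            if q > best_q:
                best_action, best_q = a, q
        policy[x] = best_action
    return policy
\end{verbatim}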
For small problems of up to $10^6$ or so discrete states, the value function can be stored in a lookup table and computed by any of a variety of algorithms, including linear programming [Denardo1982], policy iteration [Howard1960], and value iteration [Bellman1957]. Value iteration (VI) has been the basis for the most important simulation-based algorithms for learning $V^*$, so I focus on it here.
VI computes $V^*$ by repeatedly sweeping over the state space, applying Equation 4 as an assignment statement (this is called a ``one-step backup'') at each state in parallel. If the lookup table is initialized with all 0's, then after the $i^{\rm th}$ sweep of VI, the table will represent the maximum expected return of a path of length $i$ from each state. For certain goal-oriented domains, this corresponds to the intuition that VI works by propagating correct values backwards, by one step per iteration, from the terminal states.
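A minimal tabular sketch of VI along these lines, again using the hypothetical \verb|P|/\verb|R| representation above and synchronous sweeps (illustration only, not the implementations used later in this thesis):
\begin{verbatim}
def value_iteration(P, R, gamma=1.0, sweeps=100):
    """Tabular VI: apply the Bellman equation for control as an
    assignment (a one-step backup) at every state, sweep after sweep."""
    V = {x: 0.0 for x in P}              # lookup table initialized with 0's
    for _ in range(sweeps):
        V_new = {}
        for x in P:                      # back up all states "in parallel"
            if not P[x]:                 # terminal state: no actions
                V_new[x] = 0.0
            else:
                V_new[x] = max(
                    R[x][a] + gamma * sum(prob * V[x_next]
                                          for x_next, prob in P[x][a])
                    for a in P[x])
        V = V_new
    return V
\end{verbatim}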
More precisely, there are two main classes of MDPs for which correct values can be assigned by working strictly backwards from terminal states:
\begin{itemize}
\item acyclic MDPs, whose state-transition graph contains no cycles: the states can be processed in reverse topological order, so that each state is backed up only after all of its successors, as in the DAG-SP shortest-path algorithm for directed acyclic graphs; and
\item deterministic MDPs with nonnegative action costs, to which Dijkstra's shortest-path algorithm applies: states are permanently assigned their values in order of increasing distance from the terminal states.
\end{itemize}
Another difference between VI and working backwards is that VI repeatedly re-estimates the values at every state, using old predictions to generate new training values. By contrast, Dijkstra and DAG-SP are always explicitly aware of which states have their values already known, and can hold those values fixed. This will be important when we introduce generalization and the possibility of approximation error.
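For the acyclic case, this contrast can be seen in a sketch of DAG-SP-style backward induction (same hypothetical representation as above; the third argument is assumed to list states so that every successor of a state appears before the state itself):
\begin{verbatim}
def dag_sp_values(P, R, states_successors_first, gamma=1.0):
    """Backward induction on an acyclic MDP: each state is backed up
    exactly once, after all of its successors' values are known, and
    its value is then held fixed."""
    V = {}
    for x in states_successors_first:    # successors precede their predecessors
        if not P[x]:                     # terminal state
            V[x] = 0.0
        else:
            V[x] = max(
                R[x][a] + gamma * sum(prob * V[x_next]
                                      for x_next, prob in P[x][a])
                for a in P[x])
    return V
\end{verbatim}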
The VI, Dijkstra and DAG-SP algorithms all apply exclusively to MDPs for which the state space can be exhaustively enumerated and the value function represented as a lookup table. For the quantized high-dimensional state spaces characteristic of real-world control tasks, the curse of dimensionality makes such enumeration intractable. Computing $V^*$ then requires generalization: one natural technique is to encode the states as real-valued feature vectors and to use a function approximator to fit $V^*$ over this feature space. This technique goes by the names Neuro-Dynamic Programming [Bertsekas and Tsitsiklis 1996] and Value Function Approximation (VFA) [Boyan et al. 1995].
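As a bare-bones illustration of the idea (not the approximation architectures studied in this thesis), one could encode each state as a feature vector $\phi(x)$ and fit a linear approximation $\tilde{V}(x) = w \cdot \phi(x)$ to a set of estimated state values by least squares:
\begin{verbatim}
import numpy as np

def fit_linear_value_function(features, values):
    """Least-squares fit of V(x) ~ w . phi(x).

    features: (n_states, n_features) array, one row phi(x) per sampled state
    values:   (n_states,) array of estimated values for those states
    """
    w, _, _, _ = np.linalg.lstsq(features, values, rcond=None)
    return w

def approximate_value(w, phi_x):
    """Evaluate the fitted approximation at a new state's feature vector."""
    return float(np.dot(w, phi_x))
\end{verbatim}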