Formally, let our search space be represented as a Markov decision problem (MDP), defined by a tuple of states, actions, transition probabilities, and rewards:
\begin{equation}
M = \langle X, A, P, R \rangle,
\end{equation}
where $X$ is a set of states; $A(x)$ is the set of actions available at state $x$; $P(x' \mid x, a)$ is the probability of transitioning to state $x'$ when action $a$ is taken at state $x$; and $R(x,a)$ is the immediate reward received for taking action $a$ at state $x$.
Specifying a fixed deterministic policy $\pi : X \to A$ reduces an MDP to a Markov chain. The policy value function $V^\pi$ is the expected long-term reward accumulated over trajectories through the chain:
\begin{equation}
V^\pi(x) \;=\; E\!\left[\, \sum_{t=0}^{\infty} \gamma^t\, R(x_t, \pi(x_t)) \;\middle|\; x_0 = x \right].
\end{equation}
Here, $\gamma \in [0,1]$ is a discount factor which determines how important long-term rewards are; my work in this thesis predominantly uses the undiscounted case of $\gamma = 1$. The form of $V^\pi$ above is convenient because it satisfies this linear system of Bellman equations for prediction:
\begin{equation}
V^\pi(x) \;=\; R(x, \pi(x)) + \gamma \sum_{x' \in X} P(x' \mid x, \pi(x))\, V^\pi(x') \qquad \forall x \in X.
\end{equation}
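Because these equations are linear in the unknowns $V^\pi(x)$, the tabular case can be solved in closed form. Writing $P^\pi$ for the transition matrix of the chain induced by $\pi$ and $R^\pi$ for the corresponding vector of one-step rewards (matrix notation introduced here only for illustration), the system is solved by
\[
V^\pi \;=\; (I - \gamma P^\pi)^{-1} R^\pi ,
\]
whenever $\gamma < 1$ or the chain is absorbing, so that the inverse exists.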
The optimal solution to an MDP is a policy $\pi^*$ which maximizes $V^\pi$ at every state. The policy value function of $\pi^*$ is the optimal value function $V^*$. It satisfies the Bellman equations for control:
\begin{equation}
V^*(x) \;=\; \max_{a \in A(x)} \left[\, R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, V^*(x') \right] \qquad \forall x \in X.
\end{equation}
From the value function $V^*$, it is easy to infer the optimal policy: at any state $x$, any action which instantiates the max in Equation 4 is an optimal choice [Bellman1957]. This formalizes the notion that $V^*$ is an ideal evaluation function.
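To make this one-step lookahead concrete, here is a minimal sketch (not code from this thesis), assuming a hypothetical tabular representation in which \verb|P[x][a]| is a list of (next-state, probability) pairs and \verb|R[x][a]| is the immediate reward:
\begin{verbatim}
def greedy_policy(V, P, R, gamma=1.0):
    """For each state, return an action achieving the max in the
    Bellman equation for control, using the value table V."""
    policy = {}
    for x in P:
        best_action, best_q = None, float('-inf')
        for a in P[x]:                # P[x][a] = [(x_next, prob), ...]
            q = R[x][a] + gamma * sum(prob * V[x_next]
                                      for x_next, prob in P[x][a])
            if q > best_q:
                best_action, best_q = a, q
        policy[x] = best_action
    return policy
\end{verbatim}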
For small problems of up to $10^6$ or so discrete states, the value function can be stored in a lookup table and computed by any of a variety of algorithms, including linear programming [Denardo1982], policy iteration [Howard1960], and value iteration [Bellman1957]. Value iteration (VI) has been the basis for the most important simulation-based algorithms for learning $V^*$, so I focus on it here.
VI computes $V^*$ by repeatedly sweeping over the state space, applying Equation 4 as an assignment statement (this is called a ``one-step backup'') at each state in parallel. If the lookup table is initialized with all 0's, then after the $i^{\rm th}$ sweep of VI, the table will represent the maximum expected return of a path of length $i$ from each state. For certain goal-oriented domains, this corresponds to the intuition that VI works by propagating correct values backwards, by one step per iteration, from the terminal states.
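A minimal tabular sketch of VI along these lines, again using the hypothetical \verb|P|/\verb|R| representation above and synchronous sweeps (illustration only, not the implementations used later in this thesis):
\begin{verbatim}
def value_iteration(P, R, gamma=1.0, sweeps=100):
    """Tabular VI: apply the Bellman equation for control as an
    assignment (a one-step backup) at every state, sweep after sweep."""
    V = {x: 0.0 for x in P}              # lookup table initialized with 0's
    for _ in range(sweeps):
        V_new = {}
        for x in P:                      # back up all states "in parallel"
            if not P[x]:                 # terminal state: no actions
                V_new[x] = 0.0
            else:
                V_new[x] = max(
                    R[x][a] + gamma * sum(prob * V[x_next]
                                          for x_next, prob in P[x][a])
                    for a in P[x])
        V = V_new
    return V
\end{verbatim}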
More precisely, there are two main classes of MDPs for which correct values can be assigned by working strictly backwards from terminal states:
\begin{itemize}
\item acyclic MDPs, whose state-transition graph contains no cycles: the states can be processed in reverse topological order, so that each state is backed up only after all of its successors, as in the DAG-SP shortest-path algorithm for directed acyclic graphs; and
\item deterministic MDPs with nonnegative action costs, to which Dijkstra's shortest-path algorithm applies: states are permanently assigned their values in order of increasing distance from the terminal states.
\end{itemize}
Another difference between VI and working backwards is that VI repeatedly re-estimates the values at every state, using old predictions to generate new training values. By contrast, Dijkstra and DAG-SP are always explicitly aware of which states have their values already known, and can hold those values fixed. This will be important when we introduce generalization and the possibility of approximation error.
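For the acyclic case, this contrast can be seen in a sketch of DAG-SP-style backward induction (same hypothetical representation as above; the third argument is assumed to list states so that every successor of a state appears before the state itself):
\begin{verbatim}
def dag_sp_values(P, R, states_successors_first, gamma=1.0):
    """Backward induction on an acyclic MDP: each state is backed up
    exactly once, after all of its successors' values are known, and
    its value is then held fixed."""
    V = {}
    for x in states_successors_first:    # successors precede their predecessors
        if not P[x]:                     # terminal state
            V[x] = 0.0
        else:
            V[x] = max(
                R[x][a] + gamma * sum(prob * V[x_next]
                                      for x_next, prob in P[x][a])
                for a in P[x])
    return V
\end{verbatim}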
The VI, Dijkstra and DAG-SP algorithms all apply exclusively to MDPs for which the state space can be exhaustively enumerated and the value function represented as a lookup table. For the quantized high-dimensional state spaces characteristic of real-world control tasks, the curse of dimensionality makes such enumeration intractable. Computing $V^*$ then requires generalization: one natural technique is to encode the states as real-valued feature vectors and to use a function approximator to fit $V^*$ over this feature space. This technique goes by the names Neuro-Dynamic Programming [Bertsekas and Tsitsiklis 1996] and Value Function Approximation (VFA) [Boyan et al. 1995].
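As a bare-bones illustration of the idea (not the approximation architectures studied in this thesis), one could encode each state as a feature vector $\phi(x)$ and fit a linear approximation $\tilde{V}(x) = w \cdot \phi(x)$ to a set of estimated state values by least squares:
\begin{verbatim}
import numpy as np

def fit_linear_value_function(features, values):
    """Least-squares fit of V(x) ~ w . phi(x).

    features: (n_states, n_features) array, one row phi(x) per sampled state
    values:   (n_states,) array of estimated values for those states
    """
    w, _, _, _ = np.linalg.lstsq(features, values, rcond=None)
    return w

def approximate_value(w, phi_x):
    """Evaluate the fitted approximation at a new state's feature vector."""
    return float(np.dot(w, phi_x))
\end{verbatim}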