A partially observable Markov decision process (POMDP) is a model for deciding how to act in ``an accessible, stochastic environment with a known transition model'' (Russell & Norvig [RN95], pg. 500). A POMDP is described by a set of states, a set of actions, a set of observations, the transition probabilities, the observation probabilities, a reward function, a discount factor, and an initial belief distribution over states.
The transition probabilities describe how the state evolves with actions, and
also represent the Markov assumption: the next state depends only on the
current (unobservable) state and action and is independent of the preceding
(unobserved) states and actions. The reward function describes the objective
of the control, and the discount factor is used to ensure reasonable behaviour
in the face of unlimited time. An optimal policy is known to always exist in
the discounted ($\gamma < 1$) case with bounded immediate reward [How60].
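To make these pieces concrete, here is a minimal sketch of a finite POMDP container together with the standard Bayes-filter belief update; the class and variable names (DiscretePOMDP, T, Z, R, gamma) are our own illustrative choices, not notation taken from the text.

```python
import numpy as np

class DiscretePOMDP:
    """Minimal finite POMDP container (illustrative sketch, not the authors' code).

    T[a, s, s2] = p(s2 | s, a)   -- transition probabilities
    Z[a, s2, z] = p(z | s2, a)   -- observation probabilities
    R[a, s]                      -- immediate reward
    gamma                        -- discount factor, 0 <= gamma < 1
    """

    def __init__(self, T, Z, R, gamma):
        self.T, self.Z, self.R, self.gamma = T, Z, R, gamma

    def belief_update(self, b, a, z):
        """Fold one action a and observation z into the belief b (Bayes filter)."""
        predicted = self.T[a].T @ b              # distribution over next states
        unnormalized = self.Z[a][:, z] * predicted
        return unnormalized / unnormalized.sum()
```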
POMDP policies are often computed using a value function over the belief space. The value function for a given policy $\pi$ is defined as the long-term expected reward the controller will receive starting at belief $b$ and executing the policy $\pi$ up to some horizon time, which may be infinite. The optimal POMDP policy maximizes this value function. The value function for a POMDP policy under a finite horizon can be described using a piece-wise linear function over the space of beliefs. Many algorithms compute the value function iteratively, evaluating and refining the current value function estimate until no further refinements can improve the expected reward of the policy from any belief.
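The piece-wise linear value function mentioned above is commonly stored as a set of vectors (often called alpha vectors), one linear function per vector, with the value at a belief given by the best of them. A small sketch under that assumption:

```python
import numpy as np

def value(belief, alpha_vectors):
    """Piece-wise linear value function: V(b) = max over alpha of alpha . b."""
    return max(float(np.dot(alpha, belief)) for alpha in alpha_vectors)

# Example for a three-state problem with two alpha vectors.
alphas = [np.array([1.0, 0.0, 0.5]), np.array([0.2, 0.9, 0.2])]
b = np.array([0.6, 0.3, 0.1])     # a point in the two-dimensional belief simplex
print(value(b, alphas))           # max(0.65, 0.41) = 0.65
```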
Figure 3(a) shows the belief space for a
three-state problem. The belief space is the two-dimensional, shaded
simplex. Each point on the simplex corresponds to a particular belief
(a three-dimensional vector), and the corners of the simplex represent
beliefs where the state is known with 100% certainty. The value
function shown in Figure 3(b) gives the long-term
expected reward of a policy, starting at any belief in the simplex.
Figure 3: (a) The belief space. (b) The value function.
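In symbols, the shaded simplex of Figure 3(a) is the set of three-dimensional probability vectors

\[ \Delta = \{\, b \in \mathbb{R}^3 \;:\; b_1 + b_2 + b_3 = 1,\ b_i \ge 0 \,\}, \]

whose corners $(1,0,0)$, $(0,1,0)$ and $(0,0,1)$ are the beliefs in which the state is known with certainty.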
The process of evaluating and refining the value function is at the core of why solving POMDPs is considered to be intractable. The value function is defined over the space of beliefs, which is continuous and high-dimensional; the belief space will have one fewer dimension than the number of states in the model. For a navigation problem in a map of thousands of possible states, computing the value function is an optimization problem over a continuous space with many thousands of dimensions, which is not feasible with existing algorithms.
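As a rough sense of scale (a back-of-the-envelope illustration, not a computation from the text), even discretizing the belief simplex at a coarse resolution produces an astronomical number of grid points once the state space is large:

```python
from math import comb

def simplex_grid_points(num_states, resolution):
    """Count beliefs whose entries are all multiples of 1/resolution.

    Equivalent to splitting `resolution` units of probability mass among
    `num_states` states: C(resolution + num_states - 1, num_states - 1).
    """
    return comb(resolution + num_states - 1, num_states - 1)

print(simplex_grid_points(3, 10))     # 66 grid points on the small simplex of Figure 3
print(simplex_grid_points(1000, 10))  # roughly 3e23 points for a 1,000-state map
```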
However, careful consideration of some real-world problems suggests a possible approach for finding approximate value functions. If we examine the beliefs that a navigating mobile robot encounters, we see that they share common attributes: they typically have a very small number of modes, and the particular shape of the modes is fairly generic. The modes move about and change in variance, but the ways in which they change are relatively constrained. In fact, even for real-world navigation problems with very large belief spaces, the beliefs have very few degrees of freedom.
Figure 4(a) illustrates this idea: it shows a typical belief that a mobile robot might experience while navigating in the nursing home environment of Figure 1(b). To visualize the distribution we sample a set of poses (also called particles) according to the distribution and plot the particles on the map. The distribution is unimodal and the probability mass is mostly concentrated in a small area. Figure 4(b) shows a very different kind of belief: probability mass is spread over a wide area, there are multiple modes, and the locations of particles bear little relationship to the map. It would be difficult to find a sequence of actions and observations that would result in such a belief.
Figure 4: (a) A common belief. (b) An unlikely belief.
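The particle plots of Figure 4 can be produced with a few lines of the following flavour; the function and argument names here are hypothetical, and the map-dependent plotting is only indicated in comments:

```python
import numpy as np

def sample_particles(belief, poses, num_particles=500, rng=None):
    """Draw poses (particles) according to a discrete belief over map states."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(belief), size=num_particles, p=belief)
    return poses[idx]   # e.g. an array of (x, y) positions, one row per particle

# particles = sample_particles(b, map_cell_centers)
# plt.scatter(particles[:, 0], particles[:, 1], s=2)   # overlay on the map image
```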
If real-world beliefs have few degrees of freedom, then they should be concentrated near a low-dimensional subset of the high-dimensional belief space--that is, the beliefs experienced by the controller should lie near a structured, low-dimensional surface embedded in the belief space. If we can find this surface, we will have a representation of the belief state in terms of a small set of bases or features. One benefit of such a representation is that we will need to plan only in terms of the small set of features: finding value functions in low-dimensional spaces is typically easier than finding value functions in high-dimensional spaces.
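Purely as an illustration of what ``a small set of bases or features'' could look like (the choice of dimensionality-reduction technique is left open here), one could collect the beliefs the controller actually experiences and fit linear bases to them with PCA, summarizing each belief by a few coefficients:

```python
import numpy as np

def fit_belief_bases(beliefs, num_features):
    """PCA over a collection of beliefs (one belief per row)."""
    B = np.asarray(beliefs)                  # shape: (num_beliefs, num_states)
    mean = B.mean(axis=0)
    _, _, Vt = np.linalg.svd(B - mean, full_matrices=False)
    return mean, Vt[:num_features]           # bases: (num_features, num_states)

def compress(belief, mean, bases):
    """Low-dimensional feature vector describing a belief."""
    return bases @ (belief - mean)

def reconstruct(features, mean, bases):
    """Approximate belief recovered from its feature vector."""
    return mean + bases.T @ features
```

A purely linear compression like this can reconstruct vectors with negative entries, which is one reason a nonlinear variant of the idea may be preferable; the nonlinear case is exactly where the second disadvantage discussed next arises.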
There are two potential disadvantages to this sort of representation. The first is that it contains an approximation: we are no longer finding the complete, optimal POMDP policy. Instead (as suggested in Figure 5) we are trying to find representations of the belief which are rich enough to allow good control but which are also sufficiently parsimonious to make the planning problem tractable. The second disadvantage is a technical one: because we are making a nonlinear transformation of the belief space, POMDP planning algorithms which assume a convex value function will no longer work. We discuss this problem in more detail in Section 6.
Figure 5.