A partially observable Markov decision process (POMDP) is a model for deciding how to act in ``an accessible, stochastic environment with a known transition model'' (Russell & Norvig [RN95], pg. 500). A POMDP is described by a set of states, a set of actions, a set of observations, the transition probabilities, the observation probabilities, a reward function, a discount factor, and an initial belief distribution over states.
The transition probabilities describe how the state evolves with actions, and
also represent the Markov assumption: the next state depends only on the
current (unobservable) state and action and is independent of the preceding
(unobserved) states and actions. The reward function describes the objective
of the control, and the discount factor is used to ensure reasonable behaviour
in the face of unlimited time. An optimal policy is known to always exist in
the discounted ($\gamma < 1$) case with bounded immediate reward [How60].
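To make these pieces concrete, here is a minimal sketch of a finite POMDP container together with the standard Bayes-filter belief update; the class and variable names (DiscretePOMDP, T, Z, R, gamma) are our own illustrative choices, not notation taken from the text.

```python
import numpy as np

class DiscretePOMDP:
    """Minimal finite POMDP container (illustrative sketch, not the authors' code).

    T[a, s, s2] = p(s2 | s, a)   -- transition probabilities
    Z[a, s2, z] = p(z | s2, a)   -- observation probabilities
    R[a, s]                      -- immediate reward
    gamma                        -- discount factor, 0 <= gamma < 1
    """

    def __init__(self, T, Z, R, gamma):
        self.T, self.Z, self.R, self.gamma = T, Z, R, gamma

    def belief_update(self, b, a, z):
        """Fold one action a and observation z into the belief b (Bayes filter)."""
        predicted = self.T[a].T @ b              # distribution over next states
        unnormalized = self.Z[a][:, z] * predicted
        return unnormalized / unnormalized.sum()
```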
POMDP policies are often computed using a value function over the belief space. The value function for a given policy $\pi$ is defined as the long-term expected reward the controller will receive starting at belief $b$ and executing the policy $\pi$ up to some horizon time, which may be infinite. The optimal POMDP policy maximizes this value function. The value function for a POMDP policy under a finite horizon can be described using a piece-wise linear function over the space of beliefs. Many algorithms compute the value function iteratively, evaluating and refining the current value function estimate until no further refinements can improve the expected reward of the policy from any belief.
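The piece-wise linear value function mentioned above is commonly stored as a set of vectors (often called alpha vectors), one linear function per vector, with the value at a belief given by the best of them. A small sketch under that assumption:

```python
import numpy as np

def value(belief, alpha_vectors):
    """Piece-wise linear value function: V(b) = max over alpha of alpha . b."""
    return max(float(np.dot(alpha, belief)) for alpha in alpha_vectors)

# Example for a three-state problem with two alpha vectors.
alphas = [np.array([1.0, 0.0, 0.5]), np.array([0.2, 0.9, 0.2])]
b = np.array([0.6, 0.3, 0.1])     # a point in the two-dimensional belief simplex
print(value(b, alphas))           # max(0.65, 0.41) = 0.65
```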
Figure 3(a) shows the belief space for a
three-state problem. The belief space is the two-dimensional, shaded
simplex. Each point on the simplex corresponds to a particular belief
(a three-dimensional vector), and the corners of the simplex represent
beliefs where the state is known with 100% certainty. The value
function shown in Figure 3(b) gives the long-term
expected reward of a policy, starting at any belief in the simplex.
Figure 3: (a) The belief space. (b) The value function.
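In symbols, the shaded simplex of Figure 3(a) is the set of three-dimensional probability vectors

\[ \Delta = \{\, b \in \mathbb{R}^3 \;:\; b_1 + b_2 + b_3 = 1,\ b_i \ge 0 \,\}, \]

whose corners $(1,0,0)$, $(0,1,0)$ and $(0,0,1)$ are the beliefs in which the state is known with certainty.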
The process of evaluating and refining the value function is at the core of why solving POMDPs is considered to be intractable. The value function is defined over the space of beliefs, which is continuous and high-dimensional; the belief space will have one fewer dimension than the number of states in the model. For a navigation problem in a map of thousands of possible states, computing the value function is an optimization problem over a continuous space with many thousands of dimensions, which is not feasible with existing algorithms.
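As a rough sense of scale (a back-of-the-envelope illustration, not a computation from the text), even discretizing the belief simplex at a coarse resolution produces an astronomical number of grid points once the state space is large:

```python
from math import comb

def simplex_grid_points(num_states, resolution):
    """Count beliefs whose entries are all multiples of 1/resolution.

    Equivalent to splitting `resolution` units of probability mass among
    `num_states` states: C(resolution + num_states - 1, num_states - 1).
    """
    return comb(resolution + num_states - 1, num_states - 1)

print(simplex_grid_points(3, 10))     # 66 grid points on the small simplex of Figure 3
print(simplex_grid_points(1000, 10))  # roughly 3e23 points for a 1,000-state map
```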
However, careful consideration of some real-world problems suggests a possible approach for finding approximate value functions. If we examine the beliefs that a navigating mobile robot encounters, we see that they share common attributes: they typically have a very small number of modes, and the particular shape of the modes is fairly generic. The modes move about and change in variance, but the ways in which they change are relatively constrained. In fact, even for real-world navigation problems with very large belief spaces, the beliefs have very few degrees of freedom.
Figure 4(a) illustrates this idea: it shows a typical belief that a mobile robot might experience while navigating in the nursing home environment of Figure 1(b). To visualize the distribution we sample a set of poses (also called particles) according to the distribution and plot the particles on the map. The distribution is unimodal and the probability mass is mostly concentrated in a small area. Figure 4(b) shows a very different kind of belief: probability mass is spread over a wide area, there are multiple modes, and the locations of particles bear little relationship to the map. It would be difficult to find a sequence of actions and observations that would result in such a belief.
Figure 4: (a) A common belief. (b) An unlikely belief.
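The particle plots of Figure 4 can be produced with a few lines of the following flavour; the function and argument names here are hypothetical, and the map-dependent plotting is only indicated in comments:

```python
import numpy as np

def sample_particles(belief, poses, num_particles=500, rng=None):
    """Draw poses (particles) according to a discrete belief over map states."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(belief), size=num_particles, p=belief)
    return poses[idx]   # e.g. an array of (x, y) positions, one row per particle

# particles = sample_particles(b, map_cell_centers)
# plt.scatter(particles[:, 0], particles[:, 1], s=2)   # overlay on the map image
```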
If real-world beliefs have few degrees of freedom, then they should be concentrated near a low-dimensional subset of the high-dimensional belief space--that is, the beliefs experienced by the controller should lie near a structured, low-dimensional surface embedded in the belief space. If we can find this surface, we will have a representation of the belief state in terms of a small set of bases or features. One benefit of such a representation is that we will need to plan only in terms of the small set of features: finding value functions in low-dimensional spaces is typically easier than finding value functions in high-dimensional spaces.
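Purely as an illustration of what ``a small set of bases or features'' could look like (the choice of dimensionality-reduction technique is left open here), one could collect the beliefs the controller actually experiences and fit linear bases to them with PCA, summarizing each belief by a few coefficients:

```python
import numpy as np

def fit_belief_bases(beliefs, num_features):
    """PCA over a collection of beliefs (one belief per row)."""
    B = np.asarray(beliefs)                  # shape: (num_beliefs, num_states)
    mean = B.mean(axis=0)
    _, _, Vt = np.linalg.svd(B - mean, full_matrices=False)
    return mean, Vt[:num_features]           # bases: (num_features, num_states)

def compress(belief, mean, bases):
    """Low-dimensional feature vector describing a belief."""
    return bases @ (belief - mean)

def reconstruct(features, mean, bases):
    """Approximate belief recovered from its feature vector."""
    return mean + bases.T @ features
```

A purely linear compression like this can reconstruct vectors with negative entries, which is one reason a nonlinear variant of the idea may be preferable; the nonlinear case is exactly where the second disadvantage discussed next arises.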
There are two potential disadvantages to this sort of representation. The first is that it contains an approximation: we are no longer finding the complete, optimal POMDP policy. Instead (as suggested in Figure 5) we are trying to find representations of the belief which are rich enough to allow good control but which are also sufficiently parsimonious to make the planning problem tractable. The second disadvantage is a technical one: because we are making a nonlinear transformation of the belief space, POMDP planning algorithms which assume a convex value function will no longer work. We discuss this problem in more detail in Section 6.
Figure 5.