
POMDPs

A partially observable Markov decision process (POMDP) is a sequential decision model for an agent acting in a stochastic environment with only partial knowledge of its state. The set of possible states of the environment is referred to as the state space and is denoted by $\mathcal{S}$. At each point in time, the environment is in one of the possible states. The agent does not directly observe the state; rather, it receives an observation about it. We denote the set of all possible observations by $\mathcal{Z}$. After receiving the observation, the agent chooses an action from a set $\mathcal{A}$ of possible actions and executes it. Thereafter, the agent receives an immediate reward and the environment evolves stochastically into a next state.
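To make this interaction protocol concrete, here is a minimal simulation loop in Python. It is only a sketch: the objects `pomdp` and `policy` and their methods are hypothetical placeholders, not part of the formal model defined below.

    def simulate(pomdp, policy, horizon):
        """One episode of the POMDP interaction loop described above.

        `pomdp` is assumed to expose sample_initial_state(),
        sample_next_state(s, a), sample_observation(s_next, a), and
        reward(s, a); `policy` maps the agent's observation history to
        an action. All of these names are hypothetical.
        """
        s = pomdp.sample_initial_state()   # hidden from the agent
        history, total_reward = [], 0.0
        for _ in range(horizon):
            a = policy(history)                 # decide from observations seen so far
            total_reward += pomdp.reward(s, a)  # immediate reward r(s,a)
            s = pomdp.sample_next_state(s, a)   # s' drawn from P(s'|s,a)
            z = pomdp.sample_observation(s, a)  # z drawn from P(z|s',a)
            history.append((a, z))              # the state s itself is never revealed
        return total_reward

Note that the policy sees only the action-observation history; the state variable `s` lives entirely inside the simulator, which is exactly what distinguishes a POMDP from a fully observable MDP.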

Mathematically, a POMDP is specified by: the three sets $\mathcal{S}$, $\mathcal{Z}$, and $\mathcal{A}$; a reward function $r(s,a)$; a transition probability function $P(s'|s,a)$; and an observation probability function $P(z|s',a)$. The reward function characterizes the dependency of the immediate reward on the current state $s$ and the current action $a$. The transition probability characterizes the dependency of the next state $s'$ on the current state $s$ and the current action $a$. The observation probability characterizes the dependency of the observation $z$ at the next time point on the next state $s'$ and the current action $a$.
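As an illustration, these components can be written down for the classic tiger problem (an agent facing two doors with a tiger behind one of them). This is a sketch using the standard rewards and noise level from the POMDP literature, not an example taken from this text.

    # Tiger-problem POMDP as plain Python functions; a sketch only.
    S = ["tiger-left", "tiger-right"]          # state space S
    A = ["listen", "open-left", "open-right"]  # action set A
    Z = ["hear-left", "hear-right"]            # observation set Z

    def r(s, a):                               # reward function r(s,a)
        if a == "listen":
            return -1.0                        # small cost for listening
        opened = "tiger-left" if a == "open-left" else "tiger-right"
        return -100.0 if s == opened else 10.0

    def P_trans(s_next, s, a):                 # transition probability P(s'|s,a)
        if a == "listen":                      # listening leaves the state unchanged
            return 1.0 if s_next == s else 0.0
        return 0.5                             # opening a door resets the problem

    def P_obs(z, s_next, a):                   # observation probability P(z|s',a)
        if a != "listen":
            return 0.5                         # opening a door yields no information
        correct = "hear-left" if s_next == "tiger-left" else "hear-right"
        return 0.85 if z == correct else 0.15  # listening is right 85% of the time

Listening trades its small cost for information about where the tiger is; weighing such information-gathering actions against immediately rewarding ones is precisely the trade-off that value iteration for POMDPs must resolve.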


