A partially observable Markov decision process (POMDP) is a sequential decision model for an agent acting in a stochastic environment with only partial knowledge of the environment's state. The set of possible states of the environment is referred to as the state space and is denoted by S. At each point in time, the environment is in one of the possible states. The agent does not directly observe the state; rather, it receives an observation about it. We denote the set of all possible observations by Z. After receiving the observation, the agent chooses an action from a set of possible actions A and executes that action. Thereafter, the agent receives an immediate reward and the environment evolves stochastically into a next state.
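This interaction can be summarized as a simple loop. The following is a minimal sketch in Python; the env and agent objects and their methods (sample_initial_state, sample_observation, choose_action, reward, sample_next_state) are hypothetical stand-ins for the model components defined formally below, not names from any particular library.

```python
def run_episode(env, agent, horizon):
    """Simulate one agent-environment interaction under partial observability."""
    state = env.sample_initial_state()  # the true state; hidden from the agent
    action = None                       # no action has been taken before t = 0
    total_reward = 0.0
    for t in range(horizon):
        # The agent never sees `state`, only an observation generated from it.
        z = env.sample_observation(state, action)
        action = agent.choose_action(z)
        total_reward += env.reward(state, action)
        # The environment evolves stochastically into a next state.
        state = env.sample_next_state(state, action)
    return total_reward
```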
Mathematically, a POMDP is specified by: the three sets S, A, and Z; a reward function r(s, a); a transition probability function P(s'|s, a); and an observation probability function P(z|s', a). The reward function characterizes the dependency of the immediate reward on the current state s and the current action a. The transition probability characterizes the dependency of the next state s' on the current state s and the current action a. The observation probability characterizes the dependency of the observation z at the next time point on the next state s' and the current action a.
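As a concrete illustration of these components, the sketch below specifies the classic tiger problem (Kaelbling, Littman, and Cassandra, 1998): a tiger hides behind one of two doors, and the agent may listen (at a small cost, receiving a noisy observation) or open a door. The problem and its numerical values come from that literature, not from the text above, and the function names r, P_trans, and P_obs are chosen here for illustration.

```python
S = ["tiger-left", "tiger-right"]          # state space S
A = ["listen", "open-left", "open-right"]  # action set A
Z = ["hear-left", "hear-right"]            # observation set Z

def r(s, a):
    """Immediate reward r(s, a)."""
    if a == "listen":
        return -1.0                        # small cost for listening
    door = "tiger-left" if a == "open-left" else "tiger-right"
    return -100.0 if s == door else 10.0   # penalty for finding the tiger

def P_trans(s_next, s, a):
    """Transition probability P(s' | s, a)."""
    if a == "listen":                      # listening leaves the state unchanged
        return 1.0 if s_next == s else 0.0
    return 0.5                             # opening a door resets the tiger's position

def P_obs(z, s_next, a):
    """Observation probability P(z | s', a)."""
    if a == "listen":                      # listening is correct 85% of the time
        correct = "hear-left" if s_next == "tiger-left" else "hear-right"
        return 0.85 if z == correct else 0.15
    return 0.5                             # after opening, observations carry no information
```

Note how P_obs takes the next state s' and the current action a as arguments, matching the general definition of the observation probability above.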