Formally, a policy is a mapping from a state space S to an action space A such that the agent using that policy executes action a whenever in state s. At the coarsest level, when in state s, an agent compares the expected long-term rewards for taking each available action and chooses an action based on these expected rewards. These expected rewards are learned through experience.
Designed to work in real-world domains with far too many states to handle individually, TPOT-RL constructs a smaller feature space V using action-dependent feature functions. The expected reward is then computed based on the state's corresponding entry in feature space.
In short, the policy's mapping from S to A in TPOT-RL can be thought of as a 3-step process:

1. State generalization: the current state s is mapped into the smaller feature space V using the action-dependent feature functions.
2. Value function learning: the expected long-term reward for each available action is looked up, and updated through experience, based on the state's corresponding entry in feature space.
3. Action selection: an action is chosen on the basis of these expected rewards.
While these steps are common in other RL paradigms, each step has unique characteristics in TPOT-RL.
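To make the 3-step structure concrete, the following is a minimal sketch, not the authors' implementation, of a policy that generalizes states with an action-dependent feature function, stores expected rewards in a table indexed by feature-space entries, and selects the action with the highest expected reward. The names (feature_fn, q_table) and the simple running-average update are illustrative assumptions; TPOT-RL's actual reward accumulation and action-selection details differ.

```python
import random
from collections import defaultdict

class TPOTRLPolicySketch:
    """Illustrative 3-step policy: state generalization, value lookup, action selection."""

    def __init__(self, actions, feature_fn, learning_rate=0.1):
        self.actions = actions          # action space A
        self.feature_fn = feature_fn    # hypothetical action-dependent feature function: (state, action) -> V
        self.lr = learning_rate
        # Expected long-term reward for each (feature-space entry, action) pair.
        self.q_table = defaultdict(float)

    def select_action(self, state):
        # Step 1: generalize the state via the action-dependent feature function.
        # Step 3: choose an action based on the expected rewards.
        scored = [(self.q_table[(self.feature_fn(state, a), a)], a)
                  for a in self.actions]
        best_value = max(value for value, _ in scored)
        best_actions = [a for value, a in scored if value == best_value]
        return random.choice(best_actions)  # break ties randomly

    def update(self, state, action, observed_reward):
        # Step 2: learn expected rewards from experience by nudging the
        # table entry toward the observed long-term reward (assumed update rule).
        key = (self.feature_fn(state, action), action)
        self.q_table[key] += self.lr * (observed_reward - self.q_table[key])
```

In this sketch the table is indexed by the feature-space entry rather than the raw state, reflecting the aggressive state generalization described above; how the feature function is constructed and how rewards are propagated are where TPOT-RL's unique characteristics lie.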