Formally, a policy is a mapping from a state space S to an action space A such that an agent using that policy, whenever it is in state s, executes the action a to which s maps. At the coarsest level, when in state s, the agent compares the expected long-term rewards for taking each available action and chooses an action based on these expected rewards. These expected rewards are learned through experience.
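As a rough illustration of this coarse view, the sketch below represents the learned expected rewards as a toy lookup table and implements the policy as a greedy choice over them; the state and action names and the reward values are invented for the example, not taken from TPOT-RL.

```python
from typing import Dict, Tuple

State = str
Action = str

# Expected long-term reward for each (state, action) pair,
# assumed here to have already been learned from experience.
expected_reward: Dict[Tuple[State, Action], float] = {
    ("s0", "pass_left"): 0.3,
    ("s0", "pass_right"): 0.7,
    ("s1", "pass_left"): 0.5,
    ("s1", "pass_right"): 0.2,
}

ACTIONS = ["pass_left", "pass_right"]

def policy(s: State) -> Action:
    """Greedy policy: in state s, pick the action with the highest expected reward."""
    return max(ACTIONS, key=lambda a: expected_reward[(s, a)])

print(policy("s0"))  # -> "pass_right"
print(policy("s1"))  # -> "pass_left"
```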
Designed to work in real-world domains with far too many states to handle individually, TPOT-RL constructs a smaller feature space V using action-dependent feature functions. The expected reward for each action is then computed from the state's corresponding entry in the feature space.
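The sketch below illustrates one way such action-dependent generalization could look: a hypothetical rating function e scores each available action, the tuple of scores serves as the state's entry in a small feature space V, and the table of expected rewards is indexed by that entry rather than by the raw state. The function names, the pass actions, and the state encoding are assumptions made for the example.

```python
from typing import Dict, Tuple
from collections import defaultdict

ACTIONS = ["pass_left", "pass_right"]

def e(s: dict, a: str) -> int:
    """Action-dependent feature: 1 if action a currently looks open in state s, else 0.
    (Hypothetical rating; a real feature function would be domain-specific.)"""
    return 1 if s.get(a + "_open", False) else 0

def f(s: dict) -> Tuple[int, ...]:
    """State generalization f: S -> V, one coarse feature per available action."""
    return tuple(e(s, a) for a in ACTIONS)

# Expected rewards are stored per (feature-space entry, action), not per raw state.
Q: Dict[Tuple[Tuple[int, ...], str], float] = defaultdict(float)

s = {"pass_left_open": True, "pass_right_open": False}  # one raw world state
v = f(s)                                                # its entry in feature space V
Q[(v, "pass_left")] = 0.6                               # value learned from experience
print(v, Q[(v, "pass_left")])                           # -> (1, 0) 0.6
```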
In short, the policy's mapping from S to A in TPOT-RL can be thought of as a 3-step process:
1. State generalization: the current state s is mapped to its corresponding entry in the feature space V.
2. Value function learning: the expected long-term reward of each available action is stored and updated at that feature-space entry, based on experience.
3. Action selection: an action is chosen based on these expected rewards and executed.
While these steps are common in other RL paradigms, each step has unique characteristics in TPOT-RL.
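Under the same illustrative assumptions as the sketches above, the three steps compose into a single mapping from raw states to actions; make_policy and its arguments are hypothetical stand-ins for exposition, not TPOT-RL's actual interface.

```python
from typing import Callable, Dict, Hashable, List, Tuple

def make_policy(f: Callable[[dict], Hashable],
                Q: Dict[Tuple[Hashable, str], float],
                actions: List[str]) -> Callable[[dict], str]:
    """Compose the three steps into a single state-to-action mapping."""
    def policy(s: dict) -> str:
        v = f(s)                                            # 1. state generalization
        rewards = {a: Q.get((v, a), 0.0) for a in actions}  # 2. expected rewards in V
        return max(rewards, key=rewards.get)                # 3. action selection
    return policy
```

With the feature function and table from the previous sketch, make_policy(f, Q, ACTIONS) returns a policy that can be called directly on raw states.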