TPOT-RL's state generalization function relies on a unique approach to constructing V. Rather than discretizing the various dimensions of S, it uses action-dependent features. In particular, each possible action is evaluated locally based on the current state of the world using a fixed function e : S × A → U. Unlike Q, e does not produce the expected long-term reward of taking an action; rather, it classifies the likely short-term effects of the action. For example, if actions sometimes succeed and sometimes fail to achieve their intended effects, e could indicate something of the following form: if selected, action a is (or is not) likely to produce its intended effects.
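As an illustration, a minimal sketch of such an action-dependent feature function e, assuming a two-class effect set U = {SUCCESS, FAILURE} and a hypothetical state representation (a map from candidate actions to locally estimated success likelihoods); the paper itself leaves e unspecified:

```python
from enum import Enum

# Hypothetical short-term effect classes U (assumed, not from the paper).
class Effect(Enum):
    SUCCESS = 0   # action is likely to achieve its intended effect
    FAILURE = 1   # action is likely to fail

def e(state, action):
    """A stand-in for e : S x A -> U.

    Classifies the likely short-term effect of `action` in `state`.
    Here `state` is a placeholder dict mapping each candidate action to a
    locally computed success estimate; any domain-specific evaluation
    could be substituted.
    """
    likely = state.get(action, 0.0) >= 0.5
    return Effect.SUCCESS if likely else Effect.FAILURE

# Each candidate action is evaluated locally in the current state.
state = {"pass_left": 0.8, "pass_right": 0.3}
print(e(state, "pass_left"))   # Effect.SUCCESS
print(e(state, "pass_right"))  # Effect.FAILURE
```

Note that e returns a class label rather than a scalar value estimate, which is exactly what distinguishes it from Q.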
In the multi-agent scenario, in addition to one output of e for each action, the feature space V also includes one coarse component that partitions the state space S among the agents. If the size of the team is m, then the partition function is P : S → M with |M| = m. In particular, if the set of possible actions is A = {a_1, a_2, ..., a_n}, then the generalized state is f(s) = ⟨e(s, a_1), e(s, a_2), ..., e(s, a_n), P(s)⟩.
Thus, V = U^|A| × M, and |V| = |U|^|A| · m. Since TPOT-RL has no control over |A| or m, and since the goal of constructing V is to have a small feature space over which to learn, TPOT-RL will be more effective for small sets U.
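A toy instantiation of this construction, assuming |U| = 2 effect classes, |A| = 3 actions, and a team of m = 4 agents; the particular e and P used here are illustrative placeholders:

```python
# Illustrative, assumed instantiation: U, A, m, e, and P are placeholders.
U = ("succeed", "fail")       # short-term effect classes
A = ("a1", "a2", "a3")        # possible actions
m = 4                         # team size, |M| = m

def e(state, action):
    # Placeholder local evaluation keyed on string lengths (illustrative only).
    return U[(len(state) + len(action)) % len(U)]

def P(state):
    # Coarse partition of the state space S among the m agents (placeholder).
    return len(state) % m

def f(state):
    """State generalization f : S -> V = U^|A| x M."""
    return tuple(e(state, a) for a in A) + (P(state),)

# |V| = |U|^|A| * m
print(len(U) ** len(A) * m)   # 2^3 * 4 = 32
print(f("some-state"))        # ('succeed', 'succeed', 'succeed', 2)
```

Even in this tiny example, |V| grows exponentially in |A|, which is why small sets U matter.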
This state generalization process reduces the complexity of the learning task by constructing a small feature space V which partitions S into m regions. Each agent need only learn how to act within its own partition. Nevertheless, for large sets A, the feature space can still be too large for learning, especially with limited training examples. Our particular action-dependent formulation allows us to reduce the effective size of the feature space in the value-function-learning step. Choosing features for state generalization is generally a hard problem. While TPOT-RL does not specify the function e, our work uses a previously-learned dynamic feature function.