TPOT-RL's state generalization function $f : S \rightarrow V$ relies on a
unique approach to constructing V. Rather than discretizing the
various dimensions of S, it uses action-dependent features.
In particular, each possible action $a_i \in A$
is evaluated locally based
on the current state of the world using a fixed function
$e : S \times A \rightarrow U$. Unlike Q, e does not
produce the expected long-term reward of taking an action; rather, it
classifies the likely short-term effects of the action. For example,
if actions sometimes succeed and sometimes fail to achieve their
intended effects, e could indicate something of the following form:
if selected, action $a_i$
is (or is not) likely to produce its
intended effects.
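To make the role of e concrete, the following is a minimal Python sketch of an action-dependent feature function over a toy two-class set U = {likely success, likely failure}. The State and Action types, the distance-to-nearest-opponent test, and the threshold are illustrative assumptions only; they are not the previously-learned feature function described at the end of this section.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

class Effect(Enum):
    """U: the coarse output set of e (here |U| = 2)."""
    LIKELY_SUCCESS = 0
    LIKELY_FAILURE = 1

@dataclass
class State:
    """Toy world state: 2-D positions of opponents (hypothetical)."""
    opponent_positions: List[Tuple[float, float]]

@dataclass
class Action:
    """Toy action identified by the 2-D point it is aimed at (hypothetical)."""
    target: Tuple[float, float]

def e(s: State, a: Action, threshold: float = 10.0) -> Effect:
    """Action-dependent feature function e : S x A -> U.

    Classifies the likely short-term effect of taking action `a` in state `s`
    with a purely local test: is the nearest opponent to the action's target
    farther away than `threshold`?  This heuristic is only a stand-in for a
    learned or hand-coded evaluator.
    """
    nearest = min(
        ((ox - a.target[0]) ** 2 + (oy - a.target[1]) ** 2) ** 0.5
        for ox, oy in s.opponent_positions
    )
    return Effect.LIKELY_SUCCESS if nearest > threshold else Effect.LIKELY_FAILURE
```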
In the multi-agent scenario, in addition to one output of e for each
action, the feature space V also involves one coarse component that
partitions the state space S among the agents. If the size of the
team is m, then the partition function is $P : S \rightarrow M$ with |M|
= m. In particular, if the set of possible actions is
$A = \{a_0, a_1, \ldots, a_{n-1}\}$, then
$$f(s) = \langle e(s, a_0),\, e(s, a_1),\, \ldots,\, e(s, a_{n-1}),\, P(s) \rangle.$$
Thus, $V = U^{|A|} \times M$. Since TPOT-RL has no control over |A| or m,
and since the goal of constructing V is to have a small feature
space over which to learn, TPOT-RL will be more effective for small
sets U.
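Continuing the toy sketch above, the feature vector f(s) and the size of V could be computed as follows; the partition function P below (hashing the state into one of m regions) is a hypothetical placeholder for the actual team partition.

```python
def P(s: State, m: int = 11) -> int:
    """Coarse partition P : S -> M with |M| = m.

    Hypothetical placeholder: hash the opponent configuration into one of m
    regions.  In practice, P assigns each state to the partition of the agent
    responsible for acting in it.
    """
    return hash(tuple(s.opponent_positions)) % m

def f(s: State, actions: List[Action], m: int = 11):
    """State generalization f : S -> V = U^|A| x M."""
    return tuple(e(s, a) for a in actions) + (P(s, m),)

def feature_space_size(num_actions: int, m: int = 11) -> int:
    """|V| = |U|^|A| * m."""
    return len(Effect) ** num_actions * m

if __name__ == "__main__":
    s = State(opponent_positions=[(5.0, 3.0), (40.0, 22.0)])
    actions = [Action(target=(30.0, 10.0)), Action(target=(6.0, 2.0))]
    print(f(s, actions))                      # (e(s,a_0), e(s,a_1), P(s))
    print(feature_space_size(len(actions)))   # 2**2 * 11 = 44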
This state generalization process reduces the complexity of the learning task by constructing a small feature space V which partitions S into m regions. Each agent need learn how to act only within its own partition. Nevertheless, for large sets A, the feature space can still be too large for learning, especially with limited training examples. Our particular action-dependent formulation allows us to reduce the effective size of the feature space in the value-function-learning step. Choosing features for state generalization is generally a hard problem. While TPOT-RL does not specify the function e, our work uses a previously-learned dynamic feature function.
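One way such a reduction could work in the value-function-learning step is to condition the learned value of action $a_i$ only on that action's own feature $e(s, a_i)$ and on $P(s)$, rather than on the entire vector f(s), so that the number of values to learn drops from $m \cdot |U|^{|A|} \cdot |A|$ to $m \cdot |U| \cdot |A|$. The sketch below, continuing the toy example, indexes a Q-table this way; it is only an illustrative assumption here, not the full formulation given later.

```python
from collections import defaultdict

# Q-table indexed by (P(s), e(s, a_i), i) rather than by the full vector f(s).
# Under this illustrative assumption the number of learned values is
# m * |U| * |A| instead of m * |U|^|A| * |A|.
Q = defaultdict(float)

def q_index(s: State, actions: List[Action], i: int, m: int = 11):
    """Reduced index: keep only the partition and action a_i's own feature."""
    v = f(s, actions, m)          # v = (e(s,a_0), ..., e(s,a_{n-1}), P(s))
    return (v[-1], v[i], i)

def greedy_action(s: State, actions: List[Action], m: int = 11) -> int:
    """Pick the action whose (partition, feature, action) entry has highest Q."""
    return max(range(len(actions)), key=lambda i: Q[q_index(s, actions, i, m)])
```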