In the soccer example, we applied TPOT-RL to enable each teammate to
simultaneously learn a high-level action policy. The policy is a
function that determines what an agent should do when it has
possession of the ball. The input of the policy is the agent's perception of the
current world state; the output is a target destination for the ball
in terms of a location on the field, e.g. the opponent's goal. In our
experiment, each agent has 8 possible actions as illustrated in
Figure 1(a). Since a player may not be able to tell the
results of other players' actions, or even when they can act, the
domain is opaque-transition.
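For concreteness, the interface being learned can be sketched as follows. This is an illustrative Python sketch only; the type names (WorldState, BallTarget) and the function name are mine, not the paper's.

```python
from typing import Any, Tuple

# Illustrative types (not from the paper): the input is the agent's possibly
# partial perception of the world, and an action is a target destination for
# the ball given as an (x, y) location on the field, e.g. the opponent's goal.
WorldState = Any
BallTarget = Tuple[float, float]

NUM_ACTIONS = 8  # the 8 high-level actions of Figure 1(a)

def high_level_policy(perceived: WorldState) -> BallTarget:
    """The policy each agent learns: when in possession of the ball,
    choose one of the 8 candidate ball destinations."""
    raise NotImplementedError  # filled in by the TPOT-RL sketches below
```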
A team formation is divided into 11 positions (m=11), as also shown
in Figure 1(a) [16]. Thus, the partition
function returns the player's position. Following our layered
learning approach, we use the previously trained DT as e. Each
possible pass is classified as either a likely success or a likely
failure with a confidence factor. Outputs of the DT could be
clustered based on the confidence factors. In our experiments, we
cluster into only two sets indicating success and failure. Therefore
|U| = 2 and so |V| = |U|^|A| × m = 2^8 × 11 = 2816. Even though each agent only gets about 10 training examples per
10-minute game and the reward function shifts as teammate policies
improve, the learning task becomes feasible.
This feature space is immensely smaller than the original state space.
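As a rough sketch of the feature construction just described, the following assumes the feature vector combines the DT's success/failure classification of each of the 8 actions with the player's position; the decision-tree interface and the confidence threshold shown are hypothetical, chosen only for illustration.

```python
from typing import Any, Callable, Tuple

LIKELY_SUCCESS, LIKELY_FAILURE = 1, 0   # |U| = 2 after clustering DT confidences
NUM_ACTIONS = 8                         # candidate ball destinations
NUM_POSITIONS = 11                      # m = 11 positions in the formation

# Hypothetical DT interface: returns a signed confidence that the pass
# corresponding to `action` would succeed from the current state.
DecisionTree = Callable[[Any, int], float]

def action_feature(state: Any, action: int, dt: DecisionTree) -> int:
    """e(s, a): cluster the DT's confidence into likely success or failure."""
    return LIKELY_SUCCESS if dt(state, action) >= 0.0 else LIKELY_FAILURE

def feature_vector(state: Any, position: int, dt: DecisionTree) -> Tuple[int, ...]:
    """f(s): the DT feature for every action plus the player's position,
    giving |V| = 2**8 * 11 = 2816 possible values."""
    return tuple(action_feature(state, a, dt)
                 for a in range(NUM_ACTIONS)) + (position,)
```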
Since e indicates the likely success or failure of each possible
action, at action-selection time, we only consider the actions that
are likely to succeed (|W|=1). Therefore, each player learns 8
Q-values, with a total of 88 learned by the team as a whole. Even
with sparse training and shifting concepts, such a learning task is
tractable.
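A minimal sketch of the resulting action selection and per-player Q-table, under the same assumptions as above; the fallback used when no action looks likely to succeed is my own choice, not something specified here.

```python
from typing import Any, Callable, List

NUM_ACTIONS = 8

def select_action(state: Any,
                  q_values: List[float],
                  likely_success: Callable[[Any, int], bool]) -> int:
    """Restrict attention to actions that e classifies as likely to succeed
    (|W| = 1), then pick the one with the highest learned Q-value."""
    candidates = [a for a in range(NUM_ACTIONS) if likely_success(state, a)]
    if not candidates:
        candidates = list(range(NUM_ACTIONS))  # assumption: fall back to all actions
    return max(candidates, key=lambda a: q_values[a])

# One Q-value per action for this player's fixed position: 8 values per
# player, 8 * 11 = 88 learned by the team as a whole.
q_values = [0.0] * NUM_ACTIONS
```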