Informative action-dependent features can be used to reduce the free
variables in the learning task still further at the action-selection
stage if the features themselves discriminate situations in which
actions should not be used. For example, if whenever e(s, a) = u for some particular feature value u ∈ U, action a is not likely to achieve its expected reward, then the agent can decide to ignore actions with e(s, a) = u.
Formally, consider a set W ⊆ U of admissible feature values. When in state s, the agent then chooses an action from {a ∈ A | e(s, a) ∈ W}, either randomly when exploring or according to maximum Q-value when exploiting. Any exploration strategy, such as Boltzmann exploration, can be used over the possible actions in {a ∈ A | e(s, a) ∈ W}. In effect, W acts in TPOT-RL as an action filter which reduces the number of options under consideration at any given time. Of course, exploration at the filter level can be achieved by dynamically adjusting W.
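To make the filtering step concrete, the following Python sketch shows one way an agent could select an action through a W filter combined with Boltzmann exploration. The function name select_action, the callable e, and the representation of the Q-table as a dictionary keyed by (e(s, a), a) are illustrative assumptions, not the implementation used in the paper.

import math
import random

def select_action(s, A, e, W, Q, temperature=0.4, exploit=False):
    """Choose an action from {a in A | e(s, a) in W} (illustrative sketch).

    e(s, a) is the action-dependent feature function, W the set of admissible
    feature values, and Q a dict keyed by (e(s, a), a) -- a simplification of
    the learned value table.
    """
    # Apply the W filter: keep only actions whose feature value lies in W.
    candidates = [a for a in A if e(s, a) in W]
    if not candidates:
        # Rare case: no action passes the filter; fall back to a random action.
        return random.choice(A)
    if exploit:
        # Exploit: choose the surviving action with the maximum Q-value.
        return max(candidates, key=lambda a: Q[(e(s, a), a)])
    # Explore: Boltzmann (softmax) distribution over the filtered actions.
    weights = [math.exp(Q[(e(s, a), a)] / temperature) for a in candidates]
    return random.choices(candidates, weights=weights)[0]

Dynamically adjusting W, as noted above, simply changes which actions survive the filtering step before exploration is applied.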
Table 2: The resulting Q-tables when (a) W = U, and (b) |W| = 1.
For example, Table 2 illustrates the effect of
varying |W|. In the rare event that no action a ∈ A has e(s, a) ∈ W, i.e.
{a ∈ A | e(s, a) ∈ W} = ∅, either a
random action can be chosen, or rough Q-value estimates can be stored
using sparse training data. This condition becomes rarer as |A|
increases. For example, with |U| = 3, |W| = 1, |A| = 2 as in
Table 2(b), 4/9 = 44.4% of feature vectors
have no action that passes the W filter. However, with |A| = 8
only 256/6561 = 3.9% of feature vectors have no action that passes
the W filter. If |W| = 2 and |A| = 8, only 1 of 6561 feature
vectors fails to pass the filter. Thus using W to filter action
selection can reduce the number of free variables in the learning
problem without significantly reducing the coverage of the learned
Q-table.
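The fractions quoted above follow from a simple count: each of the |A| action-dependent feature components takes one of |U| values, and a feature vector blocks every action exactly when all |A| components fall outside W, which happens for ((|U| - |W|)/|U|)^|A| of the vectors. A short Python check (the helper name is illustrative):

def fraction_blocked(u, w, a):
    # Fraction of action-dependent feature vectors in which no action
    # passes the W filter: (|U| - |W|)^|A| out of |U|^|A| vectors.
    return (u - w) ** a / u ** a

print(fraction_blocked(3, 1, 2))  # 4/9      ~ 44.4%
print(fraction_blocked(3, 1, 8))  # 256/6561 ~ 3.9%
print(fraction_blocked(3, 2, 8))  # 1/6561   ~ 0.015%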
By using action-dependent features to create a coarse feature space, together with a reward function based entirely on each agent's individual observation of the environment, TPOT-RL enables team learning in a multi-agent, adversarial environment even when agents cannot track state transitions.