As we have seen, TPOT-RL uses action-dependent features. Therefore,
we can assume that the expected long-term reward for taking action
$a_i$ depends only on the feature value related to action $a_i$, namely
$e(s, a_i)$. That is, $Q(f(s), a_i) = Q(f(s'), a_i)$ whenever
$e(s, a_i) = e(s', a_i)$ and $s$ and $s'$ map to the same position $p$. In
other words, if $f(s) = \langle u_1, \ldots, u_{|A|}, p \rangle$ with
$u_j = e(s, a_j)$, then $Q(f(s), a_i)$ depends entirely upon $u_i$ and $p$
and is independent of $u_j$ for all $j \neq i$.
Without this assumption, since there are $|U|^{|A|} \cdot |P|$ possible
feature vectors with |A| actions possible from each, the value function Q
has $|A| \cdot |U|^{|A|} \cdot |P|$ independent values. Under this assumption, however, the
Q-table has at most $|U| \cdot |A| \cdot |P|$ entries: for each action possible from
each position, there is only one relevant feature value. Therefore,
even with only a small number of training examples available, we can
treat the value function Q as a lookup-table without the need for
any complex function approximation. To be precise, Q stores one
value for every possible combination of action a,
feature value $u \in U$, and position $p \in P$.
For example, Table 1 shows the entire feature space for
one agent's partition of the state space when |U| = 3 and |A| = 2.
There are $|U|^{|A|} = 9$ different entries in feature space with 2
Q-values for each entry: one for each possible action.
$|V| = |U|^{|A|} \cdot |P|$ is much smaller than the original state space for any realistic
problem, but it can grow large quickly, particularly as |A|
increases. However, notice in Table 1 that, under the
assumption described above, there are only $|U| \cdot |A| = 6$ independent Q-values
to learn, reducing the number of free variables in the learning
problem by 67% in this case.
Table 1: A sample Q-table for a single agent
when $|U| = 3$ and $|A| = 2$: $U = \{u_0, u_1, u_2\}$, $A = \{a_0, a_1\}$.
$Q(v, a_i)$ is the estimated value of taking action $a_i$ when
$e(s, a_i)$ has the indicated value. Since this table is for a single agent,
the position $p$ remains constant.
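To make the bookkeeping concrete, the following is a minimal sketch (not code from the paper; all names and values are illustrative) of the lookup table described above for $|U| = 3$, $|A| = 2$, and a single position: indexing Q only by (position, action, relevant feature value) leaves 6 independent entries, instead of the $|A| \cdot |U|^{|A|} = 18$ values a full table over feature vectors would require.

```python
from itertools import product

U = ["u0", "u1", "u2"]   # possible values of the action-dependent feature
A = ["a0", "a1"]         # actions available to the agent
P = ["p0"]               # a single agent's position (one partition)

# Without the assumption: one value per (feature vector, action) pair.
full_entries = len(A) * (len(U) ** len(A)) * len(P)   # 2 * 3^2 * 1 = 18

# With the assumption: one value per (position, action, relevant feature value).
Q = {(p, a, u): 0.0 for p, a, u in product(P, A, U)}  # 1 * 2 * 3 = 6 entries

print(full_entries, len(Q))   # 18 6
```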
The Q-values learned depend on the agent's past experiences in
the domain. In particular, after taking an action a while in state
s with $f(s) = v$, an agent receives reward r and uses it to
update $Q(v, a)$ as follows:

$$Q(v, a) \leftarrow Q(v, a) + \alpha \left( r - Q(v, a) \right)$$
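A minimal sketch of this update in code, reusing the dictionary-style table from the previous sketch; the learning rate value and all variable names are illustrative assumptions, not settings from the paper.

```python
ALPHA = 0.1  # learning rate alpha; illustrative value, domain-dependent in practice

# Q-table indexed by (position, action, feature value), as sketched above.
Q = {("p0", a, u): 0.0 for a in ("a0", "a1") for u in ("u0", "u1", "u2")}

def update_q(q_table, p, a, u, reward):
    """Move Q(v, a) toward the observed reward r by a fraction ALPHA.

    Only the single relevant entry is touched: the agent's position p,
    the action a it took, and the feature value u = e(s, a) at the time
    the action was chosen.
    """
    old = q_table[(p, a, u)]
    q_table[(p, a, u)] = old + ALPHA * (reward - old)

# Example: in position p0 the agent took a0 when e(s, a0) = u1 and
# later received reward r = 0.5 from its reward function.
update_q(Q, "p0", "a0", "u1", 0.5)
print(Q[("p0", "a0", "u1")])   # 0.05
```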
Since the agent is not able to access its teammates' internal states,
future team transitions are completely opaque from the agent's
perspective. Thus it cannot use dynamic programming to update its
Q-table. Instead, the reward r comes directly from the observable
environmental characteristics--those that are captured in S--over
a maximum number $t_{\lim}$ of time steps after the action is taken.
The reward function $R : S^{t_{\lim}} \rightarrow \mathbb{R}$
returns a value at some time no further than $t_{\lim}$
in the future. During that time, other teammates or opponents can act
in the environment and affect the action's outcome, but the agent may
not be able to observe these actions. For practical purposes, it is
crucial that the reward function is only a function of the observable
world from the acting agent's perspective. In practice, the
range of R is $[-Q_{\max}, Q_{\max}]$, where $Q_{\max}$ is the reward for
immediate goal achievement.
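The reward computation itself is domain-dependent (see below), but the interface implied here can be sketched as follows; the window length, the bound, and the placeholder `score_fn` are assumptions for illustration, not the paper's reward function.

```python
Q_MAX = 1.0   # reward for immediate goal achievement (illustrative magnitude)
T_LIM = 30    # maximum number of time steps to wait for feedback (illustrative)

def reward(observed_states, score_fn):
    """Compute r from at most T_LIM of the acting agent's own observations
    made after it acted; anything teammates or opponents do unobserved
    never enters the computation. The result stays in [-Q_MAX, Q_MAX].
    """
    window = observed_states[:T_LIM]          # only what is seen within t_lim steps
    return max(-Q_MAX, min(Q_MAX, score_fn(window)))

# Example with a trivial, hypothetical score function that sums observed progress.
r = reward([{"progress": 0.2}, {"progress": 0.1}],
           lambda w: sum(s["progress"] for s in w))
print(round(r, 2))   # 0.3
```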
The reward function, including $Q_{\max}$ and $t_{\lim}$, is
domain-dependent. One possible type of reward function is based
entirely upon reaching the ultimate goal. In this case, an agent
charts the actual (long-term) results of its policy in the
environment. However, it is often the case that goal achievement is
very infrequent. In order to increase the feedback from actions
taken, it is useful to introduce an internal reinforcement function, which
provides feedback based on intermediate states towards the goal. We
use this internal reinforcement approach in our work.
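As a final illustration of the distinction (a hedged sketch only; the paper's actual reward functions are domain-specific), a goal-based reward returns nonzero feedback only when the ultimate goal is observed within the window, while an internal reinforcement function also credits observable intermediate progress.

```python
Q_MAX = 1.0   # as in the previous sketch

def goal_only_reward(observed_states):
    """Nonzero only if the ultimate goal itself was observed; feedback is rare."""
    return Q_MAX if any(s.get("goal_reached", False) for s in observed_states) else 0.0

def internal_reward(observed_states):
    """Internal reinforcement: also score intermediate states on the way to the
    goal (the 'progress' field is a placeholder measure, not the paper's)."""
    if any(s.get("goal_reached", False) for s in observed_states):
        return Q_MAX
    progress = sum(s.get("progress", 0.0) for s in observed_states)
    return max(-Q_MAX, min(Q_MAX, progress))
```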