
State Generalization

TPOT-RL's state generalization function $f: S \rightarrow V$ relies on a unique approach to constructing V. Rather than discretizing the various dimensions of S, it uses action-dependent features. In particular, each possible action $a_i \in A$ is evaluated locally based on the current state of the world using a fixed function $e: S \times A \rightarrow U$. Unlike Q, e does not produce the expected long-term reward of taking an action; rather, it classifies the likely short-term effects of the action. For example, if actions sometimes succeed and sometimes fail to achieve their intended effects, e could indicate something of the following form: if selected, action $a_i$ is (or is not) likely to produce its intended effects.
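To make this concrete, below is a minimal sketch of such an action-dependent feature function in a toy passing domain. All names here (PassAction, WorldState, open_lanes) are our own hypothetical illustrations, not part of TPOT-RL, which deliberately leaves e unspecified.

    from dataclasses import dataclass
    from enum import Enum

    class U(Enum):
        """Hypothetical two-valued output set for e."""
        LIKELY_SUCCESS = 0  # the action would probably achieve its intended effect
        LIKELY_FAILURE = 1  # the action would probably fail

    @dataclass(frozen=True)
    class PassAction:
        target: int               # index of the intended pass receiver

    @dataclass(frozen=True)
    class WorldState:
        open_lanes: frozenset     # receivers currently reachable by an open lane

    def e(s: WorldState, a: PassAction) -> U:
        """e : S x A -> U. Unlike Q, this returns no long-term reward
        estimate, only a fixed, locally computable classification of the
        action's likely short-term effect (here: pass completion)."""
        return U.LIKELY_SUCCESS if a.target in s.open_lanes else U.LIKELY_FAILURE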

In the multi-agent scenario, in addition to one output of e for each action, the feature space V also includes one coarse component that partitions the state space S among the agents. If the size of the team is m, then the partition function is $P: S \rightarrow M$ with |M| = m. In particular, if the set of possible actions is $A = \{a_0, a_1, \ldots, a_{n-1}\}$, then

$$f(s) \;=\; \bigl\langle\, e(s,a_0),\; e(s,a_1),\; \ldots,\; e(s,a_{n-1}),\; P(s) \,\bigr\rangle$$

Thus, $|V| = |U|^{|A|} \cdot m$. Since TPOT-RL has no control over |A| or m, and since the goal of constructing V is to have a small feature space over which to learn, TPOT-RL will be more effective for small sets U.
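Continuing the sketch above, f can be assembled directly from e and a partition function P. The partition rule shown is only a placeholder assumption; TPOT-RL requires only that P split S into m = |M| regions, one per teammate.

    def P(s: WorldState, m: int) -> int:
        """P : S -> M. Placeholder partition rule; any fixed assignment
        of states to one of m team regions would do."""
        return hash(s.open_lanes) % m

    def f(s: WorldState, actions: list, m: int) -> tuple:
        """f : S -> V, the vector <e(s,a_0), ..., e(s,a_{n-1}), P(s)>."""
        return tuple(e(s, a) for a in actions) + (P(s, m),)

    def feature_space_size(u_size: int, n_actions: int, m: int) -> int:
        """|V| = |U|^|A| * m."""
        return u_size ** n_actions * m

    # Usage with hypothetical numbers: |U| = 2, 8 actions, 11 teammates.
    s = WorldState(open_lanes=frozenset({1, 3}))
    v = f(s, [PassAction(target=i) for i in range(8)], m=11)
    assert len(v) == 8 + 1                       # one e output per action, plus P(s)
    assert feature_space_size(2, 8, 11) == 2816  # 2**8 * 11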

This state generalization process reduces the complexity of the learning task by constructing a small feature space V which partitions S into m regions. Each agent needs to learn how to act only within its own partition. Nevertheless, for large sets A, the feature space can still be too large for learning, especially with limited training examples. Our particular action-dependent formulation allows us to reduce the effective size of the feature space in the value-function-learning step. Choosing features for state generalization is generally a hard problem. While TPOT-RL does not specify the function e, our work uses a previously-learned dynamic feature function.
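As a hypothetical illustration of this growth (the numbers below are ours, not TPOT-RL's): with |U| = 2 and m = 11,

$$|A| = 8 \;\Rightarrow\; |V| = 2^{8} \cdot 11 = 2816, \qquad |A| = 30 \;\Rightarrow\; |V| = 2^{30} \cdot 11 \approx 1.2 \times 10^{10}.$$

Partitioning leaves each agent with $|V|/m = |U|^{|A|}$ feature vectors, so the per-agent space still grows exponentially in |A|; this is the growth that the effective-size reduction in the value-function-learning step targets.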






Peter Stone
Fri Feb 27 18:45:43 EST 1998