Yet, two key shortcomings limit the scalability of these BDI-based theories and implementations. First, there are no techniques for the quantitative evaluation of the degree of optimality of their coordination behavior. While optimal teamwork may be impractical in real-world domains, such analysis would aid us in comparison of different theories/models and in identifying feasible improvements. One key reason for the difficulty in quantitative evaluation of most existing teamwork theories is that they ignore the various uncertainties and costs in real-world environments. For instance, joint intentions theory [Cohen & Levesque, 1991b] prescribes that team members attain mutual beliefs in key circumstances, but it ignores the cost of attaining mutual belief (e.g., via communication). Implementations that blindly follow such prescriptions could engage in highly suboptimal coordination. On the other hand, practical systems have addressed costs and uncertainties of real-world environments. For instance, STEAM [Tambe, 1997,Tambe & Zhang, 1998] extends joint intentions with decision-theoretic communication selectivity. Unfortunately, the very pragmatism of such approaches often necessarily leads to a lack of theoretical rigor, so it remains unanswered whether STEAM's selectivity is the best an agent can do, or whether it is even necessary at all. The second key shortcoming of existing teamwork research is the lack of a characterization of the computational complexity of various aspects of teamwork decisions. Understanding the computational advantages of a practical coordination prescription could potentially justify the use of that prescription as an approximation to optimality in particular domains.
To address these shortcomings, we propose a new complementary framework, the COMmunicative Multiagent Team Decision Problem (COM-MTDP), inspired by work in economic team theory [Marschak & Radner, 1971,Yoshikawa, 1978,Ho, 1980]. While our COM-MTDP model borrows from a theory developed in another field, we make several contributions in applying and extending the original theory, most notably adding explicit models of communication and system dynamics. With these extensions, the COM-MTDP generalizes other recently developed multiagent decision frameworks, such as decentralized POMDPs [Bernstein, Zilberstein, & Immerman, 2000].
Our definition of a team (like that in economic team theory) assumes only that team members have a common goal and that they work selflessly towards that goal (i.e., they have no other private goals of their own). In terms of our decision-theoretic framework, we assume that all of the team members share the same joint utility function; that is, each team member's individual preferences are exactly the preferences of the other members and, thus, of the team as a whole. Our definition may appear to be a ``bare-bones'' definition of a team, since it does not include common concepts and assumptions from the literature on what constitutes a team (e.g., the teammates form a joint commitment [Cohen & Levesque, 1991b], attain mutual belief upon termination of a joint goal, intend that teammates succeed in their tasks [Grosz & Kraus, 1996], etc.). From our COM-MTDP perspective, we view these concepts as intermediate concepts, as the means by which agents improve their team's overall performance, rather than as ends in themselves. Our hypothesis in this investigation is that our COM-MTDP-based analysis can provide concrete justifications for these concepts. For example, while mutual belief has no inherent value, our COM-MTDP model can quantify the improved performance that we would expect from a team that attains mutual belief about important aspects of its execution.
More generally, this paper demonstrates three new types of teamwork analyses made possible by the COM-MTDP model. First, we analyze the computational complexity of teamwork within subclasses of problem domains. For instance, some researchers have advocated teamwork without communication [Goldberg & Mataric, 1997]. We use the COM-MTDP model to show that, in general, the problem of constructing optimal teams without communication is NEXP-complete, but allowing free communication reduces the problem to one that is PSPACE-complete. This paper presents a breakdown of the complexity of optimal teamwork over problem domains classified along the dimensions of observability and communication cost.
Second, the COM-MTDP model provides a powerful tool for comparing the optimality of different coordination prescriptions across classes of domains. Indeed, we illustrate that we can encode existing team coordination strategies within a COM-MTDP for evaluation. For our analysis, we selected two joint intentions-based approaches from the literature: one using the approach realized within GRATE* and the joint responsibility model [Jennings, 1995], and another based on STEAM [Tambe, 1997]. Through this encoding, we derive the conditions under which these team coordination strategies generate optimal team behavior, and the complexity of the decision problems addressed by them. Furthermore, we also derive a novel team coordination algorithm that outperforms these existing strategies in optimality, though not in efficiency. The end result is a well-grounded characterization of the complexity-optimality tradeoff among various means of team coordination.
Third, we can use the COM-MTDP model to empirically analyze a specific domain of interest. We have implemented reusable, domain-independent algorithms that allow one to evaluate the optimality of the behavior generated by different prescriptive policies within a problem domain represented as a COM-MTDP. We apply these algorithms in an example domain to empirically evaluate the aforementioned team coordination strategies, characterizing the optimality of each strategy as a function of the properties of the underlying domain. For instance, Jennings reports experimental results [Jennings, 1995] indicating that his joint responsibility teamwork model leads to lower waste of community effort than competing methods of accomplishing teamwork. With our COM-MTDP model, we were able to demonstrate the benefits of Jennings' approach under many configurations of our example domain. However, in precisely characterizing the types of domains that showed such benefits, we also identified domains where these competing methods may actually perform better. In addition, we can use our COM-MTDP model to re-create and explain previous work that noted an instance of suboptimality in a STEAM-based, real-world implementation [Tambe, 1997]. While this previous work treated that suboptimality as anomalous, our COM-MTDP re-evaluation of the domain demonstrated that the observed suboptimality was a symptom of STEAM's general propensity towards extraneous communication in a significant range of domain types. Both the algorithms and the example domain model are available for public use in an Online Appendix 1.
Section 2 presents the COM-MTDP model's representation and places it in the context of related multiagent models from the literature. Section 3 uses the COM-MTDP model to define and characterize the complexity of designing optimal agent teams. Section 4 analyzes the optimality of existing team coordination algorithms and derives a novel coordination algorithm. Section 5 presents empirical results from applying our COM-MTDP algorithms to an example domain. Section 6 summarizes our results, and Section 7 identifies some promising future directions.
* Extension to Dynamic Problem: $P$. The original team decision problem focused on a one-shot, static problem. We extend the original concept so that each component is a time series of random variables. The effects of domain-level actions (e.g., a flying action changes a helicopter's position) obey a probabilistic distribution, given by a function $P: S \times A_\alpha \times S \rightarrow [0,1]$. In other words, for each initial state $s$ at time $t$, combined action $a$ taken at time $t$, and final state $s'$ at time $t+1$, $\Pr(S^{t+1}=s' \mid S^t=s, A_\alpha^t=a) = P(s,a,s')$. The given definition of $P$ assumes that the world dynamics obey the Markov assumption.
* Extension of Allowable Information Structures: $O_\alpha$. We extend the information structure representation to allow for uncertain observations. We use a general stochastic model, borrowed from the partially observable Markov decision process model [Smallwood & Sondik, 1973], with a joint observation function: $O_\alpha(s,a,\omega) = \Pr(\Omega_\alpha^t=\omega \mid S^t=s, A_\alpha^{t-1}=a)$. This function models the sensors, representing any errors, noise, etc. In some cases, we can separate this joint distribution into individual observation functions: $O_\alpha \equiv \prod_{i \in \alpha} O_i$, where $O_i(s,a,\omega) = \Pr(\Omega_i^t=\omega \mid S^t=s, A_\alpha^{t-1}=a)$. Thus, the probability distribution specified by $O_\alpha$ forms the richer information structure used in our model. We can make useful distinctions between different classes of information structures, in particular whether the domain is individually observable, collectively observable, collectively partially observable, or unobservable.
* Extension to Richer Belief State Space: $B_\alpha$. We generalize the set of possible strategies to capture the more complex mental states of the agents. Each agent, $i \in \alpha$, forms a belief state, $b_i^t \in B_i$, based on its observations seen through time $t$, where $B_i$ circumscribes the set of possible belief states for the agent. Thus, we define the set of possible domain-level policies as mappings from belief states to actions, $\pi_{iA}: B_i \rightarrow A_i$. We define the set of possible combined belief states over all agents to be $B_\alpha \equiv \prod_{i \in \alpha} B_i$. The corresponding random variable, $b_\alpha^t$, represents the agents' combined belief state at time $t$. We elaborate on different types of belief states and the mapping of observations to belief states (i.e., the state estimator function) in Section 2.2.1. A minimal data-structure sketch of the resulting COM-MTDP tuple follows this list.
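To make the components concrete, here is a minimal sketch (in Python) of how one might represent a finite COM-MTDP $\langle S, A_\alpha, \Sigma_\alpha, P, \Omega_\alpha, O_\alpha, B_\alpha, R\rangle$ as a data structure. The class and field names are our own illustration, not the interface of the tools in Online Appendix 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
JointAction = Tuple[Hashable, ...]       # one domain-level action per agent
JointMessage = Tuple[Hashable, ...]      # one message per agent
JointObservation = Tuple[Hashable, ...]  # one observation per agent

@dataclass
class ComMTDP:
    """A finite COM-MTDP <S, A_alpha, Sigma_alpha, P, Omega_alpha, O_alpha, B_alpha, R>."""
    agents: List[str]
    states: List[State]
    actions: Dict[str, List[Hashable]]        # A_i for each agent i
    messages: Dict[str, List[Hashable]]       # Sigma_i for each agent i
    observations: Dict[str, List[Hashable]]   # Omega_i for each agent i
    transition: Callable[[State, JointAction, State], float]              # P(s, a, s')
    observation: Callable[[State, JointAction, JointObservation], float]  # O_alpha(s, a, omega)
    reward: Callable[[State, JointAction, JointMessage], float]           # R(s, a, sigma), shared by the team
```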
With communication, we divide each decision epoch into two phases: the pre-communication and post-communication phases, denoted by the subscripts $\bullet\Sigma$ and $\Sigma\bullet$, respectively. In particular, the agents update their belief states at two distinct points within each decision epoch: once upon receiving observation $\Omega_i^t$ (producing the pre-communication belief state $b_{i\bullet\Sigma}^t$), and again upon receiving the other agents' messages (producing the post-communication belief state $b_{i\Sigma\bullet}^t$). The distinction allows us to differentiate between the belief state used by the agents in selecting their communication actions and the more ``up-to-date'' belief state used in selecting their domain-level actions. We also distinguish between the separate state-estimator functions used in each update phase:

$$SE_{i\bullet\Sigma}: B_i \times \Omega_i \rightarrow B_i \qquad\qquad SE_{i\Sigma\bullet}: B_i \times \Sigma_\alpha \rightarrow B_i$$
In this paper, as a first step, we assume that the agents have perfect recall. In other words, the agents recall all of their observations, as well as all communication of the other agents. Thus, their belief states can represent their entire histories as sequences of observations and received messages: $B_i = \Omega_i^* \times \Sigma_\alpha^*$, where $X^*$ denotes the set of all possible sequences (of any length) of elements of $X$. The agents realize perfect recall through the following state estimator functions, which simply append each new observation and each received combined message to the agent's history:

$$SE_{i\bullet\Sigma}\!\left(b_{i\Sigma\bullet}^{t-1}, \omega\right) = b_{i\Sigma\bullet}^{t-1} \cdot \omega \qquad (5)$$
$$SE_{i\Sigma\bullet}\!\left(b_{i\bullet\Sigma}^{t}, \sigma\right) = b_{i\bullet\Sigma}^{t} \cdot \sigma \qquad (6)$$
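As an illustration, the perfect-recall state estimators reduce to simple history appending. The following sketch assumes belief states are stored as tuples of tagged entries; the function names are our own shorthand, not the released implementation's.

```python
from typing import Hashable, Tuple

Belief = Tuple[Hashable, ...]   # a history of observations and joint messages

def se_pre_communication(belief: Belief, observation: Hashable) -> Belief:
    """SE_{i,pre}: append the latest observation to agent i's history."""
    return belief + (("obs", observation),)

def se_post_communication(belief: Belief, joint_message: Hashable) -> Belief:
    """SE_{i,post}: append the received combined message to the history."""
    return belief + (("msg", joint_message),)
```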
We extend our definition of a policy of behavior to include a communication policy, $\pi_{i\Sigma}: B_i \rightarrow \Sigma_i$, analogous to Section 2.1.4's domain-level policy. We define the joint policies, $\pi_{\alpha\Sigma}$ and $\pi_{\alpha A}$, as the combined policies across all agents in $\alpha$.
As with the observability function, we parameterize the communication costs associated with message transmissions:
The free-communication case appears in the literature when researchers wish to focus on issues other than communication cost. Although real-world domains rarely exhibit such ideal conditions, we may be able to model some domains as having approximately free communication. In addition, analyzing this extreme case gives us some understanding of the benefit of communication, even if the results do not apply across all domains. We also identify the no-communication case because such decision problems have been of interest to researchers as well [Goldberg & Mataric, 1997]. Of course, even if $\Sigma_\alpha = \emptyset$, it is possible that there are domain-level actions in $A_\alpha$ that have implicit communicative value by acting as signals that convey information to the other agents. However, we still label such agent teams as having no communication for the purposes of the work here, since many of our results exploit an explicit separation between domain-level and communication-level actions.
Next, each agent $i \in \alpha$ selects a message according to its communication policy, $\Sigma_i^0 = \pi_{i\Sigma}(b_{i\bullet\Sigma}^0)$, defining a combined communication, $\Sigma_\alpha^0$. Each agent interprets the communications of all of the others by updating its belief state, $b_{i\Sigma\bullet}^0 = SE_{i\Sigma\bullet}(b_{i\bullet\Sigma}^0, \Sigma_\alpha^0)$. Each then selects an action according to its domain-level policy, $A_i^0 = \pi_{iA}(b_{i\Sigma\bullet}^0)$, defining a combined action $A_\alpha^0$. By our central assumption of teamwork, each agent receives the same joint reward, $R^0 = R(S^0, A_\alpha^0, \Sigma_\alpha^0)$. The world then moves into a new state, $S^1$, according to the distribution $P(S^0, A_\alpha^0)$. Again, each agent $i$ receives an observation $\Omega_i^1$ drawn from $\Omega_i$ according to the distribution $O_\alpha(S^1, A_\alpha^0, \Omega_\alpha^1)$, and it updates its belief state, $b_{i\bullet\Sigma}^1 = SE_{i\bullet\Sigma}(b_{i\Sigma\bullet}^0, \Omega_i^1)$.
The process continues, with agents choosing communication- and domain-level actions, observing the effects, and updating their beliefs. Thus, in addition to the time series of world states, $S^0, S^1, \ldots, S^t$, the agents themselves determine a time series of communication-level and domain-level actions, $\Sigma_\alpha^0, \Sigma_\alpha^1, \ldots, \Sigma_\alpha^t$ and $A_\alpha^0, A_\alpha^1, \ldots, A_\alpha^t$, respectively. We also have a time series of observations for each agent $i$, $\Omega_i^0, \Omega_i^1, \ldots, \Omega_i^t$. Likewise, we can treat the combined observations, $\Omega_\alpha^0, \Omega_\alpha^1, \ldots, \Omega_\alpha^t$, as a similar time series of random variables.
Finally, the agents receive a series of rewards, $R^0, R^1, \ldots, R^t$. We can define the value, $V$, of the policies, $\pi_{\alpha A}$ and $\pi_{\alpha\Sigma}$, as the expected reward received when executing those policies. Over a finite horizon, $T$, this value is equivalent to the following:

$$V^T\!\left(\pi_{\alpha A}, \pi_{\alpha\Sigma}\right) = E\!\left[\sum_{t=0}^{T} R\!\left(S^t, A_\alpha^t, \Sigma_\alpha^t\right)\right] \qquad (7)$$
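The following sketch illustrates the execution process just described and a Monte-Carlo estimate of $V^T$ in Equation 7. The per-agent policy objects, estimator functions, and sampling helpers passed in as arguments are our own assumed interface, not the authors' code.

```python
def run_episode(model, pi_A, pi_Sigma, se_pre, se_post, sample_next_state,
                sample_joint_observation, initial_state, horizon):
    """Simulate one run of the COM-MTDP process and return the total reward.
    pi_A, pi_Sigma: dicts of per-agent domain-level and communication policies;
    sample_joint_observation returns a dict of observations keyed by agent name."""
    state = initial_state
    beliefs = {i: () for i in model.agents}            # empty histories at t = 0
    total_reward = 0.0
    joint_obs = sample_joint_observation(state, None)  # initial observations
    for _ in range(horizon):
        # pre-communication belief update on the new observation
        beliefs = {i: se_pre(beliefs[i], joint_obs[i]) for i in model.agents}
        # every agent selects a message; the combined message is heard by all
        joint_msg = tuple(pi_Sigma[i](beliefs[i]) for i in model.agents)
        beliefs = {i: se_post(beliefs[i], joint_msg) for i in model.agents}
        # every agent selects a domain-level action from its updated belief
        joint_act = tuple(pi_A[i](beliefs[i]) for i in model.agents)
        # the whole team receives the same joint reward (summed in Equation 7)
        total_reward += model.reward(state, joint_act, joint_msg)
        # world transition and next joint observation
        state = sample_next_state(state, joint_act)
        joint_obs = sample_joint_observation(state, joint_act)
    return total_reward

def estimate_value(model, pi_A, pi_Sigma, se_pre, se_post, sample_next_state,
                   sample_joint_observation, initial_state, horizon, runs=1000):
    """Monte-Carlo estimate of V^T for the given joint policies."""
    returns = [run_episode(model, pi_A, pi_Sigma, se_pre, se_post,
                           sample_next_state, sample_joint_observation,
                           initial_state, horizon)
               for _ in range(runs)]
    return sum(returns) / len(returns)
```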
Model | $\Sigma_\alpha$ | $O_\alpha$
DEC-POMDP | no communication | collective partial observability |
POIPSG | no communication | collective partial observability |
MMDP | no communication | individual observability |
Xuan-Lesser | general communication | collective observability |
Theorem 1
The decision problem of whether there exist policies, $\pi_{\alpha\Sigma}$ and $\pi_{\alpha A}$, for a given COM-MTDP, under general communication and collective partial observability, that yield a total reward at least $K$ over some finite horizon $T$ is NEXP-complete if $|\alpha| \ge 2$ (i.e., more than one agent).
The proof of NEXP-hardness proceeds by reduction from the DEC-POMDP: given a DEC-POMDP, $\langle S, A_\alpha, P, \Omega_\alpha, O_\alpha, R\rangle$, we construct an equivalent COM-MTDP with no communication,

$$\langle S, A_\alpha, \emptyset, P, \Omega_\alpha, O_\alpha, B_\alpha, R\rangle \qquad (8)$$

so that a policy of the one achieves a total reward of at least $K$ exactly when the corresponding policy of the other does.
To show that the COM-MTDP is in NEXP, our proof proceeds similarly to that of the DEC-POMDP. In other words, we guess the joint policy, $\pi_\alpha$, and write it down in exponential time (we assume that $T \le |S|$). We can take the COM-MTDP plus the policy and generate (in exponential time) a corresponding MDP where the state space is the space of all possible combined belief states of the agents. We can then use dynamic programming to determine (in exponential time) whether $\pi_\alpha$ generates an expected reward of at least $K$. $\Box$
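The membership argument can be illustrated by the following sketch, which evaluates a fixed joint policy by dynamic programming over pairs of world state and combined belief state. The model accessors, the list-indexed policies, and the simplification that observations depend only on the current state are our own assumptions for brevity.

```python
from functools import lru_cache

def expected_value(model, pi_A, pi_Sigma, se_pre, se_post, horizon):
    """Return a function value(t, state, beliefs) giving the expected reward
    of the fixed joint policy from time t onward; beliefs is a tuple of
    per-agent (hashable) belief states."""
    @lru_cache(maxsize=None)
    def value(t, state, beliefs):
        if t == horizon:
            return 0.0
        total = 0.0
        # for brevity, this sketch lets the joint observation depend only on
        # the current state rather than on the previous joint action as well
        for joint_obs, p_obs in model.joint_observation_dist(state):
            pre = tuple(se_pre(b, o) for b, o in zip(beliefs, joint_obs))
            joint_msg = tuple(pi_Sigma[i](pre[i]) for i in range(len(pre)))
            post = tuple(se_post(b, joint_msg) for b in pre)
            joint_act = tuple(pi_A[i](post[i]) for i in range(len(post)))
            reward = model.reward(state, joint_act, joint_msg)
            future = 0.0
            for next_state, p_next in model.transition_dist(state, joint_act):
                future += p_next * value(t + 1, next_state, post)
            total += p_obs * (reward + future)
        return total
    return value   # call as value(0, initial_state, initial_beliefs)
```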
In the remainder of this section, we examine the effect of communication on the complexity of constructing team policies that generate optimal behavior. We start by examining the case under the condition of free communication, where we would expect the benefit of communication to be the greatest. To begin with, suppose that each agent is capable of communicating its entire observation (i.e., $\Sigma_i \supseteq \Omega_i$). Before we analyze the complexity of the team decision problem, we first prove that the agents should exploit this capability and communicate their true observation, as long as they incur no cost in doing so:
Theorem 2
Under free communication, consider a team of agents using a communication policy: $\pi_{i\Sigma}(b_{i\bullet\Sigma}^t) \equiv \Omega_i^t$. If the domain-level policy $\pi_{\alpha A}$ maximizes $V^T(\pi_{\alpha A}, \pi_{\alpha\Sigma})$, then this combined policy is dominant over any other policies. In other words, for all policies, $\pi_{\alpha A}'$ and $\pi_{\alpha\Sigma}'$, $V^T(\pi_{\alpha A}, \pi_{\alpha\Sigma}) \ge V^T(\pi_{\alpha A}', \pi_{\alpha\Sigma}')$.
The communication policy, $\pi_{\alpha\Sigma}'$, produces a different set of belief states (denoted $b_{i\bullet\Sigma}'^t$ and $b_{i\Sigma\bullet}'^t$) than those for $\pi_{\alpha\Sigma}$ (denoted $b_{i\bullet\Sigma}^t$ and $b_{i\Sigma\bullet}^t$). In particular, we use state estimator functions, $SE_{i\bullet\Sigma}'$ and $SE_{i\Sigma\bullet}'$, as defined in Equations 5 and 6, to generate $b_{i\bullet\Sigma}'^t$ and $b_{i\Sigma\bullet}'^t$. Each belief state is a complete history of observation and communication pairs for each agent. On the other hand, under the complete communication of $\pi_{\alpha\Sigma}$, the state estimator functions of Equations 5 and 6 reduce to:

$$SE_{i\bullet\Sigma}\!\left(b_{i\Sigma\bullet}^{t-1}, \Omega_i^t\right) = b_{i\Sigma\bullet}^{t-1} \cdot \Omega_i^t \qquad\qquad SE_{i\Sigma\bullet}\!\left(b_{i\bullet\Sigma}^{t}, \Sigma_\alpha^t\right) = b_{i\bullet\Sigma}^{t} \cdot \Omega_\alpha^t$$

since, when every agent communicates its entire observation, the received combined message $\Sigma_\alpha^t$ is exactly the combined observation $\Omega_\alpha^t$. Thus, each agent's post-communication belief state contains the complete history of combined observations, which subsumes the information available to it under any other communication policy.
Given this dominance of the complete-communication policy, we can prove that the problem of constructing teams that coordinate optimally is simpler when communication is free.
Theorem 3
The decision problem of determining whether there exist policies,
paS and paA, for a
given COM-MTDP with free communication under collective partial
observability, that yield a total reward at least K over some finite
horizon T is PSPACE-complete.
To show that the problem is in PSPACE, we take a COM-MTDP under free communication and reduce it to a single-agent POMDP. In particular, if we are given a COM-MTDP, $\langle S, A_\alpha, \Sigma_\alpha, P, \Omega_\alpha, O_\alpha, B_\alpha, R\rangle$, we can construct a single-agent POMDP, $\langle S', A', P', \Omega', O', R'\rangle$, as follows:
$$S' = S, \quad A' = A_\alpha, \quad P' = P, \quad \Omega' = \Omega_\alpha, \quad O' = O_\alpha, \quad R'(s, a) = R_A(s, a) \qquad (12)$$

Here the single POMDP agent stands in for the entire team: it receives the combined observation (which, by Theorem 2, every member obtains through cost-free communication of all observations) and selects the combined domain-level action, while the communication component of the reward contributes nothing under free communication.
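As an illustration of this reduction (under our reading of Equation 12), the following sketch wraps a free-communication COM-MTDP as a single-agent POMDP whose agent observes the combined observation and chooses the combined domain-level action. The wrapper types and field names are assumptions of ours, matching the data-structure sketch in Section 2.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Hashable, List, Tuple

@dataclass
class SingleAgentPOMDP:
    states: List[Hashable]
    actions: List[Tuple[Hashable, ...]]   # combined domain-level actions A_alpha
    transition: Callable                  # P'(s, a, s') = P(s, a, s')
    observation: Callable                 # O'(s, a, omega) = O_alpha(s, a, omega)
    reward: Callable                      # R'(s, a) = R_A(s, a)

def free_comm_pomdp(com_mtdp, null_message):
    """Reduce a COM-MTDP with cost-free communication to a single-agent POMDP."""
    # every combined domain-level action tuple, one entry per agent
    joint_actions = [tuple(a) for a in
                     product(*(com_mtdp.actions[i] for i in com_mtdp.agents))]
    return SingleAgentPOMDP(
        states=com_mtdp.states,
        actions=joint_actions,
        transition=com_mtdp.transition,
        observation=com_mtdp.observation,
        # communication is free, so the communication component of the reward
        # is zero and any fixed message argument leaves only R_A(s, a)
        reward=lambda s, a: com_mtdp.reward(s, a, null_message),
    )
```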
Theorem 4
The decision problem of determining whether there exist policies,
paS and paA, for a
given COM-MTDP with free communication and collective
observability, that yield a total reward at least K over some finite
horizon T is P-complete.
Theorem 5
The decision problem of determining whether there exist policies,
paS and paA, for a
given COM-MTDP with individual observability, that yield a total
reward at least K over some finite horizon T (given integers K and
T) is P-complete.
Theorem 6
The decision problem of determining whether there exist policies,
paS and paA, for a
given COM-MTDP with non-observability, that yield a total
reward at least K over some finite horizon T (given integers K and
T) is NP-complete.
Thus, we have used the COM-MTDP framework to characterize the difficulty of problem domains in agent teamwork along the dimensions of communication cost and observability. Table 2 summarizes our results, which we can use in deciding where to concentrate our energies in attacking teamwork problems. We can use these results to draw some conclusions about the challenges to designers of multiagent teams:
 | Individually Observable | Collectively Observable | Collectively Partially Observable | Non-Observable
No Comm. | P-complete | NEXP-complete | NEXP-complete | NP-complete
General Comm. | P-complete | NEXP-complete | NEXP-complete | NP-complete
Free Comm. | P-complete | P-complete | PSPACE-complete | NP-complete
We demonstrate this methodology by using our COM-MTDP framework to analyze joint intentions theory [Cohen & Levesque, 1991b,Cohen & Levesque, 1991a,Levesque, Cohen, & Nunes, 1990], which provides a common basis for many existing approaches to team coordination. Section 4.1 models two key instantiations of joint intentions taken from the literature [Jennings, 1995,Tambe, 1997] as COM-MTDP communication policies. Section 4.2 analyzes the conditions under which these policies generate optimal behavior and provides a third candidate policy that makes communication decisions that are locally optimal within the context of joint intentions. In addition to providing the results for the particular team coordination strategies investigated, this section also illustrates a general methodology by which one can use our COM-MTDP framework to encode and evaluate coordination strategies proposed by existing multiagent research.
Joint intentions theory requires that team members jointly commit to a joint persistent goal, $G$. It also requires that when any team member privately believes that $G$ is achieved (or unachievable or irrelevant), it must then attain mutual belief throughout the team about this achievement (or unachievability or irrelevance). To encode this prescription of joint intentions theory within our COM-MTDP model, we first specify the joint goal, $G$, as a subset of states, $G \subseteq S$, where the desired goal is achieved (or unachievable or irrelevant).
Presumably, such a prescription indicates that joint intentions are not specifically intended for individually observable environments. Upon achieving the goal in an individually observable environment, each agent would simultaneously observe that $S^t \in G$. Because of our assumption that the COM-MTDP model components (including $O_\alpha$) are common knowledge to the team, each agent would also simultaneously come to believe that its teammates have observed that $S^t \in G$, and that its teammates believe that it believes that all of the team members have observed that $S^t \in G$, and so on. Thus, the team immediately attains mutual belief in the achievement of the goal under individual observability without any additional communication necessary by the team.
Instead, the joint intention framework aims at domains with some degree of unobservability. In such domains, the agents must signal the other agents, either through communication or some informative domain-level action, to attain mutual belief. However, we can also assume that joint intention theory does not focus on domains with free communication, where Theorem 2 shows that we can simply have the agents communicate everything, all the time, without the need for more complex prescriptions.
The joint intention framework does not specify a precise communication policy for the attainment of mutual belief. In this paper, we focus on communication only in the case of goal achievement, but our methodology extends to handle unachievability and irrelevance as well. One well-known approach [Jennings, 1995] applied joint intentions theory by having the agents communicate the achievement of the joint goal, $G$, as soon as they believe $G$ to be true. To instantiate the behavior of Jennings' agents within a COM-MTDP, we construct a communication policy, $\pi_{\alpha\Sigma}^J$, that specifies that an agent sends the special message, $\sigma_G$, when it first believes that $G$ holds. Following joint intentions' assumption of sincerity [Smith & Cohen, 1996], we require that the agents never select the special $\sigma_G$ message in a belief state unless they believe $G$ to be true with certainty. With this requirement and with our assumption of the team's common knowledge of the communication model, we can assume that all of the other agents immediately accept the special message, $\sigma_G$, as true, that the agents know that all their team members accept the message as true, and so on. Thus, the team attains mutual belief that $G$ is true immediately upon receiving the message, $\sigma_G$. We can construct the communication policy, $\pi_{\alpha\Sigma}^J$, in constant time.
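As an illustration, the following sketch encodes this policy; the belief test believes_G_achieved, the flag already_sent, and the message constants are our own shorthand rather than Jennings' implementation.

```python
SIGMA_G = "sigma_G"   # the special message announcing that G has been achieved
NULL = None           # the null (silent) communication

def jennings_policy(belief, believes_G_achieved, already_sent):
    """pi^J: send sigma_G the first time the agent believes G holds."""
    if believes_G_achieved(belief) and not already_sent:
        return SIGMA_G
    return NULL
```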
The STEAM algorithm is another instantiation of joint intentions that has had success in several real-world domains [Tambe, 1997,Pynadath, Tambe, Chauvat, & Cavedon, 1999,Tambe, Pynadath, Chauvat, Das, & Kaminka, 2000,Pynadath & Tambe, 2002]. Unlike Jennings' instantiation, the STEAM teamwork model includes decision-theoretic communication selectivity. A domain specification includes two parameters for each joint commitment, $G$: $\tau$, the probability of miscoordinated termination of $G$; and $C_{mt}$, the cost of miscoordinated termination of $G$. In this context, ``miscoordinated termination'' means that some agents immediately observe that the team has achieved $G$ while the rest do not. STEAM's domain specification also includes a third parameter, $C_c$, to represent the cost of communicating a fact (e.g., the achievement of $G$). Using these parameters, the STEAM algorithm evaluates whether the expected cost of miscoordination outweighs the cost of communication. STEAM expresses this criterion as the following inequality: $\tau \cdot C_{mt} > C_c$. We can define a communication policy, $\pi_{\alpha\Sigma}^S$, based on this criterion: if the inequality holds, then an agent that has observed the achievement of $G$ will send the message $\sigma_G$; otherwise, it will not. We can construct $\pi_{\alpha\Sigma}^S$ in constant time.
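A corresponding sketch of the STEAM-style policy differs only in the decision-theoretic test; again, the function signature is our own, with only the criterion $\tau \cdot C_{mt} > C_c$ taken from the text.

```python
SIGMA_G, NULL = "sigma_G", None   # as in the previous sketch

def steam_policy(belief, believes_G_achieved, already_sent, tau, c_mt, c_c):
    """pi^S: same trigger as pi^J, but send only if tau * C_mt > C_c holds."""
    if believes_G_achieved(belief) and not already_sent and tau * c_mt > c_c:
        return SIGMA_G
    return NULL
```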
Both policies, $\pi_{\alpha\Sigma}^J$ and $\pi_{\alpha\Sigma}^S$, consider sending $\sigma_G$ only when an agent first believes that $G$ has been achieved. Once an agent holds that belief, the two policies can make different choices, and we consider here what the optimal decision is at this point. The domain is not individually observable, so certain agents may be unaware of the achievement of $G$. If the $\sigma_G$ message is not sent, these unaware agents may unnecessarily continue performing actions in pursuit of achieving $G$. The performance of these extraneous actions could potentially incur costs and lead to a lower utility than one would expect when sending the $\sigma_G$ message.
The decision to send $\sigma_G$ or not matters only if the team achieves $G$ and one agent comes to know this fact. We define the random variable, $T_G$, to be the earliest time at which an agent knows this fact. We denote by $K_G$ the agent who knows of the achievement at time $T_G$. If $K_G=i$, for some agent, $i$, and $T_G=t_0$, then agent $i$ has some pre-communication belief state, $b_{i\bullet\Sigma}^{t_0}=b$, that indicates that $G$ has been achieved. To more precisely quantify the difference between agent $i$ sending the $\sigma_G$ message at time $T_G$ vs. never sending it, we define the following value:

$$\Delta^T(t_0, i, b) \equiv E\!\left[\left.\sum_{t=t_0}^{T} R^t \,\right|\, \Sigma_i^{t_0}=\sigma_G, T_G=t_0, K_G=i, b_{i\bullet\Sigma}^{t_0}=b\right] - E\!\left[\left.\sum_{t=t_0}^{T} R^t \,\right|\, \Sigma_i^{t_0}=\mathrm{null}, T_G=t_0, K_G=i, b_{i\bullet\Sigma}^{t_0}=b\right] \qquad (13)$$
Unfortunately, while Equation 13 identifies an exact criterion for locally optimal communication, this criterion is not yet operational. In other words, we cannot directly implement it as a communication policy for the agents. Furthermore, Equation 13 hides the underlying complexity of the computation involved, which is one of the key goals of our analysis. Therefore, we use the COM-MTDP model to derive an operational expression of $\Delta^T \ge 0$. For simplicity, we define notational shorthand for various sequences and combinations of values. We define a partial sequence of random variables, $X^{<t}$, to be the sequence of random variables for all times before $t$: $X^0, X^1, \ldots, X^{t-1}$. We make similar definitions for the other relational operators (i.e., $X^{>t}$, $X^{\ge t}$, etc.). The expression $(S)^T$ denotes the cross product over states of the world, $\prod_{t=0}^{T} S$, as distinguished from the time-indexed random variable, $S^T$, which denotes the value of the state at time $T$. The notation $s^{\ge t_0}[t]$ specifies the element in slot $t$ within the vector $s^{\ge t_0}$. We define the function, $U$, as shorthand within our probability expressions. It allows us to compactly represent a particular subsequence of world and agent belief states occurring, conditioned on the current situation ($T_G=t_0$, $K_G=i$, and $b_{i\bullet\Sigma}^{t_0}=b$), as follows:

$$U(\langle t_1, t_2 \rangle, s, b) \equiv \bigwedge_{t=t_1}^{t_2} \left( S^t = s[t] \wedge b_{\alpha\bullet\Sigma}^t = b[t] \right) \qquad (14)$$

We also write $b_{\Sigma\bullet}(b_{\bullet\Sigma}, \pi_{\alpha\Sigma})$ for the combined post-communication belief state that results from combined pre-communication belief state $b_{\bullet\Sigma}$ when the agents communicate according to policy $\pi_{\alpha\Sigma}$, and $\pi_{\alpha\Sigma}+\sigma$ for the policy that behaves exactly like $\pi_{\alpha\Sigma}$ except that agent $i$ sends message $\sigma$ at time $t_0$:

$$b_{\Sigma\bullet}\!\left(b_{\bullet\Sigma}, \pi_{\alpha\Sigma}\right) \equiv \left\langle SE_{i\Sigma\bullet}\!\left(b_{i\bullet\Sigma}, \pi_{\alpha\Sigma}(b_{\bullet\Sigma})\right) \right\rangle_{i \in \alpha} \qquad (15)$$
Theorem 7
If we assume that, upon achievement of $G$, no communication other than $\sigma_G$ is possible, then the condition $\Delta^T(t_0,i,b) \ge 0$ holds if and only if:

$$\begin{split}
\sum_{s^{\le t_0} \in (S)^{t_0}} \; \sum_{b_{\bullet\Sigma}^{\le t_0} \in B_\alpha^{t_0}} \Pr\!\left(U(\langle 0,t_0\rangle, s^{\le t_0}, b_{\bullet\Sigma}^{\le t_0})\right) \cdot \Bigg(
& \sum_{s^{\ge t_0} \in (S)^{T-t_0+1}} \; \sum_{b_{\bullet\Sigma}^{\ge t_0} \in B_\alpha^{T-t_0+1}} \Pr\!\left(U(\langle t_0,T\rangle, s^{\ge t_0}, b_{\bullet\Sigma}^{\ge t_0}) \,\middle|\, \Sigma_i^{t_0}=\sigma_G,\, U(\langle 0,t_0\rangle, s^{\le t_0}, b_{\bullet\Sigma}^{\le t_0})\right) \cdot \sum_{t=t_0}^{T} R_A\!\left(s^{\ge t_0}[t],\, \pi_{\alpha A}\!\left(b_{\Sigma\bullet}\!\left(b_{\bullet\Sigma}^{\ge t_0}[t], \pi_{\alpha\Sigma}+\sigma_G\right)\right)\right) \\
- & \sum_{s^{\ge t_0} \in (S)^{T-t_0+1}} \; \sum_{b_{\bullet\Sigma}^{\ge t_0} \in B_\alpha^{T-t_0+1}} \Pr\!\left(U(\langle t_0,T\rangle, s^{\ge t_0}, b_{\bullet\Sigma}^{\ge t_0}) \,\middle|\, \Sigma_i^{t_0}=\mathrm{null},\, U(\langle 0,t_0\rangle, s^{\le t_0}, b_{\bullet\Sigma}^{\le t_0})\right) \cdot \sum_{t=t_0}^{T} R_A\!\left(s^{\ge t_0}[t],\, \pi_{\alpha A}\!\left(b_{\Sigma\bullet}\!\left(b_{\bullet\Sigma}^{\ge t_0}[t], \pi_{\alpha\Sigma}+\mathrm{null}\right)\right)\right) \Bigg) \\
\ge & \; -\sum_{s \in G} \; \sum_{b \in B_\alpha} \Pr\!\left(U(\langle t_0,t_0\rangle, s, b)\right) R_\Sigma(s, \sigma_G) \qquad (16)
\end{split}$$
We can rewrite these summations more simply using our shorthand notation as a comparison of expectations: the condition $\Delta^T(t_0,i,b) \ge 0$ holds if and only if

$$E\!\left[\left.\sum_{t=t_0}^{T} R_A\!\left(S^t, A_\alpha^t\right) \,\right|\, \Sigma_i^{t_0}=\sigma_G\right] - E\!\left[\left.\sum_{t=t_0}^{T} R_A\!\left(S^t, A_\alpha^t\right) \,\right|\, \Sigma_i^{t_0}=\mathrm{null}\right] \ge -E\!\left[R_\Sigma\!\left(S^{t_0}, \sigma_G\right)\right]$$
Theorem 7 states, informally, that we prefer sending $\sigma_G$ whenever the cost of execution after achieving $G$ outweighs the cost of communicating the fact that $G$ has been achieved. More precisely, the outer summations on the left-hand side of the inequality iterate over all possible past histories of world and belief states, producing a probability distribution over the possible states the team can be in at time $t_0$. For each such state, the expression inside the parentheses computes the difference in domain-level reward, over all possible future sequences of world and belief states, between sending and not sending $\sigma_G$. By our theorem's assumption that no communication other than $\sigma_G$ is possible after $G$ has been achieved, we can ignore any communication costs in the future. However, if we relax this assumption, we can extend the left-hand side in a straightforward manner into a longer expression that accounts for the difference in future communication costs as well. Thus, the left-hand side captures our intuition that, when not communicating, the team will incur a cost if the agents other than $i$ are unaware of $G$'s achievement. The right-hand side of the inequality is a summation of the cost of sending the $\sigma_G$ message over possible current states and belief states.
We can use Theorem 7 to derive the locally optimal communication decision across various classes of problem domains. Under no communication, we cannot send $\sigma_G$. Under free communication, the right-hand side is 0, so the inequality is always true, and we know to prefer sending $\sigma_G$. Under no assumptions about communication, the determination is more complicated. When the domain is individually observable, the left-hand side becomes 0, because all of the agents know that $G$ has been achieved (and thus there is no difference in execution when sending $\sigma_G$). Therefore, the inequality is always false (unless communication is free), and we prefer not sending $\sigma_G$. When the environment is not individually observable and communication is available but not free, then, to be locally optimal at time $t_0$, agent $i$ must evaluate Inequality 16 in its full complexity. Since the inequality sums rewards over all possible sequences of states and observations, the time complexity of the corresponding algorithm is $O((|S| \cdot |\Omega_\alpha|)^T)$. While this complexity is unacceptable for most real-world problems, it still provides an exponential savings over searching the entire policy space for the globally optimal policy, where any agent could potentially send $\sigma_G$ at times other than $T_G$. Table 3 summarizes the complexity of determining the locally optimal policy under the various domain properties.
 | Individually Observable | Collectively Observable | Collectively Partially Observable | Non-Observable
No Comm. | $\Omega(1)$ | $\Omega(1)$ | $\Omega(1)$ | $\Omega(1)$
General Comm. | $\Omega(1)$ | $O((|S| \cdot |\Omega_\alpha|)^T)$ | $O((|S| \cdot |\Omega_\alpha|)^T)$ | $\Omega(1)$
Free Comm. | $\Omega(1)$ | $\Omega(1)$ | $\Omega(1)$ | $\Omega(1)$
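The structure of this locally optimal test can be sketched as follows, with the expected-value estimators passed in as functions. An exact implementation would enumerate every future state/observation sequence (hence the exponential bound above); this hedged sketch simply assumes such estimators are supplied, and all names are our own.

```python
def locally_optimal_send(estimate_future_domain_reward, expected_comm_cost,
                         t0, belief, horizon):
    """Decide whether agent i should send sigma_G at time t0 (Inequality 16):
    send iff the expected gain in future domain-level reward from sending,
    relative to staying silent, outweighs the expected communication cost."""
    gain = (estimate_future_domain_reward(t0, belief, send=True, horizon=horizon)
            - estimate_future_domain_reward(t0, belief, send=False, horizon=horizon))
    return gain >= expected_comm_cost(belief)
```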
We can now show that although Theorem 7's algorithm for locally optimal communication provides a significant computational savings over finding the global optimum, it still outperforms existing teamwork models, as exemplified by our $\pi_{\alpha\Sigma}^J$ and $\pi_{\alpha\Sigma}^S$ policies. First, we can use the criterion of Theorem 7 to evaluate the optimality of the policy $\pi_{\alpha\Sigma}^J$. If $\Delta^T(t_0,i,b) \ge 0$ for all possible times $t_0$, agents $i$, and belief states $b$ that are consistent with the achievement of the goal $G$, then the locally optimal policy will always specify sending $\sigma_G$. In other words, $\pi_{\alpha\Sigma}^J$ will be identical to the locally optimal policy. However, if the inequality of Theorem 7 is ever false, then $\pi_{\alpha\Sigma}^J$ is not even locally, let alone globally, optimal.
Second, we can also use Theorem 7 to evaluate STEAM by viewing STEAM's inequality, $\tau \cdot C_{mt} > C_c$, as a crude approximation of Inequality 16. In fact, there is a clear correspondence between the terms in the two inequalities. The left-hand side of Inequality 16 computes an exact expected cost of miscoordination. However, unlike STEAM's monolithic $\tau$ parameter, the optimal criterion evaluates a complete probability distribution over all possible states of miscoordination by considering all possible past sequences consistent with the agent's current beliefs. Likewise, unlike STEAM's monolithic $C_{mt}$ parameter, the optimal criterion looks ahead over all possible future sequences of states to determine the true expected cost of miscoordination. Furthermore, we can view STEAM's parameter, $C_c$, as an approximation of the communication cost computed by the right-hand side of Inequality 16. Again, STEAM uses a single parameter, while the optimal criterion computes an expected cost over all possible states of the world.
STEAM does have some flexibility in its representation, because $C_{mt}$, $\tau$, and $C_c$ are not necessarily fixed across the entire domain. For instance, $C_{mt}$ may vary based on the specific joint plan to which the agents have jointly committed (i.e., there may be a different $C_{mt}$ for each goal $G$). Thus, while Theorem 7 suggests significant additional flexibility in computing $C_{mt}$ through explicit lookahead, the optimal criterion derived with the COM-MTDP model also provides a justification for the overall structure behind STEAM's approximate criterion. Furthermore, STEAM's emphasis on on-line computation makes the computational complexity of Inequality 16 (as presented in Table 3) unacceptable, so the approximation error may be acceptable given the gains in efficiency. For a specific domain, we can use empirical evaluation (as demonstrated in the next section) to quantify the error and efficiency and thus precisely judge this tradeoff.
This section presents results of a COM-MTDP analysis of an example domain involving agent-piloted helicopters, where we focus on the key communication decision faced by many multiagent frameworks (as described in Section 4), but vary the cost of communication and degree of observability to generate a space of distinct domains with different implications for the agents' performance. By evaluating communication policies over various configurations of this particular testbed domain, we demonstrate a methodology by which one can use the COM-MTDP framework to model any problem domain and to evaluate candidate communication policies for it.
The two agents form a top-level joint commitment, $G_D$, to reach their destination. There is no incentive for the agents to communicate the achievement of this goal, since they will both eventually reach their destination with certainty. However, in the service of their top-level goal, $G_D$, the two agents also adopt a joint commitment, $G_R$, of destroying the radar unit. We consider here the problem facing Escort with respect to communicating the achievement of goal $G_R$. If Escort communicates the achievement of $G_R$, then Transport knows that it is safe to fly at its normal altitude (thus reaching the destination sooner). If Escort does not communicate the achievement of $G_R$, there is still some chance that Transport will observe the event anyway. If Transport does not observe the achievement of $G_R$, then it must fly nap-of-the-earth the whole distance, and the team receives a lower reward because of the later arrival. Therefore, Escort must weigh the increase in expected reward against the cost of communication.
In the COM-MTDP model of this scenario (presented in Figures 2, 3 and 4), the world state is the position (along a straight line between origin and destination) of Transport, Escort, and the enemy radar. The enemy is at a randomly selected position somewhere between the agents' initial position and their destination. Transport has no possible communication actions, but it can choose between two domain-level actions: flying nap-of-the-earth and flying at its normal speed and altitude. Escort has two domain-level actions: flying at its normal speed and destroying the radar. Escort also has the option of communicating the special message, $\sigma_{G_R}$, indicating that the radar has been destroyed. In the tables of Figures 2, 3 and 4, the ``$\cdot$'' symbol represents a wild-card (or ``don't care'') entry.
$\alpha$ = {Escort (E), Transport (T)}
$S$ = $X_E \times X_T \times X_R$
  Position of Escort: $X_E$ = {0, 1, ..., 8, 9, Destination}
  Position of Transport: $X_T$ = {0, 0.5, ..., 9, 9.5, Destination, Destroyed}
  Position of Radar: $X_R$ = {1, 2, ..., 8, Destroyed}
$A_\alpha$ = $A_E \times A_T$ = {fly, destroy, wait} $\times$ {fly-NOE, fly-normal, wait}
$\Sigma_\alpha$ = $\Sigma_E \times \Sigma_T$ = {clear ($\sigma_{G_R}$), null} $\times$ {null}
$R_A(\langle x_E, x_T, x_R \rangle, a)$ = domain-level reward table (the team receives a fixed reward for each time step that the helicopters spend at Destination)
$R_\Sigma(s, \langle \mathrm{null}, \mathrm{null} \rangle)$ = 0
$R_\Sigma(s, \langle \sigma_{G_R}, \mathrm{null} \rangle)$ = $-r_\Sigma$, with $r_\Sigma \in [0,1]$
(Figures 3 and 4: the transition function $P$ and observation function $O_\alpha$ for this scenario, including the observability parameter $\lambda$ governing Transport's chance of observing the radar's destruction.)
If Escort arrives at the radar, then it observes its presence with certainty and can destroy it to achieve $G_R$. The likelihood of Transport's observing the radar's destruction is a function of its distance from the radar. We can vary this function's observability parameter ($\lambda$ in Figure 4) within the range [0,1] to generate distinct domain configurations (0 means that Transport will never observe the radar's destruction; 1 means that Transport will always observe it). If the observability is 1, then the agents achieve mutual belief in the achievement of $G_R$ as soon as it occurs (following the argument presented in Section 4.1). However, for any observability less than 1, there is a chance that the agents will not achieve mutual belief simply by common observation. The helicopters receive a fixed reward for each time step spent at their destination. Thus, for a fixed time horizon, the earlier the helicopters reach their destination, the greater the team's reward. Since flying nap-of-the-earth is slower than flying at normal speed, Transport will switch to normal flight as soon as it either observes that $G_R$ has been achieved or Escort sends the message $\sigma_{G_R}$. Sending the message is not free, so we impose a variable communication cost ($r_\Sigma$ in Figure 2), also within the range [0,1].
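For concreteness, a possible form of Transport's observation model is sketched below; the text fixes only the endpoints of the observability parameter, so the particular distance-based falloff here is purely our own assumption.

```python
def prob_transport_observes(distance, lam, max_distance):
    """Probability that Transport observes the radar's destruction, given its
    distance from the radar and the observability parameter lambda in [0, 1].
    The linear falloff below is an illustrative assumption, not the paper's table."""
    if lam >= 1.0:
        return 1.0          # observability 1: always observed
    if lam <= 0.0:
        return 0.0          # observability 0: never observed
    return lam * max(0.0, 1.0 - distance / max_distance)
```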
We constructed COM-MTDP models of this scenario for each combination of observability and communication cost within the range [0,1] at 0.1 increments. For each combination, we applied the Jennings and STEAM policies, as well as a completely silent policy. For this domain, the policy $\pi_{\alpha\Sigma}^J$ dictates that Escort always communicate $\sigma_{G_R}$ upon destroying the radar. For STEAM, we vary the $\tau$ and $C_c$ parameters with the observability and communication cost parameters, respectively. We used two different settings (low and medium) for the cost of miscoordination, $C_{mt}$. Following the published STEAM algorithm [Tambe, 1997], Escort sends message $\sigma_{G_R}$ if and only if STEAM's inequality, $\tau \cdot C_{mt} > C_c$, holds. Thus, the two different settings, low and medium, for $C_{mt}$ generate two distinct communication policies; the high setting is strictly dominated by the other two settings in this domain. We also constructed and evaluated locally and globally optimal policies. In applying each of these policies, we used our COM-MTDP model to compute the expected reward received by the team when following the selected policy. We can uniquely determine this expected reward given the candidate communication policy and the particular observability and communication cost parameters, as well as the COM-MTDP model specified in Figures 2, 3, and 4.
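The experimental sweep itself can be sketched as a simple driver loop; build_helicopter_model and evaluate_policy are placeholders for the domain construction and the exact evaluation routine, not functions from the released code.

```python
def sweep(build_helicopter_model, evaluate_policy, policies, horizon):
    """Evaluate each candidate communication policy over the grid of
    observability and communication-cost settings used in the experiments."""
    results = {}
    steps = [round(0.1 * k, 1) for k in range(11)]   # 0.0, 0.1, ..., 1.0
    for observability in steps:
        for comm_cost in steps:
            model = build_helicopter_model(observability, comm_cost)
            for name, policy in policies.items():
                results[(name, observability, comm_cost)] = evaluate_policy(
                    model, policy, horizon)
    return results
```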
Figures 5 and 6 plot how much utility the team can expect to lose by following the Jennings, silent, and the two STEAM policies instead of the locally optimal communication policy (thus, higher values mean worse performance). We can immediately see that the Jennings and silent policies are significantly suboptimal for many possible domain configurations. For example, not surprisingly, the surface for the policy $\pi_{\alpha\Sigma}^J$ peaks (i.e., it does most poorly) when the communication cost is high and the observability is high, while the silent policy does poorly under exactly the opposite conditions.
Previously published results [Jennings, 1995] demonstrated that the Jennings policy led to better team performance by reducing waste of effort produced by alternate policies like our silent one. These earlier results focused on a single domain, and Figure 5 partially confirms their conclusion and shows that the superiority of the Jennings policy over the silent policy extends over a broad range of possible domain configurations. On the other hand, our COM-MTDP results also show that there is a significant subclass of domains (e.g., when communication cost and observability are high) where the Jennings policy is actually inferior to the silent policy. Thus, with our COM-MTDP model, we can characterize the types of domains where the Jennings policy outperforms the silent policy and vice versa.
Figure 6 shows the expected value lost by following the two STEAM policies. We can view STEAM as trying to intelligently interpolate between the Jennings and silent policies based on the particular domain properties. In fact, under a low setting for $C_{mt}$, we see two thresholds, one along each dimension, at which STEAM switches between following the Jennings and silent policies, and its suboptimality is highest at these thresholds. Under a medium setting for $C_{mt}$, STEAM does not exhibit a threshold along the dimension of communication cost, due to the increased cost of miscoordination. Under both settings, STEAM's performance generally follows the better of those two fixed policies, so its maximum suboptimality (0.587 under both settings) is significantly lower than that of the silent (0.700) and Jennings (1.000) policies. Furthermore, STEAM outperforms the two policies on average, across the space of domain configurations, as evidenced by its mean suboptimality of 0.063 under low $C_{mt}$ and 0.083 under medium $C_{mt}$. Both values are significantly lower than the silent policy's mean of 0.160 and the Jennings policy's mean of 0.161. Thus, we have been able to quantify the savings provided by STEAM over less selective policies within this example domain.
However, within a given domain configuration, STEAM must either always or never communicate, and this inflexibility leads to significant suboptimality across a wide range of domain configurations. On the other hand, Figure 6 also shows that there are domain configurations where STEAM is locally optimal. In this relatively small-scale experimental testbed, there is no need to incur STEAM's suboptimality, because the agents can compute the superior locally optimal policy in under 5 seconds. In larger-scale domains, on the other hand, the increased complexity of the locally optimal policies may render its execution infeasible. In such domains, STEAM's constant-time execution would potentially make it a preferable alternative. This analysis suggests a possible spectrum of algorithms that make different optimality-efficiency tradeoffs.
To understand the cause of STEAM's suboptimality, we can examine its performance more deeply in Figures 7 and 8, which plot the expected number of messages sent using STEAM (with both low and medium $C_{mt}$) vs. the locally optimal policy, at observability values of 0.3 and 0.7. STEAM's expected number of messages is either 0 or 1, so STEAM can make at most two (instantaneous) transitions between them: one threshold value each along the observability and communication cost dimensions.
From Figures 7 and 8, we see that the optimal policy can be more flexible than STEAM by making communication contingent on Escort's beliefs beyond simply the achievement of $G_R$. For example, consider the messages sent under low $C_{mt}$ in Figure 7, where STEAM matches the locally optimal policy at the extremes of the communication cost dimension. Even if the communication cost is high, it is still worth sending message $\sigma_{G_R}$ in states where Transport is still very far from the destination. Thus, the surface for the optimal policy makes a more gradual transition from always communicating to never communicating. We can thus view STEAM's surface as a crude approximation to the optimal surface, subject to STEAM's fewer degrees of freedom.
We can also use Figures 7 and 8 to identify the domain conditions under which joint intentions theory's prescription of attaining mutual belief is or is not optimal. In particular, for any domain where the observability is less than 1, the agents will not attain mutual belief without communication. In both Figures 7 and 8, there are many domain configurations where the locally optimal policy is expected to send fewer than 1 $\sigma_{G_R}$ message. Each of these configurations represents a domain where the locally optimal policy will not attain mutual belief in at least one case. Therefore, attaining mutual belief is suboptimal in those configurations!
These experiments illustrate that STEAM, despite its decision-theoretic communication selectivity, may communicate suboptimally under a significant class of domain configurations. Previous work on STEAM-based, real-world, agent-team implementations informally noted suboptimality in an isolated configuration within a more realistic helicopter transport domain [Tambe, 1997]. Unfortunately, this previous work treated that suboptimality (where the agents communicated more than necessary) as an isolated aberration, so there was no investigation of the degree of such suboptimality, nor of the conditions under which such suboptimality may occur in practice. We re-created these conditions within the experimental testbed of this section by using a medium $C_{mt}$. The resulting experiments (as shown in Figure 7) illustrated that the observed suboptimality was not an isolated phenomenon, but, in fact, that STEAM has a general propensity towards extraneous communication in situations involving low observability (i.e., low likelihood of mutual belief) and high communication costs. This result matches the situation where the ``aberration'' occurred in the more realistic domain.
The locally optimal policy is itself suboptimal with respect to the globally optimal policy, as we can see from Figure 9. Under domain configurations with high observability, the globally optimal policy has the escort wait an additional time step after destroying the radar and then communicate only if the transport continues flying nap-of-the-earth. The escort cannot directly observe which method of flight the transport has chosen, but it can measure the change in the transport's position (since it maintains a history of its past observations) and thus infer the method of flight with complete accuracy. In a sense, the escort following the globally optimal policy is performing plan recognition to analyze the transport's possible beliefs. It is particularly noteworthy that our domain specification does not explicitly encode this recognition capability. In fact, our algorithm for finding the globally optimal policy does not even make any of the assumptions made by our locally optimal policy (i.e., that a single agent is deciding whether or not to communicate a single message at a single point in time); rather, our general-purpose search algorithm traverses the policy space and ``discovers'' this possible means of inference on its own. We expect that such COM-MTDP analysis can provide an automatic method for discovering novel communication policies of this type in other domains, even those modeling real-world problems.
Indeed, by exploiting this discovery capability within our example domain, the globally optimal policy gains a slight advantage in expected utility over the locally optimal policy, with a mean difference of 0.011, standard deviation of 0.027, and maximum of 0.120. On the other hand, our domain-independent code never requires more than 5 seconds to compute the locally optimal policy in this testbed, while our domain-independent search algorithm always required more than 150 minutes to find the globally optimal policy. Thus, through Theorem 7, we have used the COM-MTDP model to construct a communication policy that, for this testbed domain, performs almost optimally and outperforms existing teamwork theories, with a substantial computational savings over finding the globally optimal policy. Although these results hold for an isolated communication decision, we expect the relative performance of the policies to stay the same even with multiple decisions, where the inflexibility of the suboptimal policies will only exacerbate their losses (i.e., the shapes of the graphs would stay roughly the same, but the suboptimality magnitudes would increase).
The COM-MTDP framework provides a general methodology for analysis across both general domain subclasses and specific domain instantiations. As demonstrated in Section 4, we can express important existing teamwork theories within a COM-MTDP framework and derive broadly applicable theoretical results about their optimality. Section 5 demonstrates our methodology for the analysis of a specific domain. By encoding a teamwork problem as a COM-MTDP, we can use the leverage of our general-purpose software tools (available in Online Appendix 1) to evaluate the optimality of teamwork based on potentially any other existing theory, as demonstrated in this paper using two leading instantiations of joint intentions theory. In combining both theory and practice, we can use the theoretical results derived using the COM-MTDP framework as the basis for new algorithms to extend our software tools, just as we did in translating Theorem 7 from Section 4 into an implemented algorithm for locally optimal communication in Section 5. We expect that the COM-MTDP framework, the theorems and complexity results, and the reusable software will form a basis for further analysis of teamwork, both by ourselves and others in the field.
While our initial COM-MTDP results are promising, there remain at least three key areas where future progress on COM-MTDPs is critical. First, analysis using COM-MTDPs (such as the one presented in Section 5) requires knowledge of the rewards, transition probabilities, and observation probabilities, as well as of the competing policies governing agent behavior. It may not always be possible to have such a model of the domain and agents' policies readily available. Indeed, other proposed team-analysis techniques [Nair, Tambe, Marsella, & Raines, 2002b,Raines, Tambe, & Marsella, 2000] do not require a priori hand-coding of such models, but rather acquire them automatically through machine learning over large numbers of runs. Also, in the interests of combating computational complexity and improving understandability, some researchers emphasize the need for multiple models at multiple levels of abstraction, rather than focusing on a single model [Nair, Tambe, Marsella, & Raines, 2002b]. For instance, one level of the model may focus on the analysis of the individual agents' actions in support of a team, while another level may focus on interactions among subteams of a team. We can potentially extend the COM-MTDP model in both of these directions (i.e., machine learning of model parameters, and hierarchical representations of the team to provide multiple levels of abstraction).
Second, it is important to extend COM-MTDP analysis to other aspects of teamwork beyond communication. For instance, team formation (where agents may be assigned specific roles within the team) and reformation (where failure of individual agents leads to role reassignment within the team) are key problems in teamwork that appear suitable for COM-MTDP analysis. Such analysis may require extensions to the COM-MTDP framework (e.g., explicit modeling of roles). Ongoing research [Nair, Tambe, & Marsella, 2002a] has begun investigating the impact of such extensions and their applications in domains such as RoboCup Rescue [Kitano, Tadokoro, Noda, Matsubara, Takahashi, Shinjoh, & Shimada, 1999]. Analysis of more complex team behaviors may require further extensions to the COM-MTDP model to explicitly account for additional aspects of teamwork (e.g., notions of authority structure within teams).
Third, extending COM-MTDP analysis beyond teamwork to model other types of coordination may require relaxation of COM-MTDP's assumption of selfless agents receiving the same joint reward. More complex organizations may require modeling other non-joint rewards. Indeed, enriching the COM-MTDP model in this manner may enable analysis of some of the seminal work in multiagent coordination in the tradition of PGP and GPGP [Decker & Lesser, 1995,Durfee & Lesser, 1991]. Such enriched models may first require new advances in the mathematical foundations of our COM-MTDP framework, and ultimately contribute towards the emerging sciences of agents and multiagent systems.