We simulate this situation by placing a ball and a stationary car acting as the ``teammate'' at specific places on the field. We then place another car, the ``defender,'' in front of the goal. The defender moves at some speed along a small circle in front of the goal, beginning at a random point on this circle. The learning agent (``you'') must take one of two possible actions: shoot straight towards the goal, or pass to the teammate so that the ball rebounds towards the goal. A snapshot of the experimental setup is shown in Figure 1. A minimal sketch of how one trial could be initialized appears below.
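For concreteness, the following sketch shows one way a single trial could be set up in simulation. It is an illustration only: the constants \texttt{CIRCLE\_RADIUS} and \texttt{DEFENDER\_SPEED} and all of the names here are hypothetical, not parameters of our simulator.

\begin{verbatim}
import math
import random

CIRCLE_RADIUS = 1.0    # hypothetical radius of the defender's circle
DEFENDER_SPEED = 0.05  # hypothetical angular step per time step (radians)

ACTIONS = ("shoot", "pass")  # the learning agent's two possible actions

def start_episode():
    """Begin one trial: the defender starts at a random angle on its circle."""
    return random.uniform(0.0, 2.0 * math.pi)

def step_defender(theta):
    """Advance the defender one time step around its circle."""
    return (theta + DEFENDER_SPEED) % (2.0 * math.pi)
\end{verbatim}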
The task is essentially to learn two functions, each with one continuous input variable, namely the defender's position. Based on this position, which can be represented unambiguously as the angle at which the defender is facing, $\theta$, the agent tries to learn the probability of scoring when shooting, $P_s(\theta)$, and the probability of scoring when passing, $P_p(\theta)$. If these functions were learned completely, which would only be possible if the defender's motion were deterministic, then both functions would be binary partitions: $\forall \theta,\; P_s(\theta), P_p(\theta) \in \{0, 1\}$. That is, the agent would know without doubt, for any given $\theta$, whether a shot, a pass, both, or neither would achieve its goal. However, since the agent cannot have had experience for every possible $\theta$, and since the defender may not move at the same speed each time, the learned functions must be approximations: $\hat{P}_s(\theta), \hat{P}_p(\theta) \in [0, 1]$.
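One simple form such an approximation could take, shown purely for illustration, is a frequency estimate over past trials at nearby angles:
\[
\hat{P}_a(\theta) \;=\;
\frac{\text{successful past trials of action } a \text{ at angles near } \theta}
     {\text{all past trials of action } a \text{ at angles near } \theta},
\qquad a \in \{s, p\}.
\]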
In order to enable the agent to learn approximations to the functions $P_s$ and $P_p$, we gave it a memory in which it could store its experiences and from which it could retrieve its current approximations $\hat{P}_s(\theta)$ and $\hat{P}_p(\theta)$. We explored and developed appropriate methods of storing to and retrieving from this memory, as well as an algorithm for deciding what action to take based on the retrieved values.
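As a minimal sketch of this idea (not the storing and retrieving methods developed in this paper; the bin count, the default value for unvisited bins, and the tie-breaking rule are all illustrative assumptions), such a memory could discretize $\theta$ into bins and keep per-bin success counts for each action:

\begin{verbatim}
import math
from collections import defaultdict

NUM_BINS = 36  # hypothetical discretization of the defender's angle

def bin_of(theta):
    """Map a continuous angle in [0, 2*pi) to a discrete memory bin."""
    return int(theta / (2.0 * math.pi) * NUM_BINS) % NUM_BINS

class Memory:
    """Stores (theta, action, outcome) experiences as per-bin counts."""
    def __init__(self):
        self.successes = defaultdict(int)  # (bin, action) -> goals scored
        self.trials = defaultdict(int)     # (bin, action) -> attempts

    def store(self, theta, action, scored):
        key = (bin_of(theta), action)
        self.trials[key] += 1
        self.successes[key] += int(scored)

    def retrieve(self, theta, action):
        """Current approximation of P(score | theta, action)."""
        key = (bin_of(theta), action)
        if self.trials[key] == 0:
            return 0.5  # no experience yet at this angle for this action
        return self.successes[key] / self.trials[key]

def choose_action(memory, theta):
    """Take the action whose retrieved success estimate is higher."""
    if memory.retrieve(theta, "shoot") >= memory.retrieve(theta, "pass"):
        return "shoot"
    return "pass"
\end{verbatim}

With a memory of this general kind, the agent alternates between acting on its retrieved estimates and storing the observed outcome, sharpening $\hat{P}_s$ and $\hat{P}_p$ in the regions of $\theta$ where it has experience.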