In the simplest case, the entire system learns only to optimize immediate reward. First, let us consider the behavior of the network that learns the policy, a mapping from a vector describing the state $s$ to a 0 or 1. If the output unit has activation $y_s$, then $a$, the action generated, will be 1 if $y_s + \nu > 1/2$, where $\nu$ is normal noise, and 0 otherwise.
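As a minimal sketch of this action-selection rule (the sigmoid form of the unit, the noise scale, and the symbol names follow the reconstruction above and are illustrative assumptions rather than details fixed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def activation(w, s):
    """Output-unit activation y_s for input vector s: here, a sigmoid of a weighted sum."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, s)))

def select_action(w, s, noise_std=0.1):
    """Generate a = 1 if y_s + nu > 1/2, where nu is zero-mean normal noise; otherwise a = 0."""
    y = activation(w, s)
    nu = rng.normal(0.0, noise_std)
    return 1 if y + nu > 0.5 else 0
```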
The adjustment for the output unit is, in the simplest case,
\[
\Delta = r\left(a - \tfrac{1}{2}\right),
\]
where the first factor is the reward received for taking the most recent action and the second encodes which action was taken. The actions are encoded as 0 and 1, so $a - 1/2$ always has the same magnitude; if the reward and $a - 1/2$ have the same sign, the adjustment is positive and action 1 will be made more likely, otherwise action 0 will be.
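A sketch of this update for the policy unit above; the learning rate `alpha` and the use of the input vector `s` as the credit-assignment term are standard extras assumed here, not given in the text:

```python
def update_policy(w, s, a, r, alpha=0.1):
    """Simplest-case update: the adjustment r*(a - 1/2) pushes the unit toward the
    action just taken when reward is positive, and away from it when negative."""
    adjustment = r * (a - 0.5)           # the two factors named in the text
    return w + alpha * adjustment * s    # alpha and the input factor s are assumptions
```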
As described, the network will tend to seek actions that give positive reward. To extend this approach to maximize reward, we can compare the reward to some baseline, $b$. This changes the adjustment to
\[
\Delta = (r - b)\left(a - \tfrac{1}{2}\right),
\]
where $b$ is the output of the second network. The second network is trained in a standard supervised mode to estimate $r$ as a function of the input state $s$.
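Putting the pieces together, one interaction step might look like the following sketch. The linear reward-predictor, its least-mean-squares update, and the `reward_fn` interface are illustrative assumptions standing in for "a second network trained in a standard supervised mode":

```python
import numpy as np

rng = np.random.default_rng(1)

def train_step(w, v, s, reward_fn, alpha=0.1, beta=0.1, noise_std=0.1):
    """One interaction with immediate reward: act, observe r, update both networks.

    w         -- weights of the policy unit
    v         -- weights of the baseline (reward-predictor) network
    reward_fn -- callable returning the immediate reward r(s, a); hypothetical interface
    """
    y = 1.0 / (1.0 + np.exp(-np.dot(w, s)))                # policy-unit activation y_s
    a = 1 if y + rng.normal(0.0, noise_std) > 0.5 else 0   # noisy thresholding
    r = reward_fn(s, a)

    b = np.dot(v, s)                          # baseline: predicted reward for state s
    w = w + alpha * (r - b) * (a - 0.5) * s   # policy adjustment (r - b)(a - 1/2)
    v = v + beta * (r - b) * s                # LMS step moving the prediction toward r
    return w, v
```

Subtracting the learned baseline means the policy unit is pushed toward whichever action earns more than the reward expected for that state, rather than merely toward any action that happens to yield positive reward.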
Variations of this approach have been used in a variety of applications [4, 9, 61, 114].