Williams [131, 132] studied the problem of choosing actions to maximize immediate reward. He identified a broad class of update rules that perform gradient descent on the expected reward and showed how to integrate these rules with backpropagation. This class, called REINFORCE algorithms, includes linear reward-inaction (Section 2.1.3) as a special case.
The generic REINFORCE update for a parameter $w_{ij}$ can be written
$$
\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij})\,\frac{\partial \ln g_i}{\partial w_{ij}},
$$
where $\alpha_{ij}$ is a non-negative factor, $r$ the current reinforcement, $b_{ij}$ a reinforcement baseline, and $g_i$ is the probability density function used to randomly generate actions based on unit activations. Both $\alpha_{ij}$ and $b_{ij}$ can take on different values for each $w_{ij}$; however, when $\alpha_{ij}$ is constant throughout the system, the expected update is exactly in the direction of the expected reward gradient. Otherwise, the update is in the same half space as the gradient but not necessarily in the direction of steepest increase.
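As a concrete illustration, the sketch below applies this update to a single Bernoulli-logistic unit (the standard example in Williams' papers), using a constant baseline $b_{ij} = 0$ and a fixed learning rate. The function names, the fixed input, and the toy reward are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(w, x):
    """Sample y ~ Bernoulli(p), where p = sigmoid(w . x)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    y = float(rng.random() < p)
    return y, p

def reinforce_update(w, x, y, p, r, b, alpha):
    """REINFORCE step: delta_w = alpha * (r - b) * d ln g / dw.

    For a Bernoulli-logistic unit, d ln g / dw = (y - p) * x.
    """
    eligibility = (y - p) * x
    return w + alpha * (r - b) * eligibility

# Toy usage: reward 1 whenever the unit outputs 1, so the unit
# should learn to push p toward 1 for this fixed input.
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.5])
for _ in range(500):
    y, p = sample_action(w, x)
    r = 1.0 if y == 1.0 else 0.0
    w = reinforce_update(w, x, y, p, r, b=0.0, alpha=0.1)
```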
Williams points out that the choice of baseline, $b_{ij}$, can have a profound effect on the convergence speed of the algorithm.
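For instance, continuing the sketch above, one commonly used baseline (an assumption here, not prescribed by the text) is a running average of past reinforcement, so that updates are driven by whether the current reward is better or worse than recent experience:

```python
# Hypothetical adaptive baseline: an exponential moving average of reward.
# Reuses sample_action / reinforce_update and x from the sketch above.
w = np.zeros(3)
b, beta = 0.0, 0.9
for _ in range(500):
    y, p = sample_action(w, x)
    r = 1.0 if y == 1.0 else 0.0
    w = reinforce_update(w, x, y, p, r, b=b, alpha=0.1)
    b = beta * b + (1.0 - beta) * r   # track recent average reward
```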