The work of the two components of AHC can be accomplished in a unified manner by Watkins' Q-learning algorithm [128, 129]. Q-learning is typically easier to implement. In order to understand Q-learning, we have to develop some additional notation. Let $Q^*(s, a)$ be the expected discounted reinforcement of taking action $a$ in state $s$, then continuing by choosing actions optimally. Note that $V^*(s)$ is the value of $s$ assuming the best action is taken initially, and so $V^*(s) = \max_a Q^*(s, a)$. $Q^*(s, a)$ can hence be written recursively as
$$
Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s, a, s') \max_{a'} Q^*(s', a') \; .
$$
Note also that, since $V^*(s) = \max_a Q^*(s, a)$, we have $\pi^*(s) = \arg\max_a Q^*(s, a)$ as an optimal policy.
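To make the recursive definition concrete, here is a minimal Python sketch that computes $Q^*$ by repeatedly applying the equation above, assuming the model ($T$ and $R$) is known and stored as NumPy arrays; the function name, array layout, and default parameter values are illustrative, not from the survey:

```python
import numpy as np

def q_value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Iterate Q(s,a) <- R(s,a) + gamma * sum_s' T(s,a,s') max_a' Q(s',a').

    T : (num_states, num_actions, num_states) transition probabilities
    R : (num_states, num_actions) expected immediate rewards
    (Array names and shapes are assumptions made for this illustration.)
    """
    Q = np.zeros(R.shape)
    while True:
        # One application of the recursive definition of Q*.
        Q_new = R + gamma * T @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:   # stop when the update is negligible
            break
        Q = Q_new
    V = Q_new.max(axis=1)           # V*(s) = max_a Q*(s, a)
    policy = Q_new.argmax(axis=1)   # an optimal policy: pi*(s) = argmax_a Q*(s, a)
    return Q_new, V, policy
```

This sketch needs the transition and reward model, which Q-learning itself does not; its only purpose is to show how $V^*$ and an optimal policy fall out of $Q^*$.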
Because the Q function makes the action explicit, we can estimate the Q values on-line using a method essentially the same as TD(0), but also use them to define the policy, because an action can be chosen just by taking the one with the maximum Q value for the current state.
The Q-learning rule is
$$
Q(s, a) := Q(s, a) + \alpha \bigl( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr) \; ,
$$
where $\langle s, a, r, s' \rangle$ is an experience tuple as described earlier. If each action is executed in each state an infinite number of times on an infinite run and $\alpha$ is decayed appropriately, the Q values will converge with probability 1 to $Q^*$ [128, 125, 49]. Q-learning can also be extended to update states that occurred more than one step previously, as in TD($\lambda$) [88].
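A minimal tabular sketch of this update, assuming integer-indexed states and actions and a NumPy array for the Q table; the function name and the default values of $\alpha$ and $\gamma$ are illustrative:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning backup from the experience tuple <s, a, r, s'>."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target
    return Q
```

In practice $\alpha$ would be decayed over time, as the convergence result above requires.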
When the Q values are nearly converged to their optimal values, it is appropriate for the agent to act greedily, taking, in each situation, the action with the highest Q value. During learning, however, there is a difficult exploitation versus exploration trade-off to be made. There are no good, formally justified approaches to this problem in the general case; standard practice is to adopt one of the ad hoc methods discussed in section 2.2.
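One commonly used ad hoc strategy of the kind referred to above is $\epsilon$-greedy selection, which acts greedily except for a small probability of taking a random action; a minimal sketch using the same tabular Q representation (the value of $\epsilon$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon explore (uniform random action);
    otherwise exploit the action with the highest current Q value."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[s]))                # exploit
```

Because of the exploration insensitivity discussed below, even such a crude strategy does not compromise convergence, so long as every state-action pair continues to be tried.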
AHC architectures seem to be more difficult to work with than Q-learning on a practical level. It can be hard to get the relative learning rates right in AHC so that the two components converge together. In addition, Q-learning is exploration insensitive: that is, the Q values will converge to the optimal values independent of how the agent behaves while the data is being collected (as long as all state-action pairs are tried often enough). This means that, although the exploration-exploitation issue must be addressed in Q-learning, the details of the exploration strategy will not affect the convergence of the learning algorithm. For these reasons, Q-learning is the most popular and seems to be the most effective model-free algorithm for learning from delayed reinforcement. It does not, however, address any of the issues involved in generalizing over large state and/or action spaces. In addition, it may converge quite slowly to a good policy.