Although Dyna is a great improvement on previous methods, it suffers from being relatively undirected. It is particularly unhelpful when the goal has just been reached or when the agent is stuck in a dead end; it continues to update random state-action pairs, rather than concentrating on the ``interesting'' parts of the state space. These problems are addressed by prioritized sweeping [83] and Queue-Dyna [87], which are two independently-developed but very similar techniques. We will describe prioritized sweeping in some detail.
The algorithm is similar to Dyna, except that updates are no longer chosen at random and values are now associated with states (as in value iteration) instead of state-action pairs (as in Q-learning). To make appropriate choices, we must store additional information in the model. Each state remembers its predecessors: the states that have a non-zero transition probability to it under some action. In addition, each state has a priority, initially set to zero.
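To make this bookkeeping concrete, the following is a minimal Python sketch of the structures such an agent might maintain; the names (T, R, V, predecessors, priority, record_transition) are illustrative choices rather than anything prescribed by the original papers, and the details of estimating the model from experience are omitted.

```python
from collections import defaultdict

# Hypothetical bookkeeping for prioritized sweeping (all names are illustrative).
T = defaultdict(dict)            # T[(s, a)][s2]: estimated Pr(s2 | s, a)
R = defaultdict(float)           # R[(s, a)]:     estimated immediate reward
V = defaultdict(float)           # V[s]:          current value estimate of state s
predecessors = defaultdict(set)  # predecessors[s2]: {(s, a) with T[(s, a)][s2] > 0}
priority = defaultdict(float)    # priority[s]: one priority per state, initially zero

def record_transition(s, a, s2, prob):
    """Record an estimated transition and remember (s, a) as a predecessor of s2."""
    T[(s, a)][s2] = prob
    if prob > 0.0:
        predecessors[s2].add((s, a))
```

Storing predecessors as (state, action) pairs is one convenient choice, since the transition probability is needed again when priorities are promoted.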
Instead of updating $k$ random state-action pairs, prioritized sweeping updates the $k$ states with the highest priority. For each high-priority state $s$, it works as follows:
- remembers the current value of the state: $V_{old} = V(s)$;
- updates the state's value: $V(s) := \max_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s') V(s') \right)$;
- sets the state's priority back to 0;
- computes the value change $\Delta = |V_{old} - V(s)|$;
- uses $\Delta$ to modify the priorities of the predecessors of $s$.
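As a rough sketch of this per-state backup, assuming the hypothetical dictionaries from the previous sketch and a discount factor gamma:

```python
def backup(s, actions, V, R, T, priority, gamma=0.9):
    """One full backup of state s, following the steps listed above (sketch only)."""
    v_old = V[s]                      # remember the current value
    V[s] = max(                       # Bellman backup over the learned model
        R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
        for a in actions
    )
    priority[s] = 0.0                 # set the state's priority back to zero
    return abs(v_old - V[s])          # the value change, used to promote predecessors
```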
If we have updated the $V$ value for state $s'$ and it has changed by amount $\Delta$, then the immediate predecessors of $s'$ are informed of this event. Any state $s$ for which there exists an action $a$ such that $T(s,a,s') \neq 0$ has its priority promoted to $\Delta \cdot T(s,a,s')$, unless its priority already exceeded that value.
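The promotion rule might be implemented as in the following sketch, again reusing the hypothetical structures from above (promote_predecessors is an illustrative name):

```python
def promote_predecessors(s2, delta, T, predecessors, priority):
    """After V(s2) changes by delta, raise the priority of every state that can reach s2."""
    for (s, a) in predecessors[s2]:
        p = T[(s, a)].get(s2, 0.0)    # the estimated transition probability T(s, a, s2)
        if p > 0.0:
            # Promote to delta * T(s, a, s2), unless the priority already exceeds it.
            priority[s] = max(priority[s], delta * p)
```

A sweep then repeatedly picks the highest-priority state, backs it up, and promotes its predecessors with the resulting change.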
The global behavior of this algorithm is that when a real-world transition is ``surprising'' (the agent happens upon a goal state, for instance), then lots of computation is directed to propagate this new information back to relevant predecessor states. When the real-world transition is ``boring'' (the actual result is very similar to the predicted result), then computation continues in the most deserving part of the space.
Running prioritized sweeping on the problem in Figure 6, we see a large improvement over Dyna. The optimal policy is reached in about half the number of steps of experience and one-third of the computation that Dyna required (and therefore about 20 times fewer steps and twice the computational effort of Q-learning).