Sutton's Dyna architecture [116, 117] exploits a middle ground, yielding strategies that are both more effective than model-free learning and more computationally efficient than the certainty-equivalence approach. It simultaneously uses experience to build a model ($\hat{T}$ and $\hat{R}$), uses experience to adjust the policy, and uses the model to adjust the policy.
Dyna operates in a loop of interaction with the environment. Given an experience tuple $\langle s, a, s', r \rangle$, it behaves as follows:

- It uses the experience to update the model, for example by incrementing statistics for the transition from $s$ to $s'$ on action $a$ and for the receipt of reward $r$ on taking action $a$ in state $s$, giving updated estimates $\hat{T}$ and $\hat{R}$.
- It uses the experience to update the policy at state $s$ with the rule
$$Q(s,a) := \hat{R}(s,a) + \gamma \sum_{s'} \hat{T}(s,a,s') \max_{a'} Q(s',a') \, ,$$
which is a version of the value-iteration update for Q values.
- It performs $k$ additional updates: it chooses $k$ state-action pairs at random and updates each of them according to the same rule.
- It uses the Q values, perhaps modified by an exploration strategy, to choose an action to perform in state $s'$.
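To make the loop concrete, here is a minimal sketch in Python. It assumes a tabular Q function stored in a dictionary and a deterministic "last observation" model standing in for the estimated $\hat{T}$ and $\hat{R}$ (so the sum over successor states collapses); the function name, the $\epsilon$-greedy action choice, and the parameter values are illustrative, not taken from the survey.

```python
import random
from collections import defaultdict

ACTIONS = ['N', 'S', 'E', 'W']   # action set of the grid world discussed below

def dyna_q_step(Q, model, s, a, s_next, r, gamma=0.9, k=10, epsilon=0.1):
    """Process one experience tuple <s, a, s_next, r> in the Dyna style."""
    # 1. Use the experience to update the model (here a deterministic
    #    "last observation" table in place of the estimated T-hat and R-hat).
    model[(s, a)] = (s_next, r)

    # 2. Use the experience to update the policy at (s, a) with the
    #    value-iteration-style rule; the sum over successors collapses
    #    because this model is deterministic.
    Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)

    # 3. Use the model to perform k additional backups on randomly
    #    chosen, previously experienced state-action pairs.
    seen = list(model)
    for _ in range(k):
        s_k, a_k = random.choice(seen)
        s_m, r_m = model[(s_k, a_k)]
        Q[(s_k, a_k)] = r_m + gamma * max(Q[(s_m, b)] for b in ACTIONS)

    # 4. Choose an action to perform in s_next from the Q values,
    #    with a little epsilon-greedy exploration.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda b: Q[(s_next, b)])

if __name__ == "__main__":
    # Typical use: call dyna_q_step after every real step in the environment.
    Q, model = defaultdict(float), {}
    next_action = dyna_q_step(Q, model, s=(0, 0), a='E', s_next=(1, 0), r=0.0)
```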
The Dyna algorithm requires about k times the computation of Q-learning per instance, but this is typically vastly less than for the naive model-based method. A reasonable value of k can be determined based on the relative speeds of computation and of taking action.
Figure 6 shows a grid world in which, in each cell, the agent has four actions (N, S, E, W); transitions are made deterministically to the adjacent cell, unless there is a block, in which case no movement occurs. As we will see in Table 1, Dyna requires an order of magnitude fewer steps of experience than Q-learning does to arrive at an optimal policy. Dyna requires about six times more computational effort, however.
Figure 6: A 3277-state grid world. This was formulated as a shortest-path reinforcement-learning problem, which yields the same result as if a reward of 1 is given at the goal, a reward of zero elsewhere, and a discount factor less than 1 is used.
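For concreteness, the deterministic grid dynamics described above might be coded as follows; the grid size, the set of blocked cells, and the function names are placeholders, not the actual 3277-state layout of Figure 6.

```python
# Deterministic grid dynamics: move to the adjacent cell unless it is
# off the grid or blocked, in which case no movement occurs.
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

def grid_step(state, action, width=60, height=60, blocked=frozenset()):
    """Return the successor cell for (state, action); width, height and
    blocked are illustrative placeholders, not the Figure 6 layout."""
    x, y = state
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if not (0 <= nx < width and 0 <= ny < height) or (nx, ny) in blocked:
        return state  # blocked or off the grid: stay put
    return (nx, ny)

def grid_reward(next_state, goal):
    """Reward of 1 at the goal, zero elsewhere (as in the Figure 6 caption)."""
    return 1.0 if next_state == goal else 0.0
```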
Table 1: Steps of experience and backups required before convergence.

|                      | Steps before convergence | Backups before convergence |
|----------------------|--------------------------|----------------------------|
| Q-learning           | 531,000                  | 531,000                    |
| Dyna                 | 62,000                   | 3,055,000                  |
| prioritized sweeping | 28,000                   | 1,010,000                  |