Sutton's Dyna architecture [116, 117] exploits a middle ground, yielding strategies that are both more effective than model-free learning and more computationally efficient than the certainty-equivalence approach. It simultaneously uses experience to build a model ($\hat{T}$ and $\hat{R}$), uses experience to adjust the policy, and uses the model to adjust the policy.
Dyna operates in a loop of interaction with the environment. Given an experience tuple $\langle s, a, s', r \rangle$, it behaves as follows:

- It uses the experience to update the model, for example by incrementing statistics for the transition from $s$ to $s'$ on action $a$ and for the receipt of reward $r$ on taking action $a$ in state $s$, giving updated estimates $\hat{T}$ and $\hat{R}$.
- It uses the experience to update the policy at state $s$ with the rule
$$Q(s,a) := \hat{R}(s,a) + \gamma \sum_{s'} \hat{T}(s,a,s') \max_{a'} Q(s',a') \, ,$$
which is a version of the value-iteration update for Q values.
- It performs $k$ additional updates: it chooses $k$ state-action pairs at random and updates each of them according to the same rule.
- It uses the Q values, perhaps modified by an exploration strategy, to choose an action to perform in state $s'$.
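To make the loop concrete, here is a minimal sketch in Python. It assumes a tabular Q function stored in a dictionary and a deterministic "last observation" model standing in for the estimated $\hat{T}$ and $\hat{R}$ (so the sum over successor states collapses); the function name, the $\epsilon$-greedy action choice, and the parameter values are illustrative, not taken from the survey.

```python
import random
from collections import defaultdict

ACTIONS = ['N', 'S', 'E', 'W']   # action set of the grid world discussed below

def dyna_q_step(Q, model, s, a, s_next, r, gamma=0.9, k=10, epsilon=0.1):
    """Process one experience tuple <s, a, s_next, r> in the Dyna style."""
    # 1. Use the experience to update the model (here a deterministic
    #    "last observation" table in place of the estimated T-hat and R-hat).
    model[(s, a)] = (s_next, r)

    # 2. Use the experience to update the policy at (s, a) with the
    #    value-iteration-style rule; the sum over successors collapses
    #    because this model is deterministic.
    Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)

    # 3. Use the model to perform k additional backups on randomly
    #    chosen, previously experienced state-action pairs.
    seen = list(model)
    for _ in range(k):
        s_k, a_k = random.choice(seen)
        s_m, r_m = model[(s_k, a_k)]
        Q[(s_k, a_k)] = r_m + gamma * max(Q[(s_m, b)] for b in ACTIONS)

    # 4. Choose an action to perform in s_next from the Q values,
    #    with a little epsilon-greedy exploration.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda b: Q[(s_next, b)])

if __name__ == "__main__":
    # Typical use: call dyna_q_step after every real step in the environment.
    Q, model = defaultdict(float), {}
    next_action = dyna_q_step(Q, model, s=(0, 0), a='E', s_next=(1, 0), r=0.0)
```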
The Dyna algorithm requires about k times the computation of Q-learning per instance, but this is typically vastly less than for the naive model-based method. A reasonable value of k can be determined based on the relative speeds of computation and of taking action.
Figure 6 shows a grid world in which, in each cell, the agent has four actions (N, S, E, W); transitions are made deterministically to the adjacent cell, unless there is a block, in which case no movement occurs. As we will see in Table 1, Dyna requires an order of magnitude fewer steps of experience than Q-learning does to arrive at an optimal policy. Dyna requires about six times more computational effort, however.
Figure 6: A 3277-state grid world. This was formulated as a shortest-path reinforcement-learning problem, which yields the same result as if a reward of 1 is given at the goal, a reward of zero elsewhere, and a discount factor less than 1 is used.
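For concreteness, the deterministic grid dynamics described above might be coded as follows; the grid size, the set of blocked cells, and the function names are placeholders, not the actual 3277-state layout of Figure 6.

```python
# Deterministic grid dynamics: move to the adjacent cell unless it is
# off the grid or blocked, in which case no movement occurs.
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

def grid_step(state, action, width=60, height=60, blocked=frozenset()):
    """Return the successor cell for (state, action); width, height and
    blocked are illustrative placeholders, not the Figure 6 layout."""
    x, y = state
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if not (0 <= nx < width and 0 <= ny < height) or (nx, ny) in blocked:
        return state  # blocked or off the grid: stay put
    return (nx, ny)

def grid_reward(next_state, goal):
    """Reward of 1 at the goal, zero elsewhere (as in the Figure 6 caption)."""
    return 1.0 if next_state == goal else 0.0
```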
Table 1: Steps of experience and backups required before convergence.

|                      | Steps before convergence | Backups before convergence |
|----------------------|--------------------------|----------------------------|
| Q-learning           | 531,000                  | 531,000                    |
| Dyna                 | 62,000                   | 3,055,000                  |
| prioritized sweeping | 28,000                   | 1,010,000                  |