We begin with the most conceptually straightforward method: first, learn the T and R functions by exploring the environment and keeping statistics about the results of each action; next, compute an optimal policy using one of the methods of Section 3. This method is known as certainty equivalence [57].
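To make the two phases concrete, here is a minimal sketch in Python/NumPy (the array layout, the uniform guess for untried state-action pairs, and the function names are illustrative assumptions, not taken from the paper): maximum-likelihood estimates of T and R are built from visit statistics, and then value iteration, one of the Section 3 methods, is run on the estimated model as if it were exact.

```python
import numpy as np

def estimate_model(counts, reward_sums):
    """Maximum-likelihood estimates of T and R from visit statistics.

    counts[s, a, s'] = number of observed transitions s --a--> s'
    reward_sums[s, a] = total reward received when taking a in s
    (Array layout and the uniform prior for untried pairs are assumptions.)
    """
    n_states, n_actions, _ = counts.shape
    visits = counts.sum(axis=2)                        # times each (s, a) was tried
    T_hat = np.where(visits[:, :, None] > 0,
                     counts / np.maximum(visits, 1)[:, :, None],
                     1.0 / n_states)                   # untried pairs: uniform guess
    R_hat = reward_sums / np.maximum(visits, 1)        # average observed reward
    return T_hat, R_hat

def value_iteration(T_hat, R_hat, gamma=0.95, tol=1e-6):
    """Compute an optimal policy for the *estimated* model (a Section 3 method)."""
    n_states, n_actions, _ = T_hat.shape
    V = np.zeros(n_states)
    while True:
        Q = R_hat + gamma * (T_hat @ V)                # Q[s, a] under the learned model
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1), V                         # greedy policy and value function
```

The key point is that the planner treats T_hat and R_hat as if they were the true dynamics, which is exactly what the name "certainty equivalence" refers to.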
There are some serious objections to this method: it makes an arbitrary division between the learning phase and the acting phase; it leaves open how the agent should gather its initial data, since random exploration may be dangerous and, in some environments, extremely inefficient (see the figure below); and, if the environment changes after learning, a controller computed once from the learned model cannot respond to the new dynamics.
Figure: In this environment, due to Whitehead [130], random exploration would take exponentially many steps to reach the goal even once, whereas a more intelligent exploration strategy (e.g. ``assume any untried action leads directly to goal'') would require only a number of steps polynomial in the size of the state space.
A variation on this idea applies certainty equivalence online: the model is learned continually through the agent's lifetime and, at each step, the current model is used to compute an optimal policy and value function. This method makes very effective use of available data, but it still ignores the question of exploration and is extremely computationally demanding, even for fairly small state spaces. Fortunately, there are a number of other model-based algorithms that are more practical.
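A hedged sketch of this online variant, reusing the helper functions from the sketch above (the gym-like env.reset()/env.step() interface is a hypothetical placeholder): after every real transition the statistics are updated, the model is re-estimated, and a full planning sweep is re-run, which is what makes the approach both data-efficient and computationally heavy.

```python
import numpy as np

def certainty_equivalent_agent(env, n_states, n_actions, n_steps=10_000, gamma=0.95):
    """Online certainty equivalence: re-plan on the current model at every step.

    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done),
    a hypothetical interface used purely for illustration.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        T_hat, R_hat = estimate_model(counts, reward_sums)
        policy, _ = value_iteration(T_hat, R_hat, gamma)   # expensive: full solve every step
        a = policy[s]                                      # note: no explicit exploration
        s_next, r, done = env.step(a)
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
        s = env.reset() if done else s_next
    return policy
```

The per-step cost of the full value-iteration solve, and the fact that the greedy policy never deliberately tries untried actions, correspond to the two weaknesses noted above: computational demand and the unaddressed exploration problem.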