We begin with the most conceptually straightforward method: first, learn the T and R functions by exploring the environment and keeping statistics about the results of each action; next, compute an optimal policy using one of the methods of Section 3. This method is known as certainty equivalence [57].
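To make the two phases concrete, here is a minimal sketch in Python/NumPy (the array layout, the uniform guess for untried state-action pairs, and the function names are illustrative assumptions, not taken from the paper): maximum-likelihood estimates of T and R are built from visit statistics, and then value iteration, one of the Section 3 methods, is run on the estimated model as if it were exact.

```python
import numpy as np

def estimate_model(counts, reward_sums):
    """Maximum-likelihood estimates of T and R from visit statistics.

    counts[s, a, s'] = number of observed transitions s --a--> s'
    reward_sums[s, a] = total reward received when taking a in s
    (Array layout and the uniform prior for untried pairs are assumptions.)
    """
    n_states, n_actions, _ = counts.shape
    visits = counts.sum(axis=2)                        # times each (s, a) was tried
    T_hat = np.where(visits[:, :, None] > 0,
                     counts / np.maximum(visits, 1)[:, :, None],
                     1.0 / n_states)                   # untried pairs: uniform guess
    R_hat = reward_sums / np.maximum(visits, 1)        # average observed reward
    return T_hat, R_hat

def value_iteration(T_hat, R_hat, gamma=0.95, tol=1e-6):
    """Compute an optimal policy for the *estimated* model (a Section 3 method)."""
    n_states, n_actions, _ = T_hat.shape
    V = np.zeros(n_states)
    while True:
        Q = R_hat + gamma * (T_hat @ V)                # Q[s, a] under the learned model
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1), V                         # greedy policy and value function
```

The key point is that the planner treats T_hat and R_hat as if they were the true dynamics, which is exactly what the name "certainty equivalence" refers to.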
There are some serious objections to this method: it makes an arbitrary division between the learning phase and the acting phase; it leaves open how the agent should gather its initial data, since random exploration may be dangerous and, in some environments, extremely inefficient (see the figure below); and, if the environment changes after learning, a controller computed once from the learned model cannot respond to the new dynamics.
Figure: In this environment, due to Whitehead [130], random exploration would take exponentially many steps to reach the goal even once, whereas a more intelligent exploration strategy (e.g. ``assume any untried action leads directly to goal'') would require only a number of steps polynomial in the size of the state space.
A variation on this idea applies certainty equivalence online: the model is learned continually through the agent's lifetime and, at each step, the current model is used to compute an optimal policy and value function. This method makes very effective use of available data, but it still ignores the question of exploration and is extremely computationally demanding, even for fairly small state spaces. Fortunately, there are a number of other model-based algorithms that are more practical.
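A hedged sketch of this online variant, reusing the helper functions from the sketch above (the gym-like env.reset()/env.step() interface is a hypothetical placeholder): after every real transition the statistics are updated, the model is re-estimated, and a full planning sweep is re-run, which is what makes the approach both data-efficient and computationally heavy.

```python
import numpy as np

def certainty_equivalent_agent(env, n_states, n_actions, n_steps=10_000, gamma=0.95):
    """Online certainty equivalence: re-plan on the current model at every step.

    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done),
    a hypothetical interface used purely for illustration.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        T_hat, R_hat = estimate_model(counts, reward_sums)
        policy, _ = value_iteration(T_hat, R_hat, gamma)   # expensive: full solve every step
        a = policy[s]                                      # note: no explicit exploration
        s_next, r, done = env.step(a)
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
        s = env.reset() if done else s_next
    return policy
```

The per-step cost of the full value-iteration solve, and the fact that the greedy policy never deliberately tries untried actions, correspond to the two weaknesses noted above: computational demand and the unaddressed exploration problem.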