The criteria given in the previous section can be used to assess the policies learned by a given algorithm. We would also like to be able to evaluate the quality of learning itself. There are several incompatible measures in use.
It should be noted that here we have another difference between reinforcement learning and conventional supervised learning. In the latter, expected future predictive accuracy and statistical efficiency are the prime concerns. For example, in the well-known PAC framework [127], there is a learning period during which mistakes do not count, followed by a performance period during which they do. The framework provides bounds on the length of the learning period needed to obtain a probabilistic guarantee on subsequent performance. That is usually an inappropriate view for an agent with a long existence in a complex environment.
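For a sense of what such bounds look like, consider the standard finite-hypothesis-class case, a well-known bound from the PAC literature quoted here only for illustration rather than taken from [127]: if a learner outputs any hypothesis from a finite class H that is consistent with

\[
  m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
\]

training examples, then with probability at least 1 - \delta that hypothesis has error at most \epsilon. The guarantee concerns only performance after the learning period ends, which is precisely the feature that fits poorly with an embedded agent.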
In spite of the mismatch between embedded reinforcement learning and the train/test perspective, Fiechter [39] provides a PAC analysis for Q-learning (described in Section 4.2) that sheds some light on the connection between the two views.
Measures related to speed of learning have an additional weakness. An algorithm that merely tries to achieve optimality as fast as possible may incur unnecessarily large penalties during the learning period. A less aggressive strategy that takes longer to reach optimality but gains greater total reinforcement during learning might be preferable.
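To make the trade-off concrete, one might compare two learners on both counts: how quickly the learner comes to prefer the optimal action, and how much total reinforcement it collects while learning. The following sketch is a toy two-armed bandit with epsilon-greedy exploration; all parameters, names, and the choice of bandit task are hypothetical choices made for illustration, not something taken from the survey.

    import random

    def run_epsilon_greedy(epsilon, horizon=2000, arm_means=(0.4, 0.6), seed=0):
        """Epsilon-greedy on a two-armed Bernoulli bandit (arm 1 has the higher mean).
        Returns (total reward collected, last step at which the greedy choice
        was still the worse arm)."""
        rng = random.Random(seed)
        counts, values = [0, 0], [0.0, 0.0]    # pull counts and running mean reward per arm
        total, last_wrong = 0.0, 0
        for t in range(1, horizon + 1):
            greedy = 0 if values[0] >= values[1] else 1
            if greedy != 1:                    # the learner's preferred arm is still suboptimal
                last_wrong = t
            arm = rng.randrange(2) if rng.random() < epsilon else greedy
            reward = 1.0 if rng.random() < arm_means[arm] else 0.0
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]
            total += reward
        return total, last_wrong

    for eps in (0.5, 0.05):                    # aggressive vs. cautious exploration
        runs = [run_epsilon_greedy(eps, seed=s) for s in range(20)]
        avg_reward = sum(r[0] for r in runs) / len(runs)
        avg_settle = sum(r[1] for r in runs) / len(runs)
        print("epsilon=%.2f: average total reward %.0f, greedy choice last wrong near step %.0f"
              % (eps, avg_reward, avg_settle))

In runs of this kind the aggressive explorer typically comes to prefer the better arm sooner, while the cautious one typically ends the horizon with more accumulated reward; which learner counts as better depends entirely on which measure is adopted.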