A standard reinforcement learning algorithm, applied to a series of related tasks, could learn each new task independently, ignoring what the tasks have in common. It requires only knowledge of its present state and infrequent numerical rewards to learn the actions necessary to bring a system to some desired goal state. But this very paucity of knowledge results in a slow learning rate. This paper shows how to exploit the results of prior learning to speed up the process while maintaining the robustness of the general learning method.
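To fix ideas, the sketch below shows the kind of basic tabular Q-learning loop such a baseline learner might use, relying only on the current state and a sparse reward signal. The environment interface (reset, step, actions) and the parameter values are assumptions made for illustration; they are not the implementation used in this paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Learn a state-action value table from the current state and sparse rewards alone."""
    Q = defaultdict(float)                               # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice using only the current state
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # one-step temporal-difference update toward the best next value
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```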
The system proposed here achieves much of its power by transferring parts of previously learned solutions, rather than a single complete solution. The solution pieces represent knowledge about how to solve certain subtasks. We might call them macro-actions (Precup, Sutton, & Singh, 1997), with the obvious allusion to macro-operators commonly found in Artificial Intelligence systems. The main contribution of this work is in providing a way of automatically identifying these macro-actions and mapping them to new tasks.
This work uses syntactic methods of composition much like those of symbolic planning, but the novelty is that the parts being composed are multi-dimensional real-valued functions. These functions are learned using reinforcement learning as parts of more complex functions associated with compound tasks. The efficacy of this approach is due to the composition occurring at a sufficiently abstract level, where much of the uncertainty has been removed. Each function acts much like a funnel operator (Christiansen, 1992): although individual actions may be highly uncertain, the overall result is largely predictable.
The subtasks are identified on the basis of strong features that arise in the multi-dimensional function during reinforcement learning. The features are not ``in the world'', but in the system's interaction with the world. Here, ``strong'' means that the features are stable (i.e., relatively insensitive to variations in the low level learning process) and easy to recognize and locate accurately early in the learning process. One important aspect of these features is that they largely dictate the shape of the function: if the features differ by a small amount, one would expect the function to differ by a small amount.
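As a concrete illustration, one simple way to locate such features is to look for steep changes in the learned value function. The gradient-magnitude detector and the threshold below are assumptions made for illustration, not the exact detector used in this work.

```python
import numpy as np

def strong_features(V, threshold=0.5):
    """Return grid positions where the learned value function V changes steeply."""
    gy, gx = np.gradient(V)                  # finite-difference gradient of the value function
    magnitude = np.hypot(gx, gy)             # steepness of V at each state
    # keep positions above a fraction of the largest gradient (illustrative threshold)
    return np.argwhere(magnitude > threshold * magnitude.max())
```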
The features generate a partitioning of the function. A popular technique in object recognition, the snake (Suetens, Fua, & Hanson, 1992; Kass, Witkin, & Terzopoulos, 1987), is used to produce this partition. In object recognition, the snake produces a closed curve that lies along the boundary of an object, as defined by edges in an image. In this application, the snake groups together sets of features to define a region of the function. The boundary of the region is a low order polygon, demarcating an individual subtask. This process is repeated until the whole function is covered. The polygons are converted into discrete graphs, a vertex of the polygon becoming a node of the graph. Merging these graphs produces a composite graph representing the whole task.
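The sketch below illustrates this last step: each extracted polygon becomes a graph, and the graphs are merged into a composite graph. Representing vertices by their coordinates, so that shared vertices coincide when the graphs are composed, is a simplification assumed for this sketch rather than the exact procedure used here.

```python
import networkx as nx

def polygon_to_graph(polygon):
    """polygon: ordered list of (x, y) vertices; edges join consecutive vertices."""
    g = nx.Graph()
    n = len(polygon)
    for i in range(n):
        g.add_edge(polygon[i], polygon[(i + 1) % n])
    return g

def composite_graph(polygons):
    # Compose the per-subtask graphs; vertices with identical coordinates coincide.
    return nx.compose_all([polygon_to_graph(p) for p in polygons])

# Example: two adjacent rectangular regions sharing a common boundary.
regions = [[(0, 0), (4, 0), (4, 3), (0, 3)],
           [(4, 0), (8, 0), (8, 3), (4, 3)]]
G = composite_graph(regions)
```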
The composite graph is used to control the transfer by accessing a case base of previously learned functions. The case base is indexed by graphs. The relevant function is found by matching a subgraph of the composite graph against the graph acting as a case's index. The associated functions are transformed and composed to form a solution to the new task, which is used to reinitialize the lower level learning process. It is not necessary for the transfer to produce an exact solution for the new task. It is sufficient that the transferred solution is close enough to the final solution, often enough, to produce a speed-up on average. Reinforcement learning will further refine the function and quickly remove any error.
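The retrieval step can be sketched as follows: each case pairs a small index graph with a previously learned function, candidate cases are found by subgraph matching against the composite graph, and the resulting node mapping is what a transformation step would use to re-map the stored function onto the new task. The data structures and matching calls here are illustrative assumptions only.

```python
from networkx.algorithms import isomorphism

def retrieve_cases(case_base, composite):
    """case_base: list of (index_graph, learned_function) pairs."""
    matches = []
    for index_graph, func in case_base:
        matcher = isomorphism.GraphMatcher(composite, index_graph)
        if matcher.subgraph_is_isomorphic():
            # matcher.mapping gives the node correspondence a transformation step would use
            matches.append((matcher.mapping, func))
    return matches
```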
This paper demonstrates the applicability of transfer in two different situations. In the first, the system learns a task for a particular goal position and then the goal is moved. Although the function itself will change significantly, the partition generated on the initial task can be used to compose the function for the new task. In the second situation considered, the system is placed in a different environment within the same domain. Here, a new partition has to be extracted to control the composition process.
This paper unifies and significantly extends previous work by the author (Drummond, 1998; Drummond, 1997). The additional work has largely focused on removing some of the limitations inherent in the partitioning approach introduced in Drummond (1998). One limitation of the original approach was that the snake could only extract polygons that were rectangles. This paper relaxes that restriction, allowing the approach to be applied to a different environment within the same domain and to a different task domain. Although lifting the restriction removes some desirable bias, the experiments demonstrate that none of the efficacy of the original system is lost. Further, the results are obtained more broadly, on a larger set of related tasks and in a different domain. Overall, the function composition approach often produces more than an order of magnitude increase in learning rate when compared to a basic reinforcement learning algorithm.
The rest of the paper is organized as follows. Section 2 gives a high level discussion of the approach taken. Section 3 discusses the techniques used in more depth. Sections 4 and 5 present and analyze the experimental results. Subsequent sections deal with limitations and related research.