This thesis will investigate new machine learning algorithms for generating evaluation functions automatically. The algorithms are simulation-based: they gather data by simulating trajectories through the search space, and they use the outcomes of those trajectories to improve the evaluation function. The learned evaluation function may then be applied either to guide further exploration of the same space or to improve performance in new problem spaces that share similar features.
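Before the formal development, here is a minimal, self-contained sketch of that simulate-and-improve loop on a toy domain; the chain world, constants, and names below are illustrative assumptions of the sketch, not part of the proposed algorithms.

    import random

    # A minimal sketch of the simulation-based loop described above, on a
    # toy domain: a random walk over states 0..N that ends at state 0
    # (outcome 0) or state N (outcome 1).  The evaluation function is a
    # simple lookup table; exploration here is left purely random.
    N, ALPHA = 10, 0.1
    value = [0.5] * (N + 1)          # initial evaluation of every state

    def simulate_trajectory(start):
        """Simulate one trajectory; return visited states and the outcome."""
        states, s = [], start
        while 0 < s < N:
            states.append(s)
            s += random.choice((-1, 1))
        return states, float(s == N)

    random.seed(0)
    for _ in range(5000):
        states, outcome = simulate_trajectory(N // 2)
        for s in states:
            # Improve the evaluation: nudge each visited state's value
            # toward the outcome actually observed from it.
            value[s] += ALPHA * (outcome - value[s])

    print([round(v, 2) for v in value])  # interior values approach s / N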
Two general families of learning algorithms apply here: reinforcement
learning and meta-optimization. The reinforcement learning approach,
dating back to Samuel's checkers player [Samuel, 1959] but with more
recent successes on backgammon [Tesauro, 1992] and job-shop
scheduling [Zhang and Dietterich, 1995], is based on approximating a
special evaluation function known as the value function. The value
function is defined to predict the best long-term outcome attainable
from each state; a greedy search guided by the exact value function
therefore produces an optimal result. Learning the value function in
combinatorially large domains, however, is quite challenging. The
currently popular approximation algorithms, Q-learning and TD(λ) with
neural networks, run very slowly and may even be unstable [Boyan and
Moore, 1995].
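To make the value function concrete, here is a minimal, illustrative TD(0) learner (the λ = 0 member of the TD(λ) family) with a linear approximator on a toy random-walk chain; the domain, constants, and names are assumptions of this sketch, not one of the algorithms proposed below.

    import random

    # Illustrative TD(0) update (the lambda = 0 case of TD(lambda)) with
    # a linear approximator V(s) = w0 + w1*s on a chain of states 0..N.
    # Reward is +1 on reaching state N, 0 elsewhere; no discounting.
    # The true values are V(s) = s/N, which the linear form can represent.
    N, ALPHA = 10, 0.01
    w = [0.0, 0.0]

    def V(s):
        return w[0] + w[1] * s

    random.seed(0)
    for _ in range(20000):
        s = N // 2
        while 0 < s < N:
            s2 = s + random.choice((-1, 1))
            reward = 1.0 if s2 == N else 0.0
            target = reward + (V(s2) if 0 < s2 < N else 0.0)  # terminals are worth 0
            delta = target - V(s)          # temporal-difference error
            w[0] += ALPHA * delta          # gradient step on each weight
            w[1] += ALPHA * delta * s
            s = s2

    print(round(w[0], 2), round(w[1], 2))  # expect roughly w0 = 0, w1 = 0.1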
Section 2 of this proposal reviews the literature on value
function approximation and describes an original algorithm for more
efficient learning. Section 3 focuses specifically on the
application of value function approximation to combinatorial
optimization problems, and presents another new algorithm.
The second approach, meta-optimization, is conceptually simpler: we assume a fixed parametric form for the evaluation function and optimize its parameters directly. This approach was also used by Samuel [Samuel, 1959] and has been applied in the simulated annealing community [Ochotta, 1994]. Because they lack the theoretical advantages of dynamic programming, meta-optimization methods have been largely ignored by the reinforcement-learning community; however, recent advances in local optimization [Moore and Schneider, 1996] may make them superior to reinforcement learning in practice. My research will lay out a framework for understanding both methods more formally, investigate improvements to each, and evaluate empirically which approach works better. Section 4 reviews the literature on meta-optimization and outlines the experiments I plan to carry out.
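As a minimal illustration of this direct approach, the sketch below hill-climbs the two weights of a fixed-form evaluation function, scoring each candidate weight vector by the outcomes of the greedy searches it guides on a batch of toy knapsack instances; the problem, feature set, and names are assumptions of the sketch.

    import random

    # Illustrative meta-optimization loop: the evaluation function has a
    # fixed parametric form (a weighted sum of two item features), and we
    # hill-climb its weights directly, scoring each candidate by the
    # average outcome of the searches it guides.
    random.seed(0)
    CAPACITY = 10
    # A fixed batch of training instances: lists of (weight, value) items.
    INSTANCES = [[(random.randint(1, 5), random.randint(1, 9))
                  for _ in range(8)] for _ in range(30)]

    def greedy_search(theta, items):
        """Greedily pack items in order of the learned evaluation score."""
        order = sorted(items, reverse=True,
                       key=lambda it: theta[0] * it[0] + theta[1] * it[1])
        total_w = total_v = 0
        for wt, val in order:
            if total_w + wt <= CAPACITY:
                total_w, total_v = total_w + wt, total_v + val
        return total_v

    def score(theta):
        """A candidate's quality is the average outcome of guided searches."""
        return sum(greedy_search(theta, inst) for inst in INSTANCES) / len(INSTANCES)

    best, best_score = [0.0, 0.0], score([0.0, 0.0])
    for _ in range(200):
        cand = [t + random.gauss(0, 0.3) for t in best]
        s = score(cand)
        if s > best_score:          # keep only candidates that simulate better
            best, best_score = cand, s
    print([round(t, 2) for t in best], round(best_score, 2))

The hill climber typically drives the weight on item cost negative and the weight on item value positive, recovering a value-per-cost heuristic from simulation outcomes alone.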
I will apply these techniques to all or most of the following large-scale combinatorial domains: VLSI channel routing; treatment planning for radiation therapy; scheduling a factory production line; configuring the subsystems of the NASA Outer Planet Orbiter; and move selection in the game of backgammon. These applications are described in Section 5. I will also validate my new algorithms on standard problems from the optimization and control literatures. Finally, a byproduct of this research will be a software package for public use.