When linear generalization from the input features does not suffice to produce good performance, the cost of generating training data is likely to make STAGE inferior to non-learning optimization algorithms on any given problem instance. However, if the input features can be made suitably instance-independent, then the cost of training an evaluation function can be amortized over the whole family of applicable instances. This was exactly the tradeoff found to be profitable in [Zhang 1996], as discussed in Section 3.2.
To investigate how problem-dependent the learned evaluation functions are in channel routing, I re-ran experiment (G) on a suite of eight problems from the literature [Chao and Harper 1996]. Table 3 summarizes the results and gives the coefficients of the linear evaluation function learned (independently) for each problem.
Table 3: Results and learned coefficients on the benchmark suite of [Chao and Harper 1996].

| Problem instance | Lower bound | Best hillclimbing | Best STAGE | Learned coefficients <1, U, p, w> |
| --- | --- | --- | --- | --- |
| YK4 | 10 | 41 | 14 | <-7.55, -141.20, 7.94, 141.70> |
| HYC1 | 8 | 8 | 8 | <-0.19, -2.12, 0.86, 2.36> |
| HYC2 | 9 | 9 | 9 | <0.05, -2.78, 0.34, 2.66> |
| HYC3 | 11 | 14 | 12 | <-0.09, -5.06, 0.38, 5.02> |
| HYC4 | 20 | 49 | 30 | <-3.96, -49.62, 0.33, 49.14> |
| HYC5 | 35 | 47 | 41 | <-1.68, -7.30, 1.18, 7.08> |
| HYC6 | 50 | 75 | 59 | <-1.38, -9.59, 0.72, 9.48> |
| HYC7 | 39 | 87 | 83 | <-5.35, -49.26, -1.38, 48.03> |
| HYC8 | 21 | 55 | 33 | <-0.79, -32.96, 0.11, 32.78> |
The similarities among the learned evaluation functions are striking. Like the hand-tuned cost function C of [Wong et al. 1988] (Equation 7), all the STAGE-learned cost functions assigned a relatively small positive weight to feature p, except on instance HYC7 (the instance on which STAGE's learning helped least). Unlike function C, they all assigned a large negative weight to feature U. The behavior of a linear evaluation function is invariant under translation and positive rescaling of its coefficients, and after rescaling each vector so that the weight on w equals 1, the learned functions are nearly identical; this suggests that transfer between problem instances will be fruitful. As a first test of transfer, I ran STAGE on HYC7 using the value function learned from HYC6. This did indeed help, producing a solution of size 71 instead of 83.
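To make the similarity concrete, the following sketch rescales each coefficient vector from Table 3 so that the weight on w equals 1; the choice of normalization is mine, but since any positive rescaling leaves hillclimbing behavior unchanged, it is harmless. After rescaling, the weight on U lands within about 5% of -1 on every instance.

```python
# Rescale each learned coefficient vector from Table 3 so that the weight
# on w equals 1. Hillclimbing behavior is unchanged by positive rescaling,
# so vectors that agree after rescaling induce essentially the same search.
coeffs = {  # instance: (1, U, p, w) weights, copied from Table 3
    "YK4":  (-7.55, -141.20,  7.94, 141.70),
    "HYC1": (-0.19,   -2.12,  0.86,   2.36),
    "HYC2": ( 0.05,   -2.78,  0.34,   2.66),
    "HYC3": (-0.09,   -5.06,  0.38,   5.02),
    "HYC4": (-3.96,  -49.62,  0.33,  49.14),
    "HYC5": (-1.68,   -7.30,  1.18,   7.08),
    "HYC6": (-1.38,   -9.59,  0.72,   9.48),
    "HYC7": (-5.35,  -49.26, -1.38,  48.03),
    "HYC8": (-0.79,  -32.96,  0.11,  32.78),
}

for name, (bias, u, p, w) in coeffs.items():
    print(f"{name:5s} <{bias / w:+.3f}, {u / w:+.3f}, {p / w:+.3f}, +1.000>")
```

For example, YK4 rescales to approximately <-0.053, -0.997, +0.056, +1.000> and HYC6 to <-0.146, -1.012, +0.076, +1.000>: up to scale, these are very nearly the same function.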
The assignment of a negative coefficient to U is surprising, because U measures the sparsity of the horizontal tracks. U is strongly positively correlated with the objective function to be minimized, so a term of -U in the evaluation function ought to pull the search toward terrible solutions in which each subnet occupies its own track. However, the positive coefficient on w counterbalances this pull, and a proper balance between the two terms can be shown to bias the search toward solutions with an uneven distribution of track sparsities. Although unevenness is not itself the mark of a high-quality solution, it does help lead hillclimbing to high-quality solutions. STAGE successfully discovered and exploited this predictive combination of features.
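A toy calculation illustrates the effect. The sketch below assumes, purely for illustration, that U is a sum of squared per-track sparsities (so U is largest when slack is concentrated in a few tracks) and uses an evaluation of the form w - U; the paper's exact feature definitions and the learned magnitudes in Table 3 differ, but the signs match.

```python
# Toy illustration (the definitions of u_i and U here are assumptions for
# the sake of the example, not the paper's exact feature definitions):
#   w   = number of tracks in use
#   u_i = unused fraction of track i (that track's "sparsity")
#   U   = sum of u_i**2, which grows when slack concentrates in few tracks
# Evaluation with the learned signs (unit magnitudes, for illustration):
def evaluation(track_sparsity):
    w = len(track_sparsity)
    U = sum(u * u for u in track_sparsity)
    return w - U  # positive weight on w, negative weight on U

even   = [0.25, 0.25, 0.25, 0.25]   # 1.0 total slack, spread evenly
uneven = [0.97, 0.01, 0.01, 0.01]   # 1.0 total slack, concentrated

print(evaluation(even))    # 4 - 0.2500 = 3.7500
print(evaluation(uneven))  # 4 - 0.9412 = 3.0588  (lower, so preferred)
```

Both states occupy four tracks and contain the same total slack, but the evaluation prefers the concentrated one; from there, a single move that empties the nearly vacant track lets it be deleted, reducing w outright. Uneven sparsity is thus not itself quality, but it sits on hillclimbing's path to genuinely better solutions.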