``Pig'' is a two-player children's dice game. Each player starts with a total score of zero, which she increases on each turn by rolling dice; the first player to reach a total of 100 wins. On her turn, a player accumulates a subtotal by repeatedly rolling a 6-sided die. If at any time she rolls a 1, however, she loses the subtotal and gets only 1 added to her total. Thus, before each roll, she must decide whether to (a) add her currently-accumulated subtotal to her permanent total and pass the turn to the other player; or (b) continue rolling, risking an unlucky 1.
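As a concrete illustration of these turn dynamics (not code from the paper), the following minimal Python sketch simulates games of Pig. It assumes the turn ends when a 1 is rolled, and the fixed `hold_at` threshold policy is purely illustrative.

```python
import random

def play_turn(total, hold_at=20, rng=random):
    """Simulate one turn for a player whose permanent total is `total`.

    The player keeps rolling until the turn subtotal reaches `hold_at`
    (an illustrative threshold policy), a win is guaranteed, or a 1 is
    rolled, in which case only 1 point is banked, per the rules above.
    Returns the player's new permanent total.
    """
    subtotal = 0
    while total + subtotal < 100 and subtotal < hold_at:
        roll = rng.randint(1, 6)
        if roll == 1:
            return total + 1        # unlucky 1: lose the subtotal, gain only 1
        subtotal += roll
    return total + subtotal         # bank the accumulated subtotal

def play_game(hold_at_x=20, hold_at_o=20, rng=random):
    """Play one game; returns 'X' or 'O' depending on who reaches 100 first."""
    totals = {'X': 0, 'O': 0}
    thresholds = {'X': hold_at_x, 'O': hold_at_o}
    player = 'X'
    while max(totals.values()) < 100:
        totals[player] = play_turn(totals[player], thresholds[player], rng)
        player = 'O' if player == 'X' else 'X'
    return 'X' if totals['X'] >= 100 else 'O'
```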
Pig belongs to the class of symmetric, alternating Markov games. This means that the minimax-optimal value function can be formulated as the unique solution to a system of generalized Bellman equations [Littman and Szepesvári, 1996] similar to Equation 4. The state space, with two-player symmetry factored out, has 515,000 positions--large enough to be interesting, but small enough that computing the exact value function is tractable.
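Equation 4 is not reproduced in this excerpt; purely for concreteness, one way to write the corresponding minimax backup for Pig is sketched below. Here $V(i,j,k)$ denotes the value, from the mover's perspective (+1 = win, -1 = loss), of the state in which the mover has banked $i$ points, the opponent has $j$, and the current turn subtotal is $k$; the sign conventions and terminal handling are assumptions and may differ from the paper's Equation 4.

\[
V(i,j,k) \;=\; \max\Bigl\{\, \underbrace{-V(j,\, i+k,\, 0)}_{\text{hold}} \,,\;\; \underbrace{\tfrac{1}{6}\bigl(-V(j,\, i+1,\, 0)\bigr) \;+\; \tfrac{1}{6}\sum_{r=2}^{6} V(i,\, j,\, k+r)}_{\text{roll}} \Bigr\},
\qquad V(i,j,k) = +1 \ \text{ if } i+k \ge 100 .
\]

The negated opponent terms are what encode the two-player symmetry that allows the state space to be factored as described.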
For input to the function approximator, we represent states by their natural 3-dimensional feature vector: X's total, O's total, and X's current subtotal. The approximator is a standard MLP with two hidden units. In ROUT, the network is retrained to convergence (at most 1000 epochs) each time the training set is augmented. This extra retraining cost of ROUT is not reflected in the results table; in practical applications, ROUT would be paired with a far faster approximator than backprop.
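As a sketch of what the approximator setup and the retrain-to-convergence step might look like, the code below uses scikit-learn's MLPRegressor as a stand-in for the backprop network; the feature scaling, logistic activation, and batch solver are assumptions for illustration, not details from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_features(x_total, o_total, x_subtotal):
    """Natural 3-D feature representation of a Pig state (scaled to [0, 1] -- an assumption)."""
    return np.array([x_total, o_total, x_subtotal], dtype=float) / 100.0

def retrain(training_states, training_values):
    """Retrain the MLP to convergence (at most 1000 epochs) on the current
    training set, as is done each time the set is augmented."""
    net = MLPRegressor(hidden_layer_sizes=(2,),   # two hidden units, as in the text
                       activation='logistic',      # assumption: sigmoid hidden units
                       solver='lbfgs',             # batch training to convergence
                       max_iter=1000)
    X = np.stack([make_features(*s) for s in training_states])
    y = np.asarray(training_values, dtype=float)
    net.fit(X, y)
    return net

# Illustrative usage: the training set grows over time and the network is
# retrained after each augmentation (the hypothetical entries below are
# placeholders, not data from the paper).
training_states, training_values = [(0, 0, 0)], [0.0]
net = retrain(training_states, training_values)
training_states.append((50, 30, 12)); training_values.append(0.4)
net = retrain(training_states, training_values)
```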
| Problem | Method | # training samples | Total evaluations | RMS Bellman | RMS | Policy Quality |
|---------|--------|--------------------|-------------------|-------------|-----|----------------|
| HOP  | Discrete           | 12      | 21      | 0    | 0    | -24    |
|      | ROUT               | 4       | 158     | 0.   | 0.   | -24    |
|      | TD(0)              | 5000    | 10,000  | 0.03 | 0.1  | -24    |
|      | TD(1)              | 5000    | 10,000  | 0.03 | 0.1  | -24    |
| PIG  | Discrete           | 515,000 | 3.6 M   | 0    | 0    | 0      |
|      | ROUT               | 133     | 0.8 M   | 0.09 | 0.14 | -0.093 |
|      | TD(0) + explore    | 5 M     | 30 M    | 0.23 | 0.29 | -0.151 |
|      | TD(0.8) + explore  | 9 M     | 60 M    | 0.23 | 0.33 | -0.228 |
|      | TD(1) + explore    | 6 M     | 40 M    | 0.22 | 0.30 | -0.264 |
|      | TD(0) no explore   | 8+ M    | 50+ M   | 0.12 | 0.54 | -0.717 |
|      | TD(0.8) no explore | 5 M     | 35 M    | 0.33 | 0.44 | -0.308 |
|      | TD(1) no explore   | 5 M     | 30 M    | 0.23 | 0.32 | -0.186 |
| BAND | Discrete           | 736,281 | 4 M     | 0    | 0    | 0.682  |
|      | ROUT               | 30      | 15,850  | 0.01 | 0.05 | 0.668  |
|      | TD(0)              | 150,000 | 900,000 | 0.07 | 0.14 | 0.666  |
|      | TD(1)              | 100,000 | 600,000 | 0.02 | 0.04 | 0.669  |

Table 1: Comparison of the exact discrete solution, ROUT, and TD on the HOP, PIG, and BAND problems.
Figure 9: Performance of Pig policies learned by TD and ROUT. ROUT's
performance is marked by a single diamond at the top left of the
graph.
The Pig results are tabulated in Table 1 and graphed in Figure 9. The graph shows the learning curves for the best single trial of each of six classes of runs: TD(0), TD(0.8), and TD(1), each with and without exploration. (The vertical axis measures performance in expected points per game against the minimax-optimal player, where +1 point is awarded for a win and -1 for a loss.) The best TD run, TD(0) with exploration, required about 30 million evaluations to reach its best performance of about -0.15. By contrast, ROUT completed successfully in under 1 million evaluations and performed at the significantly higher level of -0.09. ROUT's adaptively-generated training set contained only 133 states.
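To make this performance metric concrete, the following is a minimal Monte Carlo sketch of how expected points per game could be estimated for a learned policy against an opponent's policy. Here `learned_policy` and `optimal_policy` are hypothetical callables mapping (my total, opponent total, subtotal) to 'roll' or 'hold'; the paper's actual evaluation procedure may differ.

```python
import random

def expected_points(learned_policy, optimal_policy, n_games=10_000, rng=random):
    """Monte Carlo estimate of expected points per game for `learned_policy`
    against `optimal_policy`, where a win scores +1 and a loss -1."""
    score = 0
    for g in range(n_games):
        totals = [0, 0]                      # index 0 = learner, 1 = opponent
        policies = [learned_policy, optimal_policy]
        mover = g % 2                        # alternate who moves first
        while max(totals) < 100:
            me, opp = mover, 1 - mover
            subtotal = 0
            while True:
                if totals[me] + subtotal >= 100 or \
                        policies[me](totals[me], totals[opp], subtotal) == 'hold':
                    totals[me] += subtotal   # bank the subtotal and end the turn
                    break
                roll = rng.randint(1, 6)
                if roll == 1:                # lose the subtotal, bank only 1 point
                    totals[me] += 1
                    break
                subtotal += roll
            mover = opp
        score += 1 if totals[0] >= 100 else -1
    return score / n_games
```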