Here are the ROUT algorithm and HUNTFRONTIERSTATE subroutine, as described in Section 2.3.
ROUT(start states , fitter F): |
/* Assumes that the world model MDP is known and acyclic. */ |
initialize training set , and F := an arbitrary fit; |
repeat: |
for each start state not yet marked ``done'', do: |
s := HUNTFRONTIERSTATE(x, F); |
add s to training set S and re-train fitter F on S; |
if (s = x), then mark start state x as ``done''. |
until all start states in are marked ``done''. |
HUNTFRONTIERSTATE(state x, fit F): |
/* |
for each legal action , do: |
repeat up to H times: |
generate a trajectory from x to termination, starting with action a; |
let y be the last state on with Bellman residual ; |
if and , then break out of loops, and |
restart procedure with HUNTFRONTIERSTATE(y, F). |
/* reaching this point, x's subtree is deemed all self-consistent and correct! */ |
return x. |