Here are the ROUT algorithm and HUNTFRONTIERSTATE subroutine, as described in Section 2.3.
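ROUT(start states $\hat{X}$):
  /* Assumes that the world model MDP is known and acyclic. */
  initialize training set $S := \emptyset$ and $F :=$ an arbitrary fit;
  repeat:
    for each start state $x \in \hat{X}$ not yet marked "done", do:
      s := HUNTFRONTIERSTATE(x, F);
      add $\{s \mapsto \text{one-step-backup}(s)\}$ to training set $S$ and re-train $F$;
      if (s = x), then mark start state x as "done".
  until all start states in $\hat{X}$ are marked "done".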
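HUNTFRONTIERSTATE(state x, fit F):
  /* $\epsilon$ is the tolerance on Bellman error; H is the number of trajectories to try per action. */
  for each legal action $a \in A(x)$, do:
    repeat up to H times:
      generate a trajectory $T$ from x to termination, starting with action a;
      let y be the last state on $T$ with Bellman error $> \epsilon$;
      if ($y \neq \emptyset$) and ($y \neq x$), then break out of the loops and
        restart procedure with HUNTFRONTIERSTATE(y, F).
  /* reaching this point, x's subtree is deemed all self-consistent and correct! */
  return x.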
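
The two routines above can be made concrete on a toy problem. The following Python is a minimal sketch under stated assumptions, not the authors' implementation: the model format `MODEL`, the helper names, the tolerance `eps`, and the defaults for `H` and `seed` are all hypothetical. It further assumes that the "fit" is a plain lookup table that returns 0 for unseen states, that the one-step backup maximizes expected reward plus fitted successor value, and that a trajectory follows the greedy policy under F after its forced first action.

```python
import random

# Hypothetical toy model: state -> {action: [(prob, reward, next_state), ...]}.
# A state with no actions is terminal. The graph must be acyclic.
MODEL = {
    "A": {"left": [(1.0, 1.0, "B")], "right": [(1.0, 0.0, "C")]},
    "B": {"go": [(1.0, 2.0, "D")]},
    "C": {"go": [(1.0, 5.0, "D")]},
    "D": {},
}

def value(fit, s):
    # Assumed table-lookup fit: unseen states evaluate to 0.
    return fit.get(s, 0.0)

def q_value(model, fit, s, a):
    # Expected reward plus fitted value of the successor.
    return sum(p * (r + value(fit, s2)) for p, r, s2 in model[s][a])

def backup(model, fit, s):
    # One-step backup (assumed reward-maximizing); terminal states are worth 0.
    if not model[s]:
        return 0.0
    return max(q_value(model, fit, s, a) for a in model[s])

def bellman_error(model, fit, s):
    return abs(value(fit, s) - backup(model, fit, s))

def sample_next(model, s, a, rng):
    outcomes = model[s][a]
    weights = [p for p, _, _ in outcomes]
    return rng.choices(outcomes, weights=weights)[0][2]

def trajectory(model, fit, x, a, rng):
    # Start at x with action a, then act greedily under the fit until termination.
    states = [x]
    s = sample_next(model, x, a, rng)
    while True:
        states.append(s)
        if not model[s]:
            return states
        a = max(model[s], key=lambda b: q_value(model, fit, s, b))
        s = sample_next(model, s, a, rng)

def hunt_frontier_state(model, fit, x, eps, H, rng):
    restart = True
    while restart:  # each pass corresponds to one "restart procedure"
        restart = False
        for a in model[x]:
            for _ in range(H):
                bad = [s for s in trajectory(model, fit, x, a, rng)
                       if bellman_error(model, fit, s) > eps]
                if bad and bad[-1] != x:
                    x, restart = bad[-1], True  # hunt again from y
                    break
            if restart:
                break
    return x  # no inconsistent state was found beyond x

def rout(model, starts, eps=1e-6, H=3, seed=0):
    rng = random.Random(seed)
    training, fit, done = {}, {}, set()
    while any(x not in done for x in starts):
        for x in starts:
            if x in done:
                continue
            s = hunt_frontier_state(model, fit, x, eps, H, rng)
            training[s] = backup(model, fit, s)
            fit = dict(training)  # "re-train": the table memorizes the training set
            if s == x:
                done.add(x)
    return fit
```

On this model, `rout(MODEL, ["A"])` adds training values for B, then C, then A, and returns `{'B': 2.0, 'C': 5.0, 'A': 5.0}`: the frontier moves backward from the terminal state until the start state itself is self-consistent. A real function approximator would replace the `dict(training)` step with a regression fit over the training set.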