Our third test problem is to compute the optimal policy for a finite-horizon k-armed bandit [Berry and Fristedt, 1985]. While an optimal solution in the infinite-horizon case can be found efficiently using Gittins indices, solving the finite-horizon problem is equivalent to solving a large acyclic, stochastic MDP in belief space [Berry and Fristedt, 1985]. We show results for k=3 arms and a horizon of n=25 pulls, where the resulting MDP has 736,281 states. Solving this MDP by DAG-SP produces the optimal exploration policy, which has an expected reward of 0.6821 per pull.
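As a concrete sketch of the belief-space formulation, the following Python performs backward induction over count-vector belief states, assuming independent Bernoulli arms with uniform Beta(1,1) priors (an assumption; the prior is not restated in this excerpt). The number of count vectors reachable within 25 pulls over 3 arms is C(31,6) = 736,281, matching the state count above. The recursion is the same Bellman backup that DAG-SP performs over the acyclic MDP; memoized recursion is used here purely for brevity, not as the paper's implementation.

```python
from functools import lru_cache
from math import comb

K, N = 3, 25                     # arms and horizon (pulls)

def posterior_mean(wins, losses):
    # Posterior success probability under a uniform Beta(1,1) prior (assumption).
    return (wins + 1) / (wins + losses + 2)

@lru_cache(maxsize=None)
def value(counts, pulls_left):
    """Optimal expected remaining reward from belief state `counts`,
    a tuple of (wins, losses) pairs, one per arm."""
    if pulls_left == 0:
        return 0.0
    best = 0.0
    for i, (w, l) in enumerate(counts):
        p = posterior_mean(w, l)
        win = counts[:i] + ((w + 1, l),) + counts[i + 1:]
        loss = counts[:i] + ((w, l + 1),) + counts[i + 1:]
        best = max(best, p * (1.0 + value(win, pulls_left - 1))
                         + (1.0 - p) * value(loss, pulls_left - 1))
    return best

start = tuple((0, 0) for _ in range(K))
print("belief states:", comb(N + 2 * K, 2 * K))          # 736281
print("expected reward per pull:", value(start, N) / N)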
We encoded each state as a six-dimensional feature vector and attempted to learn a neural network approximation to the optimal value function with TD(0), TD(1), and ROUT. Again, the parameters for all algorithms were tuned by hand.
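The exact feature encoding and network architecture are not restated in this excerpt, so the sketch below assumes the six features are the raw per-arm win/loss counts (which also determine the number of pulls remaining) and substitutes a simple linear approximator for the neural network. It shows only a single TD(0) backup applied along one sampled trajectory, not the full training loop or exploration scheme used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(counts):
    # Six features: (wins, losses) for each of the three arms.
    # (Assumption: the paper's exact encoding/scaling is not given here.)
    return np.array([x for pair in counts for x in pair], dtype=float)

w = np.zeros(6)                  # linear stand-in for the neural network
alpha = 0.01                     # step size (hypothetical value)

def td0_update(counts, reward, next_counts, terminal):
    """One undiscounted TD(0) backup along a sampled trajectory."""
    global w
    phi = features(counts)
    v_next = 0.0 if terminal else w @ features(next_counts)
    delta = reward + v_next - w @ phi
    w = w + alpha * delta * phi

# One simulated trajectory under a uniformly random behavior policy,
# with true arm probabilities drawn from the uniform prior.
p_true = rng.uniform(size=3)
counts = [[0, 0] for _ in range(3)]
for t in range(25):
    arm = int(rng.integers(3))
    r = float(rng.random() < p_true[arm])
    prev = tuple(tuple(c) for c in counts)
    counts[arm][0 if r else 1] += 1
    td0_update(prev, r, tuple(tuple(c) for c in counts), terminal=(t == 24))
```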
The results are shown in Table 1. All methods do spectacularly well, although the TD methods again require more trajectories and more evaluations. Careful inspection of the problem reveals that a globally linear value function, extrapolated from the states close to the end, has low Bellman residual and performs very nearly optimally. Both ROUT and TD successfully exploit this linearity.