If the agent is going to be acting for a total of h steps,
it can use basic Bayesian reasoning to solve for an optimal
strategy [12]. This requires an assumed prior joint
distribution for the parameters , the most natural of which
is that each
is independently uniformly distributed between 0
and 1. We compute a mapping from belief states (summaries of
the agent's experiences during this run) to actions. Here, a belief
state can be represented as a tabulation of action choices and
payoffs:
denotes a
state of play in which each arm i has been pulled
times with
payoffs. We write
as the expected
payoff remaining, given that a total of h pulls are available, and
we use the remaining pulls optimally.
If , then there are no remaining pulls, and
. This is the basis of a recursive
definition. If we know the
value for all belief states with t
pulls remaining, we can compute the
value of any belief state
with t+1 pulls remaining:
where is the posterior subjective probability of action i
paying off given
,
and our prior probability. For the
uniform priors, which result in a beta distribution,
.
The expense of filling in the table of values in this way for
all attainable belief states is linear in the number of belief states
times actions, and thus exponential in the horizon.