Exploration is often more efficient when it is based on second-order
information about the certainty or variance of the estimated values of
actions. Kaelbling's interval estimation
algorithm [52] stores statistics for each action :
is the number of successes and
the number of trials. An
action is chosen by computing the upper bound of a
confidence interval on the success probability
of each action and choosing the action with the highest upper bound.
Smaller values of the
parameter encourage greater
exploration. When payoffs are boolean, the normal approximation to
the binomial distribution can be used to construct the confidence
interval (though the binomial should be used for small n). Other
payoff distributions can be handled using their associated statistics
or with nonparametric methods. The method works very well in
empirical trials. It is also related to a certain class of statistical
techniques known as experiment design methods [17],
which are used for comparing multiple treatments (for example,
fertilizers or drugs) to determine which treatment (if any) is best in
as small a set of experiments as possible.