Another simple exploration strategy is to take the action with the best estimated expected reward by default, but with probability p, choose an action at random. Some versions of this strategy start with a large value of p to encourage initial exploration and then slowly decrease it.
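The strategy just described can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the list-of-estimates representation, and the geometric decay schedule with its default constants are all assumptions made for the example; the text only says that p starts large and is slowly decreased.

```python
import random


def choose_action(estimates, p, rng=random):
    """Pick an action index given estimated expected rewards.

    With probability p a uniformly random action is chosen (exploration);
    otherwise the action with the highest estimate is chosen (exploitation).
    """
    if rng.random() < p:
        return rng.randrange(len(estimates))
    # Greedy choice; ties are broken in favour of the lowest index.
    return max(range(len(estimates)), key=estimates.__getitem__)


def decayed_p(step, p_start=1.0, p_min=0.05, decay=0.99):
    """Hypothetical schedule: shrink p geometrically, never below p_min."""
    return max(p_min, p_start * decay ** step)


if __name__ == "__main__":
    estimates = [0.1, 0.9, 0.3]
    for step in (0, 100, 1000):
        p = decayed_p(step)
        print(step, round(p, 3), choose_action(estimates, p))
```

Note that the random branch draws from all actions, including the greedy one, so the greedy action is actually taken with probability slightly higher than 1 - p.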
An objection to this simple strategy is that when it experiments with a non-greedy action, it is no more likely to try a promising alternative than a clearly hopeless one. A slightly more sophisticated strategy is Boltzmann exploration. In this case, the expected reward for taking action a, ER(a), is used to choose an action probabilistically according to the distribution

\[
P(a) = \frac{e^{ER(a)/T}}{\sum_{a'} e^{ER(a')/T}} .
\]
The temperature parameter T can be decreased over time to decrease exploration. This method works well if the best action is well separated from the others, but suffers somewhat when the values of the actions are close. It may also converge unnecessarily slowly unless the temperature schedule is manually tuned with great care.
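As a rough sketch of Boltzmann exploration, the code below assigns each action a probability proportional to exp(ER(a)/T) and samples from that distribution. The function names and the list representation of ER are assumptions for illustration; subtracting the maximum before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math
import random


def boltzmann_probabilities(expected_rewards, temperature):
    """Return P(a) proportional to exp(ER(a) / T) for each action a."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [er / temperature for er in expected_rewards]
    # Shift by the maximum so that exp() cannot overflow; the shift
    # cancels out when the weights are normalised.
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_action(expected_rewards, temperature, rng=random):
    """Sample an action index from the Boltzmann distribution."""
    probs = boltzmann_probabilities(expected_rewards, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    er = [1.0, 0.5, 0.0]
    for t in (10.0, 1.0, 0.1):
        print(t, [round(q, 3) for q in boltzmann_probabilities(er, t)])
```

Running the example shows the distribution moving from nearly uniform at high T towards nearly greedy at low T. It also illustrates the weakness noted above: when two values of ER are close, their probabilities stay close unless T is made very small.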