When there are multiple states, but reinforcement is still immediate, then any of the above solutions can be replicated, once for each state. However, when generalization is required, these solutions must be integrated with generalization methods (see section 6); this is straightforward for the simple ad-hoc methods, but it is not understood how to maintain theoretical guarantees.
Many of these techniques focus on converging to some regime in which exploratory actions are taken rarely or never; this is appropriate when the environment is stationary. However, when the environment is non-stationary, exploration must continue to take place, in order to notice changes in the world. Again, the more ad-hoc techniques can be modified to deal with this in a plausible manner (keep temperature parameters from going to 0; decay the statistics in interval estimation), but none of the theoretically guaranteed methods can be applied.