Given these assumptions, if there are only a finite number of actions to choose from on each turn, then an optimal policy is guaranteed to exist. In games where the players take turns, there is an optimal policy that is deterministic: the move can be chosen as a function of the state. This is called a "pure strategy". In games where the players move simultaneously, sometimes only nondeterministic policies are optimal. In this case, an optimal player randomly chooses an action on each turn, with probabilities that are a function of the state. This is called a "mixed strategy".
In either case, "optimal" is defined the same way. If both players use optimal policies, then the average discounted reinforcement received is called the "value" of the game. If one player uses an optimal policy and the opponent uses any other policy, then the player using the optimal policy will, on average, achieve at least the value, and possibly more. So when two optimal players play each other, neither ever has any reason to change policies, even if the other player's policy is known. When playing against a very bad player, however, the "optimal" policy may not give the best possible reinforcement, since it overestimates the opponent's abilities and may fail to exploit the opponent's ignorance in the best way.
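As a concrete illustration (matching pennies, a standard textbook example that is not discussed in the text above), suppose both players simultaneously show heads or tails, and player 1 receives +1 if the two choices match and -1 otherwise. The sketch below, assuming numpy is available, checks that every pure strategy for player 1 can be exploited by the opponent, while the 50/50 mixed strategy guarantees the value of the game, which is 0, against any opponent:

    import numpy as np

    # Payoff matrix for player 1 in matching pennies.  Rows are player 1's
    # actions (heads, tails), columns are player 2's actions; player 2's
    # payoff is the negative of each entry.
    A = np.array([[+1.0, -1.0],
                  [-1.0, +1.0]])

    # Any pure (deterministic) policy can be exploited: the opponent's best
    # response drives player 1's payoff down to -1.
    for row in range(2):
        print("pure strategy", row, "worst-case payoff:", A[row, :].min())

    # The mixed policy (0.5, 0.5) achieves the value 0 no matter what the
    # opponent does, so neither player has a reason to deviate from it.
    p = np.array([0.5, 0.5])
    print("mixed strategy (0.5, 0.5) worst-case payoff:", (p @ A).min())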
In games with an infinite number of actions, there may not be an optimal policy, even if there is only one state (for example, in the single-state game where each player simultaneously names an integer and the larger number wins, any policy, pure or mixed, can be beaten with probability arbitrarily close to 1, so no optimal policy exists). With a finite number of actions, there is always an optimal policy.
Reinforcement learning with games is much like reinforcement learning with ordinary MDPs. For example, when doing Q-learning on a game where the players take turns, the algorithm is exactly ordinary Q-learning, except that the max in the update equation becomes a min on the minimizing player's turn. If the players don't take turns but a deterministic policy is optimal, then the expression
    max_over_u [ Q(x,u) ]

is replaced with

    max_over_u1 [ min_over_u2 [ Q(x,u1,u2) ] ]

in the Q-learning algorithm. In both cases, the expression gives the value of state x, which is then used to train a preceding state. In the case where the optimal policy is stochastic, the value of a state is harder to calculate. First, Q(x,u1,u2) is evaluated for state x and all possible (u1,u2) pairs, and the results are put in a matrix. This matrix is then treated as the payoff matrix for a single-turn game, and the value of that game is found using linear programming. The result gives both the value of state x and the probabilities for the optimal policy of each player in state x.
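The following sketch shows how the value of a state x might be computed in each of these cases, assuming the Q-values are stored in numpy arrays and that scipy's linprog routine is used for the linear program (the function names, array layout, and use of scipy are assumptions made for this sketch, not something taken from the text above):

    import numpy as np
    from scipy.optimize import linprog

    def value_alternating(q_row, maximizing_players_turn):
        # Turn-taking game: the value of state x is the max of Q(x,u) on the
        # maximizing player's turn and the min on the opponent's turn.
        return q_row.max() if maximizing_players_turn else q_row.min()

    def value_deterministic_simultaneous(q_matrix):
        # Simultaneous moves, deterministic optimum:
        #     max_over_u1 [ min_over_u2 [ Q(x,u1,u2) ] ]
        return q_matrix.min(axis=1).max()

    def value_mixed_simultaneous(q_matrix):
        # Simultaneous moves, stochastic optimum: treat Q(x,.,.) as the
        # payoff matrix of a single-turn game and solve that game with a
        # linear program.  The variables are the value v and player 1's
        # action probabilities p; maximize v subject to
        #     sum_over_u1 [ p[u1] * Q[u1,u2] ] >= v   for every u2,
        #     sum(p) = 1,   p >= 0.
        n_u1, n_u2 = q_matrix.shape
        c = np.concatenate(([-1.0], np.zeros(n_u1)))          # minimize -v
        A_ub = np.hstack([np.ones((n_u2, 1)), -q_matrix.T])   # v - p.Q[:,u2] <= 0
        b_ub = np.zeros(n_u2)
        A_eq = np.concatenate(([0.0], np.ones(n_u1))).reshape(1, -1)
        b_eq = np.array([1.0])
        bounds = [(None, None)] + [(0.0, 1.0)] * n_u1
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds)
        return res.x[0], res.x[1:]   # value of state x, and player 1's mix

Whichever case applies, the resulting value of state x then plays the same role as the max term in ordinary Q-learning when training the preceding state. Running value_mixed_simultaneous on the matching-pennies matrix above would return a value of 0 and probabilities of roughly (0.5, 0.5); the opponent's probabilities come from the dual of the same linear program, which this sketch does not compute.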