Given these assumptions, if there are only a finite number of actions to choose from on each turn, then an optimal policy is guaranteed to exist. In games where the players take turns, there is an optimal policy that is deterministic: the move can be chosen as a function of the state. This is called a "pure strategy". In games where the players move simultaneously, sometimes only nondeterministic policies are optimal. In this case, an optimal player randomly chooses an action on each turn, with probabilities that are a function of the state. This is called a "mixed strategy".
In either case, "optimal" is defined the same way. If both players use optimal policies, then the average discounted reinforcement received is called the "value" of the game. If one player uses an optimal policy and the opponent uses any other policy, then the player using the optimal policy will, on average, achieve at least the value, and possibly more. So when two optimal players play each other, neither ever has any reason to change policies, even if the other player's policy is known. When playing against a very bad player, however, the "optimal" policy may not give the best possible reinforcement, since it overestimates the opponent's abilities and may fail to exploit the opponent's ignorance in the best way.
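As a concrete illustration (matching pennies, a standard textbook example that is not discussed in the text above), suppose both players simultaneously show heads or tails, and player 1 receives +1 if the two choices match and -1 otherwise. The sketch below, assuming numpy is available, checks that every pure strategy for player 1 can be exploited by the opponent, while the 50/50 mixed strategy guarantees the value of the game, which is 0, against any opponent:

    import numpy as np

    # Payoff matrix for player 1 in matching pennies.  Rows are player 1's
    # actions (heads, tails), columns are player 2's actions; player 2's
    # payoff is the negative of each entry.
    A = np.array([[+1.0, -1.0],
                  [-1.0, +1.0]])

    # Any pure (deterministic) policy can be exploited: the opponent's best
    # response drives player 1's payoff down to -1.
    for row in range(2):
        print("pure strategy", row, "worst-case payoff:", A[row, :].min())

    # The mixed policy (0.5, 0.5) achieves the value 0 no matter what the
    # opponent does, so neither player has a reason to deviate from it.
    p = np.array([0.5, 0.5])
    print("mixed strategy (0.5, 0.5) worst-case payoff:", (p @ A).min())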
In games with an infinite number of actions, there may not be an optimal policy, even if there is only one state (for example, in the single-state game where each player simultaneously names an integer and the larger number wins, any policy, pure or mixed, can be beaten with probability arbitrarily close to 1, so no optimal policy exists). With a finite number of actions, there is always an optimal policy.
Reinforcement learning with games is much like reinforcement learning with ordinary MDPs. For example, when doing Q-learning on a game where the players take turns, the algorithm is exactly ordinary Q-learning, except that the max in the update equation becomes a min on the minimizing player's turn. If the players don't take turns but a deterministic policy is optimal, then the expression
    max_over_u [ Q(x,u) ]

is replaced with

    max_over_u1 [ min_over_u2 [ Q(x,u1,u2) ] ]

in the Q-learning algorithm. In both cases, the expression gives the value of state x, which is then used to train a preceding state. In the case where the optimal policy is stochastic, the value of a state is harder to calculate. First, Q(x,u1,u2) is evaluated for state x and all possible (u1,u2) pairs, and the results are put in a matrix. This matrix is then treated as the payoff matrix for a single-turn game, and the value of that game is found using linear programming. The result gives both the value of state x and the probabilities for the optimal policy of each player in state x.
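The following sketch shows how the value of a state x might be computed in each of these cases, assuming the Q-values are stored in numpy arrays and that scipy's linprog routine is used for the linear program (the function names, array layout, and use of scipy are assumptions made for this sketch, not something taken from the text above):

    import numpy as np
    from scipy.optimize import linprog

    def value_alternating(q_row, maximizing_players_turn):
        # Turn-taking game: the value of state x is the max of Q(x,u) on the
        # maximizing player's turn and the min on the opponent's turn.
        return q_row.max() if maximizing_players_turn else q_row.min()

    def value_deterministic_simultaneous(q_matrix):
        # Simultaneous moves, deterministic optimum:
        #     max_over_u1 [ min_over_u2 [ Q(x,u1,u2) ] ]
        return q_matrix.min(axis=1).max()

    def value_mixed_simultaneous(q_matrix):
        # Simultaneous moves, stochastic optimum: treat Q(x,.,.) as the
        # payoff matrix of a single-turn game and solve that game with a
        # linear program.  The variables are the value v and player 1's
        # action probabilities p; maximize v subject to
        #     sum_over_u1 [ p[u1] * Q[u1,u2] ] >= v   for every u2,
        #     sum(p) = 1,   p >= 0.
        n_u1, n_u2 = q_matrix.shape
        c = np.concatenate(([-1.0], np.zeros(n_u1)))          # minimize -v
        A_ub = np.hstack([np.ones((n_u2, 1)), -q_matrix.T])   # v - p.Q[:,u2] <= 0
        b_ub = np.zeros(n_u2)
        A_eq = np.concatenate(([0.0], np.ones(n_u1))).reshape(1, -1)
        b_eq = np.array([1.0])
        bounds = [(None, None)] + [(0.0, 1.0)] * n_u1
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds)
        return res.x[0], res.x[1:]   # value of state x, and player 1's mix

Whichever case applies, the resulting value of state x then plays the same role as the max term in ordinary Q-learning when training the preceding state. Running value_mixed_simultaneous on the matching-pennies matrix above would return a value of 0 and probabilities of roughly (0.5, 0.5); the opponent's probabilities come from the dual of the same linear program, which this sketch does not compute.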