# Chaos in learning a simple two-person game

## Abstract

We investigate the problem of learning to play the game of rock–paper–scissors. Each player attempts to improve her/his average score by adjusting the frequency of the three possible responses, using reinforcement learning. For the zero-sum game the learning process displays Hamiltonian chaos. Thus, the learning trajectory can be simple or complex, depending on initial conditions. We also investigate the non-zero-sum case and show that it can give rise to chaotic transients. This is, to our knowledge, the first demonstration of Hamiltonian chaos in learning a basic two-person game, extending earlier findings of chaotic attractors in dissipative systems. As we argue here, chaos provides an important self-consistency condition for determining when players will learn to behave as though they were fully rational. That chaos can occur in learning a simple game indicates one should use caution in assuming real people will learn to play a game according to a Nash equilibrium strategy.


## Learning in Games

Most work in game theory and economics involves the assumption of perfect rationality. In this case, it is natural to characterize a game in terms of its Nash equilibria, at which neither player can achieve better performance by modifying her/his strategy. Under the more realistic assumption that the players are only boundedly rational and must learn their strategies, everything becomes more complicated. Under learning, the strategies may fail to converge to a Nash equilibrium (1) or may even be chaotic (2). Thus, understanding the learning dynamics is essential (3). Here we give an example of an elementary two-person game in which a standard learning procedure leads to Hamiltonian chaos. This example extends earlier work finding chaotic attractors in dissipative game dynamics (4). We argue that chaos is a necessary condition for intelligent adaptive players to fail to converge to a Nash equilibrium (for some related work, see ref. 5).

A good example is the game of rock–paper–scissors: rock beats scissors, paper beats rock, scissors beats paper. With possible relabelings of the three possible moves, such as "earwig–man–elephant," this ancient game is played throughout the world (6). To allow players to use their "skill," it is often played repeatedly. In contrast, two game theorists who practice what they preach would play with the skill-free Nash equilibrium mixed strategy, which is to choose the three possible moves randomly with equal probability. (In game theory, a mixed strategy is a random combination of the pure strategies—here, rock, paper, and scissors.) The Nash equilibrium mixed strategy for rock–paper–scissors has the advantage that, on average, no strategy can beat it, but it also has the disadvantage that there is no strategy that it can beat. An inspection of the World Rock–Paper–Scissors Society web site (http://www.worldrps.com/gbasics.html) suggests that members of this society do not play the Nash equilibrium strategy. Instead, they use psychology to try to anticipate the moves of the other player, or particular sequences of moves to try to induce responses in the other player. At least for this game, it seems that real people do not learn to act like the rational agents studied in standard game theory.

A failure to converge to a Nash equilibrium under learning can happen, for example, because the dynamics of the trajectories of the evolving strategies in the space of possibilities are chaotic. Chaos has been observed in games with spatial interactions (7) or in games based on the single-population replicator equation (4, 8). In the latter examples, players are drawn from a single population and the game is repeated only in a statistical sense, i.e., the players' identities change in repeated trials of the game.

The example we present here demonstrates Hamiltonian chaos in a two-person game, in which each player learns her/his own strategy. We observe this for a zero-sum game, i.e., one in which one player's win is always the other's loss. The observation of chaos is particularly striking because of the simplicity of the game. Because of the zero-sum condition, the learning dynamics have a conserved quantity with a Hamiltonian structure (9) similar to that of physical problems, such as celestial mechanics. There are no attractors, and trajectories do not approach the Nash equilibrium. Because of the Hamiltonian structure, the chaos is particularly complex, with chaotic orbits finely interwoven between regular orbits; for an arbitrary initial condition it is impossible to say *a priori* which type of behavior will result. When the zero-sum condition is violated we observe other complicated dynamical behaviors, such as heteroclinic orbits with chaotic transients. As discussed in the conclusions, the presence of chaos is important because it implies that it is not trivial to anticipate the behavior of the other player. Thus, under chaotic learning dynamics even intelligent adaptive agents may fail to converge to a Nash equilibrium.

## The Model of Learning Players

We investigate a game involving two players. At each move the first player chooses from one of *n* possible pure strategies (moves) with frequency **x** = (*x*_{1}, *x*_{2}, …, *x*_{n}), and similarly the second player chooses from one of *m* possible pure strategies with frequency **y** = (*y*_{1}, *y*_{2}, …, *y*_{m}). The players update **x** and **y** based on past experience by using reinforcement learning: behaviors that have been successful are reinforced, and those that have been unsuccessful are repressed. In the continuous time limit, where the change in **x** and **y** on any given time step goes to zero, it is possible to show under some plausible assumptions (11) that the reinforcement learning dynamics are described by coupled replicator equations (see *Notes*) of the form

\[ \dot{x}_{i} = x_{i}[(A\mathbf{y})_{i} - \mathbf{x}{\cdot}A\mathbf{y}] \quad (i = 1, \ldots, n), \]

[1]

\[ \dot{y}_{j} = y_{j}[(B\mathbf{x})_{j} - \mathbf{y}{\cdot}B\mathbf{x}] \quad (j = 1, \ldots, m), \]

[2]

where *A* and *B* are the payoff matrices for the first and second players, respectively. The relation of these equations to reinforcement learning is very intuitive. Consider the first equation, which describes the updating of the strategies of the first player: the frequency of strategy *i* increases in proportion to [current frequency (*x*_{i})] times [performance relative to the mean]. (*A***y**)_{i} is the performance of strategy *i* (averaged over the second player's possible moves), and **x**·*A***y** is the performance averaged over all *n* strategies of the first player. The second equation is similar.

We investigate the dynamics of a generalized rock–paper–scissors game whose payoff matrices are

\[ A = \begin{bmatrix} \varepsilon_{x} & -1 & 1 \\ 1 & \varepsilon_{x} & -1 \\ -1 & 1 & \varepsilon_{x} \end{bmatrix}, \qquad B = \begin{bmatrix} \varepsilon_{y} & -1 & 1 \\ 1 & \varepsilon_{y} & -1 \\ -1 & 1 & \varepsilon_{y} \end{bmatrix}, \]

[3]

where −1 < ɛ_{x} < 1 and −1 < ɛ_{y} < 1 are the payoffs when there is a tie. We place these bounds on ɛ because when they are violated the behavior under ties dominates, and the game more closely resembles a matching-pennies-type game with three strategies. We have placed the columns in the order "rock," "paper," and "scissors." For example, reading down the first column of *A*, in the case that the opponent plays "rock," we see that the payoff for using the pure strategy "rock" is ɛ_{x}, for "paper" is 1, and for "scissors" is −1.

The rock–paper–scissors game exemplifies a class of games in which no strategy is dominant and no pure-strategy Nash equilibrium exists (any pure strategy is vulnerable to another). An example of a possible application is two broadcasting companies competing for the same time slot when the preferences of the audience are context-dependent. Suppose, for example, that the audience prefers sports to news, news to drama, and drama to sports. If each broadcasting company must commit to its schedule without knowing that of its competitor, then the resulting game is of this type.
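The learning dynamics of Eqs. **1**–**3** are straightforward to simulate directly. The sketch below (Python with NumPy; the fixed-step RK4 integrator, step size, and run length are our illustrative choices, not taken from the paper) integrates the coupled replicator equations for the zero-sum case and confirms that both frequency vectors stay on the probability simplex:

```python
import numpy as np

def payoff(eps):
    # Generalized rock-paper-scissors payoff matrix (Eq. 3);
    # columns ordered rock, paper, scissors; eps is the payoff for a tie.
    return np.array([[eps, -1.0, 1.0],
                     [1.0, eps, -1.0],
                     [-1.0, 1.0, eps]])

def rhs(z, A, B):
    # Coupled replicator equations (Eqs. 1 and 2) on the stacked state z = (x, y).
    x, y = z[:3], z[3:]
    Ay, Bx = A @ y, B @ x
    return np.concatenate([x * (Ay - x @ Ay), y * (Bx - y @ Bx)])

def rk4(z, A, B, dt):
    # One fixed-step fourth-order Runge-Kutta step.
    k1 = rhs(z, A, B)
    k2 = rhs(z + 0.5 * dt * k1, A, B)
    k3 = rhs(z + 0.5 * dt * k2, A, B)
    k4 = rhs(z + dt * k3, A, B)
    return z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

A, B = payoff(0.25), payoff(-0.25)        # zero-sum case: eps_x = -eps_y = 0.25
z = np.concatenate([[0.5, 0.01, 0.49],    # the k = 1 initial condition from the text
                    [0.5, 0.25, 0.25]])
for _ in range(10000):                    # integrate to t = 100
    z = rk4(z, A, B, 0.01)
x, y = z[:3], z[3:]
print(x, y)   # both remain valid probability distributions
```

Because the replicator field vanishes on the simplex boundary and preserves the normalization of **x** and **y**, a reasonably accurate integrator keeps the state inside the simplex without any explicit projection step.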

We consider the general case that a tie is not equivalent for both players, i.e., ɛ_{x} ≠ ɛ_{y}. In the example above, this asymmetry would arise if the audience believes that within any given category one company's programming is superior to the other's. If the size of the audience is fixed, so that one company's gain is the other's loss, this is a zero-sum game, corresponding to the condition ɛ_{x} = −ɛ_{y} = ɛ. In general, Eqs. **1** and **2** form a conservative system, which cannot have an attractor. If in addition *A* = −*B*^{t}, it is known that the dynamics are Hamiltonian (9). This is a stronger condition, as it implies the full dynamical structure of classical mechanics, with pairwise conjugate coordinates obeying Liouville's theorem.

## Dynamical Behavior of the System

To visualize the behavior of this system, in Fig. 1 we show Poincaré sections of Eqs. **1** and **2** with the initial conditions (*x*_{1}, *x*_{2}, *x*_{3}, *y*_{1}, *y*_{2}, *y*_{3}) = (0.5, 0.01*k*, 0.5 − 0.01*k*, 0.5, 0.25, 0.25) with *k* = 1, 2, … , 25. This is a sample of points where the trajectories intersect the hyperplane *x*_{2} − *x*_{1} + *y*_{2} − *y*_{1} = 0. When ɛ = 0, our simulation indicates that the system is integrable, and trajectories are confined to quasi-periodic tori. When ɛ > 0, however, this is no longer guaranteed. As we vary ɛ from 0 to 0.5 without changing initial conditions, some tori collapse and become chaotic, and the trajectories cover a larger region of the strategy space. Regular and chaotic trajectories are finely interwoven; for typical behavior of this type, there is a regular orbit arbitrarily close to any chaotic orbit (12).

Figure 1
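A Poincaré section of this kind can be sampled numerically by recording the states at which the trajectory crosses the hyperplane *x*_{2} − *x*_{1} + *y*_{2} − *y*_{1} = 0, detected as a sign change between successive integration steps. A minimal sketch (RK4 integrator and parameters are our illustrative assumptions; recorded points lie within one time step of the hyperplane rather than exactly on it):

```python
import numpy as np

def rhs(z, A, B):
    # Coupled replicator field (Eqs. 1 and 2) on the stacked state z = (x, y).
    x, y = z[:3], z[3:]
    Ay, Bx = A @ y, B @ x
    return np.concatenate([x * (Ay - x @ Ay), y * (Bx - y @ Bx)])

def rk4(z, A, B, dt):
    k1 = rhs(z, A, B)
    k2 = rhs(z + 0.5 * dt * k1, A, B)
    k3 = rhs(z + 0.5 * dt * k2, A, B)
    k4 = rhs(z + dt * k3, A, B)
    return z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def poincare_section(z, A, B, dt=0.01, steps=50000):
    """Collect states where the orbit crosses x2 - x1 + y2 - y1 = 0 upward."""
    plane = lambda s: s[1] - s[0] + s[4] - s[3]
    points, s_prev = [], plane(z)
    for _ in range(steps):
        z = rk4(z, A, B, dt)
        s = plane(z)
        if s_prev < 0 <= s:            # sign change: record the crossing
            points.append(z.copy())
        s_prev = s
    return np.array(points)

eps = 0.25
A = np.array([[eps, -1, 1], [1, eps, -1], [-1, 1, eps]], float)
B = -A.T                               # zero-sum condition
z0 = np.concatenate([[0.5, 0.01, 0.49], [0.5, 0.25, 0.25]])   # k = 1
pts = poincare_section(z0, A, B)
print(len(pts))                        # many crossings accumulate over t = 500
```

Plotting, say, (*x*_{1}, *y*_{1}) of the collected points for several values of *k* reproduces the qualitative structure described above: closed torus-like curves for small ɛ and scattered clouds for chaotic orbits.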

To demonstrate that these trajectories are indeed chaotic, we numerically compute Lyapunov exponents, which can be viewed as generalizations of eigenvalues that remain well defined for chaotic dynamics. Positive values indicate directions of average local exponential expansion, and negative values indicate local exponential contraction. Some examples are given in Table 1. The largest Lyapunov exponents are clearly positive for the first three initial conditions when ɛ = 0.25 and for the first four initial conditions when ɛ = 0.5. An indication of the accuracy of these computations can be obtained by comparing to known cases: because of the conservation condition the four exponents always sum to zero; because of the special nature of motion along trajectories plus the Hamiltonian condition the second and third are always zero; and when ɛ = 0, because the motion is integrable, all Lyapunov exponents are exactly zero.

Table 1

| ɛ | λ | *k* = 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | λ_{1} | +1.0 | +1.4 | +0.4 | +0.4 | +0.4 |
| | λ_{2} | +0.2 | +0.3 | +0.3 | +0.3 | +0.3 |
| | λ_{3} | −0.5 | −0.4 | −0.3 | −0.3 | −0.3 |
| | λ_{4} | −0.7 | −1.2 | −0.4 | −0.4 | −0.4 |
| 0.25 | λ_{1} | **+49.0** | **+35.3** | **+16.6** | +0.4 | +0.4 |
| | λ_{2} | +0.3 | +0.3 | +0.4 | +0.2 | +0.3 |
| | λ_{3} | −0.3 | −0.1 | −0.4 | −0.2 | −0.3 |
| | λ_{4} | **−49.0** | **−35.5** | **−16.5** | −0.4 | −0.4 |
| 0.50 | λ_{1} | **+61.6** | **+35.0** | **+28.1** | **+12.1** | +0.2 |
| | λ_{2} | +0.6 | +0.3 | +0.1 | +0.0 | +0.2 |
| | λ_{3} | −0.6 | −0.4 | −0.2 | −0.1 | −0.2 |
| | λ_{4} | **−61.5** | **−35.8** | **−28.0** | **−12.2** | −0.3 |

*k* = 1, 2, 3, 4, 5 correspond to the initial conditions (*x*_{1}, *x*_{2}, *x*_{3}, *y*_{1}, *y*_{2}, *y*_{3}) = (0.5, 0.01*k*, 0.5 − 0.01*k*, 0.5, 0.25, 0.25). The Lyapunov exponents are multiplied by 10^{3}. Note that λ_{2} ≃ 0.0, λ_{3} ≃ 0.0, and λ_{4} ≃ −λ_{1}, as expected. The Lyapunov exponents indicating chaos are shown in boldface.

As mentioned already, this game has a unique Nash equilibrium at which all responses are equally likely, i.e., *x*_{1}^{*} = *x*_{2}^{*} = *x*_{3}^{*} = *y*_{1}^{*} = *y*_{2}^{*} = *y*_{3}^{*} = 1/3. It is possible to show that all trajectories have the same payoff as the Nash equilibrium on average (13). However, on any given step there are significant deviations from this payoff, which are larger than those of the Nash equilibrium. Thus, a risk-averse agent would prefer the Nash equilibrium to a chaotic orbit.

The behavior of the non-zero-sum game is also interesting and unusual. When ɛ_{x} + ɛ_{y} < 0 (e.g., ɛ_{x} = −0.1, ɛ_{y} = 0.05), the motion approaches a heteroclinic cycle, as shown in Fig. 2.

Figure 2

Players switch between pure strategies in the order *rock* → *paper* → *scissors*. The time spent near each pure strategy increases linearly with time; in the analogous behavior of the standard single-population replicator model, this time instead increases exponentially. When ɛ_{x} + ɛ_{y} > 0 (e.g., ɛ_{x} = 0.1, ɛ_{y} = −0.05), as shown in Fig. 3, the behavior is similar, except that the time spent near each pure strategy varies irregularly. The orbit is an infinitely persistent chaotic transient (14).

Figure 3
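The approach to the heteroclinic cycle can be observed numerically: when ɛ_{x} + ɛ_{y} < 0, the running minimum of the strategy frequencies keeps shrinking as the orbit spends longer and longer excursions near each pure strategy on the simplex boundary. A sketch under the same integration assumptions as above (the interior initial condition and run length are our illustrative choices):

```python
import numpy as np

def rhs(z, A, B):
    # Coupled replicator field (Eqs. 1 and 2) on the stacked state z = (x, y).
    x, y = z[:3], z[3:]
    Ay, Bx = A @ y, B @ x
    return np.concatenate([x * (Ay - x @ Ay), y * (Bx - y @ Bx)])

def rk4(z, A, B, dt):
    k1 = rhs(z, A, B)
    k2 = rhs(z + 0.5 * dt * k1, A, B)
    k3 = rhs(z + 0.5 * dt * k2, A, B)
    k4 = rhs(z + dt * k3, A, B)
    return z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def payoff(eps):
    return np.array([[eps, -1, 1], [1, eps, -1], [-1, 1, eps]], float)

A, B = payoff(-0.1), payoff(0.05)     # eps_x + eps_y < 0: heteroclinic case
z = np.concatenate([[0.4, 0.3, 0.3], [0.35, 0.35, 0.30]])
closest = 1.0
for _ in range(100000):               # integrate to t = 1000
    z = rk4(z, A, B, 0.01)
    closest = min(closest, z.min())   # smallest frequency seen so far
print(closest)   # shrinks toward zero as the orbit hugs the boundary
```

Tracking which component of **x** is currently largest also reveals the *rock* → *paper* → *scissors* switching order described in the text.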

## Chaos in Learning and Rationality

The emergence of chaos in learning in such a simple game illustrates that rationality may be an unrealistic approximation even in elementary settings. Chaos provides an important self-consistency condition. When the learning of her/his opponent is regular, any agent with even a crude ability to extrapolate can exploit this to improve performance. Nonchaotic learning trajectories are thus a symptom that the learning algorithm is too crude to represent the behavior of a human agent. When the behavior is chaotic, however, extrapolation is difficult, even for intelligent humans. Hamiltonian chaos is particularly complex, because of the lack of attractors and the fine interweaving of regular and irregular motion. This situation is compounded for high-dimensional chaotic behavior, because of the "curse of dimensionality" (15). In dimensions greater than about five, the amount of data an "econometric" agent would need to collect to build a reasonable model to extrapolate the learning behavior of her/his opponent becomes enormous. For games with more players it is possible to extend the replicator framework to systems of arbitrary dimension (Y.S. and J. P. Crutchfield, unpublished observations). It is striking that low-dimensional chaos can occur even in a game as simple as the one we study here. The phase space of this game is four dimensional, which is the lowest dimension in which continuous dynamics can give rise to Hamiltonian chaos. In more complicated games with higher-dimensional state spaces we expect that chaos becomes even more common.

Many economists have noted the lack of any compelling account of how agents might learn to play a Nash equilibrium (16). Our results strongly reinforce this concern (see *Notes*), in a game simple enough for children to play. That chaos can occur in learning such a simple game indicates that one should use caution in assuming that real people will learn to play a game according to a Nash equilibrium strategy.

## Notes

### Coupled Replicator Equations.

Eqs. 1 and 2 have the same form as the multipopulation replicator equation (17) or the asymmetric game dynamics (18), which is a model of a two-population ecology. The difference from the standard single-population replicator equation (19) is in the cross term of the averaged performance. It has been known for some time that chaos occurs in single-population replicator equations (4, 8). This model is applicable to game theory in the specialized context where both players are forced to use the same strategy, for example, when two statistically identical players are repeatedly drawn from the same population. Here we study the context of actually playing a game, i.e., two fixed players who evolve their strategies independently.

### The Hamiltonian Structure.

To see the Hamiltonian structure of Eqs. **1** and **2** with **3**, it helps to transform coordinates. (**x**, **y**) exist in a six-dimensional space, constrained to a four-dimensional simplex because of the conditions that the set of probabilities **x** and **y** each sum to 1. For ɛ_{x} = −ɛ_{y} = ɛ we can make a transformation to **U** = (**u**, **v**) in **R**^{4}, with **u** = (*u*_{1}, *u*_{2}) and **v** = (*v*_{1}, *v*_{2}), such that *u*_{i} = log(*x*_{i+1}/*x*_{1}) and *v*_{i} = log(*y*_{i+1}/*y*_{1}) (*i* = 1, 2). The Hamiltonian is

\[ H = -\frac{1}{3}(u_{1}+u_{2}+v_{1}+v_{2}) + \log\left[(1+e^{u_{1}}+e^{u_{2}})(1+e^{v_{1}}+e^{v_{2}})\right]. \]

[4]

The dynamics take the Hamiltonian form (see ref. 9)

\[ \dot{U} = J\,\nabla_{U}H, \]

[5]

where the Poisson structure *J* is here given as

\[ J = \begin{bmatrix} 0 & 0 & 2\varepsilon & 3+\varepsilon \\ 0 & 0 & -3+\varepsilon & 2\varepsilon \\ -2\varepsilon & 3-\varepsilon & 0 & 0 \\ -3-\varepsilon & -2\varepsilon & 0 & 0 \end{bmatrix}. \]

[6]

We can transform to canonical coordinates obeying

\[ \dot{U}^{\prime} = S\,\nabla_{U^{\prime}}H, \qquad S = \begin{bmatrix} O & I \\ -I & O \end{bmatrix}, \]

[7]

by applying the linear transformation **U′** = *M***U** with

\[ M = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ -\frac{2\varepsilon}{3(\varepsilon^{2}+3)} & \frac{\varepsilon+3}{3(\varepsilon^{2}+3)} & 0 & 0 \\ \frac{\varepsilon-3}{3(\varepsilon^{2}+3)} & -\frac{2\varepsilon}{3(\varepsilon^{2}+3)} & 0 & 0 \end{bmatrix}. \]

[8]
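The conservation of *H* along zero-sum trajectories can be checked numerically. The sketch below evaluates Eq. **4** in the (**u**, **v**) coordinates along an orbit integrated with an assumed fixed-step RK4 scheme; because RK4 is not symplectic, a tiny secular drift in *H* from integrator error is expected, and it should shrink as the step size is reduced:

```python
import numpy as np

def rhs(z, A, B):
    # Coupled replicator field (Eqs. 1 and 2) on the stacked state z = (x, y).
    x, y = z[:3], z[3:]
    Ay, Bx = A @ y, B @ x
    return np.concatenate([x * (Ay - x @ Ay), y * (Bx - y @ Bx)])

def rk4(z, A, B, dt):
    k1 = rhs(z, A, B)
    k2 = rhs(z + 0.5 * dt * k1, A, B)
    k3 = rhs(z + 0.5 * dt * k2, A, B)
    k4 = rhs(z + dt * k3, A, B)
    return z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def hamiltonian(z):
    # H of Eq. 4 in the coordinates u_i = log(x_{i+1}/x_1), v_i = log(y_{i+1}/y_1).
    x, y = z[:3], z[3:]
    u = np.log(x[1:] / x[0])
    v = np.log(y[1:] / y[0])
    return (-(u.sum() + v.sum()) / 3.0
            + np.log((1 + np.exp(u).sum()) * (1 + np.exp(v).sum())))

eps = 0.25
A = np.array([[eps, -1, 1], [1, eps, -1], [-1, 1, eps]], float)
B = -A.T                                   # zero-sum: eps_x = -eps_y
z = np.concatenate([[0.5, 0.01, 0.49], [0.5, 0.25, 0.25]])
h0 = hamiltonian(z)
for _ in range(20000):                     # integrate to t = 200
    z = rk4(z, A, B, 0.01)
drift = abs(hamiltonian(z) - h0)
print(h0, drift)   # drift stays tiny: H is conserved along the orbit
```

The same check fails for ɛ_{x} ≠ −ɛ_{y}, consistent with the Hamiltonian structure holding only in the zero-sum case.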

### Conjecture on Learning Dynamics.

When regular motion occurs, if one player suddenly acquires the ability to extrapolate and the other does not, the first player's score will improve. If both players can extrapolate, it is not clear what will happen. Our conjecture is that sufficiently sophisticated learning algorithms will result either in convergence to the Nash equilibrium or in chaotic dynamics. In the case of chaotic dynamics, it is impossible for players to improve their performance because trajectories become effectively unforecastable, and in the case of Nash equilibrium, it is also impossible by definition.

## Acknowledgments

We thank Sam Bowles, Jim Crutchfield, Mamoru Kaneko, Paolo Patelli, Cosma Shalizi, Spyros Skouras, Isa Spoonheim, Jun Tani, and Eduardo Boolean of the World Rock–Paper–Scissors Society for useful discussions. This work was supported by the Special Postdoctoral Researchers Program at The Institute of Physical and Chemical Research (Japan); and by grants from the McKinsey Corporation, Bob Maxfield, Credit Suisse, and Bill Miller.

## References

1. L Shapley, *Ann Math Studies* **5**, 1–28 (1964).
2. S Cowen, Doctoral Dissertation (Univ. of California, Berkeley, 1992).
3. E Akiyama, K Kaneko, *Physica* **D147**, 221–258 (2000).
4. B Skyrms, in *The Dynamics of Norms*, eds C Bicchieri, R Jeffrey, B Skyrms (Cambridge Univ. Press, Cambridge, U.K.), pp. 199–222 (1996).
5. P Young, D P Foster, *Proc Natl Acad Sci USA* **98**, 12848–12853 (2001).
6. I Opie, P Opie, *Children's Games in Street and Playground* (Oxford Univ. Press, Oxford, 1969).
7. M A Nowak, R M May, *Nature (London)* **359**, 826–829 (1992).
8. M A Nowak, K Sigmund, *Proc Natl Acad Sci USA* **90**, 5091–5094 (1992).
9. J Hofbauer, *J Math Biol* **34**, 675–688 (1996).
10. H Yoshida, *Phys Lett A* **150**, 262–268 (1990).
11. T Borgers, R Sarin, *J Econ Theory* **77**, 1–14 (1997).
12. A J Lichtenberg, M A Lieberman, *Regular and Stochastic Motion* (Springer, New York, 1983).
13. P Schuster, K Sigmund, J Hofbauer, R Wolff, *Biol Cybern* **40**, 1–8 (1981).
14. T Chawanya, *Prog Theor Phys* **94**, 163–179 (1995).
15. J D Farmer, J J Sidorowich, *Phys Rev Lett* **59**, 845–848 (1987).
16. D M Kreps, *Game Theory and Economic Modelling* (Oxford Univ. Press, Oxford, 1990).
17. P D Taylor, *J Appl Probability* **16**, 76–83 (1979).
18. J Hofbauer, K Sigmund, *The Theory of Evolution and Dynamical Systems* (Cambridge Univ. Press, Cambridge, U.K., 1988).
19. P D Taylor, L B Jonker, *Math Biosci* **40**, 145–156 (1978).

## Information & Authors


#### Copyright

Copyright © 2002, The National Academy of Sciences.

#### Submission history

**Received**: December 3, 2001

**Accepted**: February 12, 2002

**Published online**: April 2, 2002

**Published in issue**: April 2, 2002

