Win–stay, lose–shift in language learning from peers

  1. Frederick A. Matsen* and
  2. Martin A. Nowak
  1. Program for Evolutionary Dynamics, Departments of Mathematics and Organismic and Evolutionary Biology, Harvard University, One Brattle Square, Cambridge, MA 02138
  1. Edited by Kenneth W. Wachter, University of California, Berkeley, CA, and approved November 12, 2004 (received for review September 7, 2004)

Abstract

Traditional language learning theory explores an idealized interaction between a teacher and a learner. The teacher provides sentences from a language, while the learner has to infer the underlying grammar. Here, we study a new approach by considering a population of individuals that learn from each other. There is no designated teacher. We are inspired by the observation that children grow up to speak the language of their peers, not of their parents. Our goal is to characterize learning strategies that generate “linguistic coherence,” which means that most individuals use the same language. We model the resulting learning dynamics as a random walk of a population on a graph. Each vertex represents a candidate language. We find that a simple strategy using a certain aspiration level with the principle of win–stay, lose–shift does extremely well: stay with your current language, if at least three others use that language; otherwise, shift to an adjacent language on the graph. This strategy guarantees linguistic coherence on all nearly regular graphs, in the relevant limit where the number of candidate languages is much greater than the population size. Moreover, for many graphs, it is sufficient to have an aspiration level demanding only two other individuals to use the same language.

Consider a population of individuals dispersed over a graph (Fig. 1). The task of these individuals is to find each other in an arbitrary node of the graph. In one time step, each individual can remain where it is or move to an adjacent node. The individuals have no global information about the graph or the configuration of the population. They cannot see each other at a distance. They can only determine whether other individuals are in the same node as themselves or in a different node. Which strategy could lead to success?

Fig. 1.

The basic operation of the model. The figure shows a possible sequence of events for a hypercube language space of 16 nodes (languages) and six aspiration learners using the k = 2 strategy. In the first frame, all individuals are using different languages. The black dots indicate that the node is occupied, and the number gives how many agents are occupying that node. The next frame shows the next time step: Because there are no languages with strictly more than two individuals at that language, each individual takes a random step along an edge of the graph. In the third frame, enough time has passed for three individuals to arrive at the same node. They will not move for the remainder of the simulation. The last frame shows the last three individuals collecting at this node. It is possible that these individuals could form a second cluster, but our analysis shows that under certain conditions a second cluster is unlikely when the graph is large.


This question about the random walk of a population on a graph is motivated by the problem of language acquisition. Learning theory is a fascinating field at the interface of mathematics, computer science, and formal linguistics (17). The original studies (8) were motivated by Chomsky's claim (9) that any given learning algorithm can only succeed on a limited set of grammars. Consider a teacher and a learner. The teacher provides sample sentences from a language. The learner tries to infer the underlying grammatical rules. It can be shown formally that no learner can solve this task if he entertains an unlimited set of hypotheses. In other words, each particular learning strategy can only learn a limited set of languages. This impossibility of tabula rasa learning has been shown in the Gold framework (1, 8), in statistical learning theory (2), in learning with queries (10), and in the framework of “probably almost correct” (4).

As outlined above, the traditional setting of formal language learning is an idealized teacher–student pair. In contrast, this paper considers a population of identical agents that are trying to find a common language. This approach is motivated by the fact that in reality children learn many aspects of language neither from their parents nor from other members of the previous generation, but from each other (11). A remarkable example of this is the development of Nicaraguan Sign Language, where a group of students developed a full language in a short span of time (12, 13). We would like to model this process and find conditions for the population to develop linguistic coherence, which means that all of the individuals end up with the same language. Convergence to linguistic coherence in a population has been studied before in evolutionary and nonevolutionary settings (1420).

Typically, in the language learning literature, one fixes a type of “learner” with certain capabilities and then establishes the class of languages this agent can learn under certain conditions. The learner in this paper is a type of “memoryless learner.” The memoryless learner has a finite set of possible grammars. Initially, the learner chooses one of these grammars as a hypothesis. The learner receives sentences. As long as the sentences are sufficiently compatible, the learner will remain with the current hypothesis. If a sentence is not compatible, the learner will jump to another hypothesis. The algorithm is memoryless, because the learner keeps no record and may return to a hypothesis that was previously rejected.

For such a learner, we need to decide on some sort of structure for the set of grammars: Which “jumps” are legal? A main concern of linguistic or cognitive research is to identify the learning algorithm that is used by human children and the underlying structure of candidate languages (universal grammar). The approach here is simple but general: we think of the set of possible grammars as a graph, where the nodes represent languages and the edges represent possible “jumps” from one candidate language to another. For example, if the set of languages is given by the complete graph, where all vertices are connected, then the learner can move from any one hypothesis to any other. If, on the other hand, the graph of languages is represented by a cycle, then an individual can move from its present language only to one of the two adjacent languages. The hypercube is a graph that lies between these two extreme cases: It has more edges than the cycle and fewer edges than the complete graph.

There is precedent for thinking of the set of languages as a graph in the “trigger learning algorithm” (21). In this framework, the set of grammatical hypotheses is generated by k binary parameters (22). At any one time, the learner holds a particular hypothesis given by a particular setting of these parameters. The learner does not change the hypothesis as long as the sentences are compatible. If a sentence arrives that is not compatible, then the learner will change one of the k parameters. In this model, the set of languages can be thought of being represented by the vertices of a k-dimensional hypercube. The process of language acquisition is described by the learner moving along the edges of this hypercube graph.

Let us consider N individuals initially randomly distributed on a language graph with n languages (nodes). Naturally, we assume n » N: There are many more possible candidate languages than people. At each time step, each individual is allowed to update its language. To do so, the individual can attempt to communicate with other individuals. The only information that can be inferred, however, is whether or not other individuals use the same language as the given individual. If an individual decides to change its current language, it can move to any adjacent language. In one version of the model, all individuals are updated synchronously. In another version, one individual is updated at each time step.

The goal is to find a single strategy that ensures (with high probability) that all individuals use the same language in the end. This state is called linguistic coherence. Thus, we want to find a mechanism that governs the random walk of a population on a graph with the aim that all individuals settle in the same place with high probability after a reasonably short time.

A natural strategy is the following: An individual stays with a certain language if some fixed number of other individuals use that same language. We call such agents “aspiration learners”: The learner aspires to use a language that is shared by a minimum number of other people. Because communication can be seen as an evolutionary game (14), the individual aspires to at least a certain level of payoff. In this case, payoff is coming from communicative success. The individual is happy when this goal is reached and stays with the current language. The individual shifts to another language when the goal is not reached. Thus, the strategy is one of win–stay, lose–shift (23).

Win–stay, lose–shift is a successful strategy in many different contexts. For example, it can outperform tit-for-tat in the repeated Prisoner's Dilemma (23, 24). Strategies using this principle do well in noisy situations and even when a 2 × 2 game is randomly assigned to the players (25). It is a fundamental and simple rule of animal behavior in numerous situations (26, 27). The aspiration level determines what the agent considers a win versus a loss. The crucial question is what is the most successful aspiration level in a given situation (28). This is also the main question that we want to address here.

In our context, the aspiration level, k, is given by the minimum number of people with the same language such that the individual will stay with this language. Once this threshold is met or exceeded, the agents form a “cluster” that does not move anymore. For example, an aspiration learner with k = 1 will keep its language if at least one other person uses this language. For a learner with k = 2, at least two other individuals are required. An obvious guarantee for success is if every individual uses an aspiration level, k, given by any integer ≥N/2. In this case, all individuals will continue to move until more than half of them use the same language. Then all others will converge to the same cluster. This strategy, however, will take a very long time to form the first cluster. Strategies with a lower aspiration level are faster to form the first cluster, but they admit the possibility that several independent clusters might be formed, which prevents convergence to complete coherence.

For this paper, we consider only memoryless learners whose set of possible grammars forms a “nearly regular” graph. If we define the “local degree” of a node to be the number of edges touching that node, we call a graph nearly regular when the maximum local degree is close to the minimum local degree. A regular graph is one in which these two numbers are equal, so all regular graphs are nearly regular. The hypercube, the complete graph, and the cycle are examples of regular graphs. Binary trees are examples of what we call nearly regular graphs. These ideas are formalized in Appendix: Proofs.

We found the following surprising result: If there are many more candidate languages than people (n » N), then k = 3 will almost always converge to complete coherence on any nearly regular graph. Moreover, for many nearly regular graphs, such as the hypercube, the complete graph, the balanced d-tree, or the torus of at least two dimensions, even k = 2 leads to complete coherence. Thus, even though each agent only demands to be able to communicate with a few others, in most cases the dynamics will lead to everyone speaking the same language. The proof of these statements are given in Appendix: Proofs.

Note that some structural assumption is needed on the class of graphs under consideration. For example, if we consider two star graphs connected by an edge, we will need a very large k to gain coherence. Our “nearly regular” assumption excludes these sorts of graphs.

We can estimate the time it takes to reach coherence. This time is the sum of two terms. The first term approximates how long it takes to form the first cluster. It is approximately the inverse of n times the probability that a cluster of size k + 1 forms at a given node, assuming that we begin with a uniform distribution. The second term describes the amount of time it takes for all individuals to be absorbed into this cluster. This time depends explicitly on the graph structure. For the complete graph we estimate the total time to coherence as Formula This is an excellent estimate, as shown in Fig. 2. Note that in the case k ≥ 2 the first term dominates, and that for large n this term is Formula. Therefore, although we may get a higher probability of complete coherence with a larger k, the individuals will take a much longer time to converge when n is large.

Fig. 2.

The convergence rates of different k values with a fixed N on the complete graph. The line represents the value given by Eq. 1, and the marks represent data from simulations. One can see that the time is approximately proportional to nk.


In Fig. 3, we show simulation results. We define the level of coherence as the probability that two randomly (with replacement) chosen individuals speak the same language (16, 29). The population size is n = 20. The expected coherence (averaged over 2,000 or more runs) is shown as a function of time. Fig. 3a shows the performance of the aspiration level k = 2 for complete graphs of sizes n = 32, 128, 512, and 2,048. Similarly, Fig. 3b shows the performance of k = 2 on hypercubes of four different sizes. Fig. 3 c and d shows the performance of k = 2 and k = 3 on cycle graphs. For large complete graphs and hypercubes, k = 2 is sufficient for the average coherence to converge to one as n increases. For cycles, k = 2 is not sufficient, but k = 3 is required.

Fig. 3.

Results from simulations of the model with a population size of n = 20. In a and b, we clearly see that k = 2 leads to coherence with high probability when n is large for both the complete graph and the hypercube. The curves follow a sigmoidal shape, which comes from the two time scales involved in the analysis. c shows that on the cycle, k = 2 does not lead to complete coherence with high probability. d shows that one gets the coherence result when k = 3 as predicted by the theory.


The sigmoidal shape of the curves suggests that, on average, one has to wait a long time for the first cluster to form, but once this cluster has appeared coherence is reached relatively quickly. Eq. 1 describes these two time scales in the case of the complete graph.

In summary, we have modeled language learning from peers as the random walk of a population on a graph. The set of learnable languages is given by a graph, where each node represents a language (grammar) and each edge indicates a possible change of hypothesis during the learning process. We consider a memoryless learner with an aspiration level. The learner is satisfied if k others use the same language. In this case, the learner will stay with its current hypothesis. If less than k others use this language, then the learner will jump to a new hypothesis. We show that k = 3 guarantees success with high probability on all nearly regular graphs. Moreover, on many of those graphs, k = 2 is sufficient and leads to coherence more quickly. Thus, a simple strategy of win–stay, lose–shift is highly successful in language learning from peers. Our approach has applications beyond language acquisition to learning in other social or game theoretic settings (30).

Acknowledgments

We thank Igor Pak and Lorens Imhof. The Program for Evolutionary Dynamics is supported by Jeffrey Epstein.

Footnotes

  • * To whom correspondence should be addressed. E-mail: matsen{at}math.harvard.edu.

  • This paper was submitted directly (Track II) to the PNAS office.

  • See Commentary on page 17885.

References

« Previous | Next Article »Table of Contents
From the Cover