Commentary

Rethinking language: How probabilities shape the words we use

Thomas L. Griffiths
  1. Department of Psychology, University of California, Berkeley, CA 94720-1650

PNAS March 8, 2011 108 (10) 3825-3826; https://doi.org/10.1073/pnas.1100760108
For correspondence: tom_griffiths@berkeley.edu

If you think about the classes you expect to take when studying linguistics in graduate school, probability theory is unlikely to be on the list. However, recent work in linguistics and cognitive science has begun to show that probability theory, combined with the methods of computer science and statistics, is surprisingly effective in explaining aspects of how people produce and interpret sentences (1–3), how language might be learned (4–6), and how words change over time (7, 8). The paper by Piantadosi et al. (9) that appears in PNAS adds to this literature, using probabilistic models estimated from large databases to update a classic result about the length of words.

Quantitative Analysis of Language

The classic finding that Piantadosi et al. (9) revisit is Zipf's observation that the length of words is inversely related to their frequency: Words that are used often, such as “the,” tend to be short (10). This was one of several results obtained through quantitative analysis of the statistics of language, of which perhaps the most famous is the power-law distribution of word frequencies (known as “Zipf's Law”). Zipf explained these regularities by appealing to a “Principle of Least Effort” (11), which is sufficiently provocative as to have made its way into Pynchon's Gravity's Rainbow (12). For the relationship between length and frequency, the idea is that producing longer words requires more effort, so languages should be structured to use such words infrequently. This work has been followed by detailed quantitative studies of the distributions of word frequencies and word lengths (13, 14).
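
For readers who want to see the regularity for themselves, the following is a minimal Python sketch of the length-frequency check, under the assumption of a plain-text corpus in a file named corpus.txt (a hypothetical placeholder); Zipf's relationship predicts a negative correlation between log frequency and word length.

    # A rough check of Zipf's length-frequency relationship (illustrative only).
    # "corpus.txt" is a hypothetical placeholder for any large plain-text file.
    import re
    from collections import Counter
    from math import log

    with open("corpus.txt", encoding="utf-8") as f:
        words = re.findall(r"[a-z]+", f.read().lower())
    freq = Counter(words)

    def pearson(xs, ys):
        # Plain Pearson correlation, written out to keep the sketch dependency-free.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    types = list(freq)
    r = pearson([log(freq[w]) for w in types], [len(w) for w in types])
    print(f"correlation(log frequency, length) = {r:.3f}")  # Zipf predicts r < 0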

Zipf's analyses were done at a time when mathematical ideas were beginning to be applied to language, including probability theory. Three decades earlier, Markov had introduced the idea of modeling a sequence of random variables by assuming that each variable depends only on the preceding variable (a Markov chain), using the example of modeling sequences of letters (15). A simple probabilistic model for a sequence of letters might be to choose each letter independently, with probability proportional to its frequency in the language, like drawing a set of tiles in Scrabble. Unfortunately, as Scrabble players know all too well, putting down these tiles in sequence is unlikely to make a word in English. Imagine if you took tiles from a bag where the probabilities were determined by how often each letter followed the last letter you drew: no more nasty sequences of all vowels or all consonants! A decade after Zipf published his analyses, Shannon (16) used these Markov chains to predict sequences of words, observing that a reasonable approximation to English could be produced if each word was chosen based not just on the previous word but on the last few words, and introduced a mathematical framework for analyzing the information provided by a word. In this framework, the informativeness of a word is given by the negative logarithm of its probability, matching the intuition that less probable words carry more information.
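
The two ideas in this paragraph, a Markov chain over letters and Shannon's negative-log-probability measure of information, can be illustrated with a few lines of Python. This is a toy sketch under obvious simplifications: the training string is a stand-in for a real corpus, and a first-order (bigram) model is used rather than the longer contexts Shannon considered.

    # Toy sketch: a first-order Markov chain over letters and Shannon's measure
    # of the information carried by an outcome (-log2 of its probability).
    # The training string is a stand-in for a real corpus.
    import random
    from collections import Counter, defaultdict
    from math import log2

    text = "the theory of probability applied to the letters of the language"

    # Estimate P(next letter | current letter) from bigram counts.
    bigrams = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        bigrams[a][b] += 1

    def sample_next(letter):
        counts = bigrams[letter]
        return random.choices(list(counts.keys()), weights=list(counts.values()))[0]

    # Generate a short sequence: far from English, but no all-vowel tile racks.
    seq = "t"
    for _ in range(20):
        seq += sample_next(seq[-1])
    print(seq)

    def information(prev, nxt):
        # Bits contributed by seeing `nxt` after `prev` under the bigram model.
        p = bigrams[prev][nxt] / sum(bigrams[prev].values())
        return -log2(p)

    print(information("t", "h"))  # a common transition carries few bits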

Research applying probability theory to language slowed with the rise of Chomskyan linguistics. Chomsky (17) argued convincingly against the ability of Markov chains to capture the structure of sentences. His famous sentence “Colorless green ideas sleep furiously” was constructed, in part, to illustrate that Markov chains cannot be used to determine whether a sentence is grammatical: Each pair of words in this sequence is unlikely to occur together, so its probability should be near zero even though most speakers of English would agree it is grammatical (if a little unusual). This argument against a specific probabilistic model was taken to refute more generally the relevance of probability theory to understanding language, with formal linguistics turning to a mathematical framework that had more in common with logic.

The return of probability theory to linguistics came via work on the more applied problem of making computers process human languages. Probability theory can be used to solve two kinds of problems: making predictions and making inferences. Both are relevant to processing language. If you want to do a good job of interpreting human speech, it helps to have a good model of which sequences of words you are likely to hear. Understanding sentences and learning language are both problems of inductive inference, requiring a leap from the words we hear to an underlying structure, and probability theory (particularly Bayesian inference) can be used to solve these problems. Computational linguists discovered that ideas from probability theory could improve algorithms for speech recognition (18), for identifying the roles that words play in sentences (19), and for inferring the structure of those sentences (20); it is now difficult to understand most papers at a computational linguistics conference without a good education in statistics.

The Rise of Probability

Probability theory has begun to migrate from computational linguistics into other areas of language research. The problems posed by colorless green ideas can be circumvented by using more sophisticated probabilistic models than Markov chains (21), and theorists are beginning to ask whether probabilities appear in linguistic representations (22, 23). Psycholinguists have begun to examine how the predictability of a word influences its production and processing (1–3). Language learning researchers have used probability theory as the basis for theoretical arguments that language can be learned (4), as well as in experiments and models exploring the acquisition of its components (5, 6). Research on how languages change over time now has access to reconstructions of the relationships between languages (and the words themselves) produced using probability theory (7, 8). Supporting these probabilistic models is the availability of large amounts of linguistic data, through databases that are larger and easier to access than ever before.

Piantadosi et al. (9) draw on these resources to conduct a deeper analysis of the factors influencing the length of words. Their basic empirical result is a nice extension of Zipf's original observation, showing that the length of words is not just related to their frequency but to their predictability in context. By considering the frequency of a word, Zipf measured how predictable that word is if you know nothing else about the words you are likely to encounter. However, Markov chains can be used to compute how probable each instance of a word is based on the last few words, providing a way to measure the predictability of a word in its context. This makes it possible to calculate how much information is contributed by that word, using the metric introduced by Shannon (16). If a word is easy to predict based on context, it contributes little information. Piantadosi et al. (9) find that the average information contributed by a word is better correlated with its length than is its overall frequency, suggesting that the predictability of a word in context is what matters in determining how long that word should be.
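
As a rough illustration of the quantity being compared with word length, the sketch below estimates the average information a word contributes in context using a bigram model with add-one smoothing; the toy sentence and the smoothing scheme are assumptions made for illustration, not the authors' pipeline, which relied on much larger n-gram databases.

    # Illustrative estimate of the average information a word contributes in
    # context, using a bigram model with add-one smoothing. Not the authors'
    # pipeline, which used much larger n-gram databases.
    from collections import Counter, defaultdict
    from math import log2

    def mean_information_in_context(tokens):
        bigrams = defaultdict(Counter)
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[prev][word] += 1
        vocab = len(set(tokens))

        info = defaultdict(list)
        for prev, word in zip(tokens, tokens[1:]):
            # Add-one smoothed P(word | prev), then its negative log (in bits).
            p = (bigrams[prev][word] + 1) / (sum(bigrams[prev].values()) + vocab)
            info[word].append(-log2(p))
        return {w: sum(bits) / len(bits) for w, bits in info.items()}

    tokens = "the cat sat on the mat and the cat saw the dog".split()
    for word, bits in sorted(mean_information_in_context(tokens).items()):
        print(f"{word:>4}  length={len(word)}  mean bits in context={bits:.2f}")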

This refinement of our understanding of the relationship between the length of a word and its probability is bolstered by a theoretical framework that adds precision to Zipf's Principle of Least Effort and connects the relationship between word length and probability to an idea that has already proven valuable in other areas of psycholinguistics. This framework is based on the “Uniform Information Density” hypothesis: the idea that human languages follow the optimal strategy for communicating information through a noisy channel, by transmitting information at a constant rate that matches the capacity of the channel (2, 24–27). A crude analogy might be to imagine communication in terms of pumping oil along a fragile pipe. If you pump too slowly, it takes too long; pumping too quickly risks breaking the pipe; and varying the rate of flow is either inefficient or dangerous. The best strategy is to pump at a constant level set by the capacity of the pipe. In the case of language, we are pumping words at one another; the time it takes to send a word along the pipe is determined by its length, and the capacity of the pipe is determined by the rate at which we can process linguistic information. The best solution is to send information at a constant rate, which means that less predictable words, those that carry more information, should be longer.
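
One hedged way to make the pipe analogy concrete is a toy calculation in which the cost of processing a word grows faster than linearly with the bits it carries; the numbers and the quadratic cost function below are invented for illustration and are not taken from the hypothesis itself.

    # Invented numbers: two ways of spreading the same 12 bits over four words.
    # If the cost of processing a word grows faster than linearly with the bits
    # it carries (quadratic here, purely for illustration), the flat profile wins.
    def total_cost(surprisals, cost=lambda bits: bits ** 2):
        return sum(cost(b) for b in surprisals)

    uniform = [3.0, 3.0, 3.0, 3.0]  # information spread evenly
    bursty = [1.0, 1.0, 1.0, 9.0]   # same total, concentrated on one word

    print(total_cost(uniform))  # 36.0
    print(total_cost(bursty))   # 84.0: same information, much higher peak cost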

The Uniform Information Density hypothesis shares with the Principle of Least Effort the notion of optimization, making the most of a limited resource, but gives this notion a formal precision that leads to a variety of other interesting predictions. For example, including an additional unnecessary word, such as “that,” in a sentence (e.g., “How big is the family that you cook for?”) potentially dilutes the information density of the sentence (specifically, the information associated with the clause beginning with “you”). The information density will thus become more uniform if such words are introduced to sentences that carry more information, and people's word choices seem to follow this prediction (2). Explanations framed in terms of information density rather than least effort also make it clearer that we should imagine language as being tailored to fit human minds rather than human laziness.

Providing a formal framework connecting word length and predictability opens the door to further analyses, using more sophisticated probabilistic models and considering other statistics that might be relevant to understanding the lengths of words. There is still a great deal of variance in the length of words that is not explained by their predictability. However, the deeper message behind the results of Piantadosi et al. (9) is that probability and information theory can help us rethink the way that language works and how it should be studied. Probabilities can augment the classic rule-based representations that are widely used in linguistics, and information theory makes it possible to formalize ideas like the Principle of Least Effort in a way that leads to unique predictions about language. Then again, perhaps judgment should be reserved until Uniform Information Density makes its own appearance in literary fiction.

Footnotes

  • Author contributions: T.L.G. wrote the paper.

  • The author declares no conflict of interest.

  • See companion article on page 3526 in issue 9 of volume 108.

References

  1. Hale J (2001) A probabilistic Earley parser as a psycholinguistic model. Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (Association for Computational Linguistics, Stroudsburg, PA), pp 159–166.
  2. Levy R, Jaeger TF (2007) Speakers optimize information density through syntactic reduction. Advances in Neural Information Processing Systems, eds Scholkopf B, Platt J, Hoffman T (MIT Press, Cambridge, MA), Vol 19, pp 849–856.
  3. Padó U, Crocker MW, Keller F (2009) A probabilistic model of semantic plausibility in sentence processing. Cogn Sci 33:794–838.
  4. Chater N, Vitányi P (2007) Ideal learning of natural language: Positive results about learning from positive evidence. J Math Psychol 51:135–163.
  5. Saffran JR, Aslin RN, Newport EL (1996) Statistical learning by 8-month-old infants. Science 274:1926–1928.
  6. Xu F, Tenenbaum JB (2007) Word learning as Bayesian inference. Psychol Rev 114:245–272.
  7. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426:435–439.
  8. Bouchard-Côté A, Griffiths TL, Klein D (2009) Improved reconstruction of protolanguage word forms. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL09) (Association for Computational Linguistics, Stroudsburg, PA), pp 65–73.
  9. Piantadosi ST, Tily H, Gibson E (2011) Word lengths are optimized for efficient communication. Proc Natl Acad Sci USA 108:3526–3529.
  10. Zipf G (1936) The Psychobiology of Language (Routledge, London).
  11. Zipf G (1949) Human Behavior and the Principle of Least Effort (Addison-Wesley, New York).
  12. Pynchon T (1973) Gravity's Rainbow (Viking, New York).
  13. Grzybek P (2007) Contributions to the Science of Text and Language: Word Length Studies and Related Issues (Springer, Dordrecht, The Netherlands).
  14. Baayen H (2002) Word Frequency Distributions (Kluwer, Dordrecht, The Netherlands).
  15. Markov AA (1913) An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains (in Russian); trans Custance G, Link D (2006) Science in Context 19:591–600.
  16. Shannon CE (1948) A mathematical theory of communication. Bell System Technical Journal 27:379–423, 623–656.
  17. Chomsky N (1957) Syntactic Structures (Mouton, The Hague).
  18. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286.
  19. Merialdo B (1994) Tagging English text with a probabilistic model. Comput Linguist 20:155–172.
  20. Collins M (1996) A new statistical parser based on bigram lexical dependencies. Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics, Stroudsburg, PA), pp 184–191.
  21. Pereira F (2000) Formal grammar and information theory: Together again? Philos Trans R Soc London 358:1239–1253.
  22. Gahl S, Garnsey S (2004) Knowledge of grammar, knowledge of usage: Syntactic probabilities affect pronunciation variation. Language 80:748–775.
  23. Bod R, Hay J, Jannedy S (2003) Probabilistic Linguistics (MIT Press, Cambridge, MA).
  24. Aylett M, Turk A (2004) The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Lang Speech 47:31–56.
  25. Levy R (2005) Probabilistic models of word order and syntactic discontinuity. PhD thesis (Stanford University, Palo Alto, CA).
  26. Jaeger T (2006) Redundancy and syntactic reduction in spontaneous speech. PhD thesis (Stanford University, Palo Alto, CA).
  27. Genzel D, Charniak E (2002) Entropy rate constancy in text. Proceedings of the Fortieth Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics, Stroudsburg, PA), pp 199–206.