Proceedings of the National Academy of Sciences

Published online on February 10, 2004, 10.1073/pnas.0307752101
PNAS | April 6, 2004 | vol. 101 | Suppl. 1 | 5228-5235



COLLOQUIUM PAPERS
Finding scientific topics

Thomas L. Griffiths*†‡ and Mark Steyvers§

*Department of Psychology, Stanford University, Stanford, CA 94305; †Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139-4307; and §Department of Cognitive Sciences, University of California, Irvine, CA 92697

A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
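The generative process summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not state the prior on the topic distribution, so the symmetric Dirichlet with parameter `alpha` is an assumption (it follows the model of Blei, Ng, and Jordan cited above), and the names `generate_document`, `phi`, `theta`, and `z` are hypothetical.

```python
import numpy as np

def generate_document(n_words, phi, alpha, rng):
    """Sample one document from the generative model sketched above.

    phi   : (T, W) array; row j is topic j's distribution over W words.
    alpha : symmetric Dirichlet parameter for the topic mixture (assumed).
    """
    n_topics, vocab_size = phi.shape
    # Step 1: choose this document's distribution over topics.
    theta = rng.dirichlet(np.full(n_topics, alpha))
    # Step 2: for each word position, select a topic according to theta ...
    z = rng.choice(n_topics, size=n_words, p=theta)
    # ... then draw the word from the selected topic's word distribution.
    words = np.array([rng.choice(vocab_size, p=phi[j]) for j in z])
    return theta, z, words

rng = np.random.default_rng(0)
# Three hypothetical topics over a six-word vocabulary.
phi = rng.dirichlet(np.full(6, 0.5), size=3)
theta, z, words = generate_document(20, phi, alpha=1.0, rng=rng)
```

Inference runs in the opposite direction: given only `words`, the Markov chain Monte Carlo algorithm estimates the unobserved topic assignments `z`.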


This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "Mapping Knowledge Domains," held May 9-11, 2003, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.

These estimates cannot be combined across samples for any analysis that relies on the content of specific topics. This issue arises because of a lack of identifiability. Because mixtures of topics are used to form documents, the probability distribution over words implied by the model is unaffected by permutations of the indices of the topics. Consequently, there need be no correspondence between individual topics across samples: the fact that two topics share index j in two samples is no reason to expect that similar words were assigned to those topics in those samples. However, statistics that are insensitive to permutation of the underlying topics can be computed by aggregating across samples.
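The identifiability point can be checked numerically. The sketch below uses made-up parameters (`phi`, `theta`, and the permutation `perm` are all hypothetical) to show that relabeling the topics leaves the implied distribution over words unchanged, even though the topic stored at a given index is different.

```python
import numpy as np

rng = np.random.default_rng(1)
n_topics, vocab_size = 4, 8
phi = rng.dirichlet(np.ones(vocab_size), size=n_topics)  # topic-word distributions
theta = rng.dirichlet(np.ones(n_topics))                 # one document's topic mixture

# Relabel the topics: new index k holds what was topic perm[k].
perm = np.array([2, 0, 3, 1])
phi_perm, theta_perm = phi[perm], theta[perm]

# P(w) = sum_j P(w | z = j) P(z = j), computed under both labelings.
p_words = theta @ phi
p_words_perm = theta_perm @ phi_perm

same_word_dist = np.allclose(p_words, p_words_perm)  # invariant to relabeling
same_topic_0 = np.allclose(phi[0], phi_perm[0])      # index 0 now holds a different topic
```

Here `same_word_dist` is true while `same_topic_0` is false, which is why averaging the word distribution of "topic j" across samples is not meaningful, whereas a quantity such as `p_words` can safely be aggregated.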

‡To whom correspondence should be addressed. E-mail: gruffydd@psych.stanford.edu.


