COLLOQUIUM PAPERS
Finding scientific topics


*Department of Psychology, Stanford University, Stanford, CA 94305;
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139-4307; and
Department of Cognitive Sciences, University of California, Irvine, CA 92697
A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
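The generative process described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes, following Blei, Ng, and Jordan, that each document's distribution over topics is drawn from a symmetric Dirichlet prior, and the toy topics in `TOPICS`, the hyperparameter value `alpha=0.5`, and all function names are hypothetical choices made for the example.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) over k outcomes
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document: choose a distribution over topics (theta),
    then choose each word from a topic selected according to theta."""
    n_topics = len(topics)
    vocab_size = len(topics[0])
    theta = sample_dirichlet(alpha, n_topics, rng)
    assignments, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]      # pick a topic
        w = rng.choices(range(vocab_size), weights=topics[z])[0]  # pick a word from it
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

# Hypothetical toy topics: each row is a distribution over a 4-word vocabulary.
TOPICS = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 uses only words 0 and 1
    [0.0, 0.0, 0.5, 0.5],  # topic 1 uses only words 2 and 3
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
doc_theta, doc_topics, doc_words = generate_document(TOPICS, alpha=0.5, n_words=10, rng=rng)
print(doc_theta, doc_topics, doc_words)
```

Inference runs this process in reverse: given only the observed words, the Markov chain Monte Carlo algorithm samples the hidden topic assignment of each word.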
¶ These estimates cannot be combined across samples for any analysis that relies on the content of specific topics. This issue arises because of a lack of identifiability. Because mixtures of topics are used to form documents, the probability distribution over words implied by the model is unaffected by permutations of the indices of the topics. Consequently, individual topics need not correspond across samples; that two topics share index j in two samples is no reason to expect that similar words were assigned to those topics in those samples. However, statistics insensitive to permutation of the underlying topics can be computed by aggregating across samples.
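The identifiability point can be checked numerically. The sketch below uses hypothetical numbers throughout (the mixture weights `theta`, the three toy `topics`, and the permutation `perm` are all made up for illustration). It relabels the topics and compares the distribution over words implied for a document before and after.

```python
import math

def word_distribution(theta, topics):
    """P(w) = sum over topics j of theta[j] * topics[j][w]."""
    vocab_size = len(topics[0])
    return [
        sum(theta[j] * topics[j][w] for j in range(len(topics)))
        for w in range(vocab_size)
    ]

# Hypothetical document-level topic weights and topic-word distributions.
theta = [0.7, 0.2, 0.1]
topics = [
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]

# Relabel the topics: new index i holds what used to be topic perm[i].
perm = [2, 0, 1]
theta_perm = [theta[j] for j in perm]
topics_perm = [topics[j] for j in perm]

original = word_distribution(theta, topics)
relabeled = word_distribution(theta_perm, topics_perm)

# The implied distribution over words is the same (up to rounding) ...
same_mixture = all(math.isclose(a, b) for a, b in zip(original, relabeled))
# ... yet index 0 now refers to a different topic.
same_topic_at_index_0 = topics[0] == topics_perm[0]
print(same_mixture, same_topic_at_index_0)
```

Averaging the word distribution stored at a fixed index across samples would therefore mix unrelated topics, whereas a quantity such as the per-document word distribution computed here is insensitive to relabeling and can be aggregated safely.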
To whom correspondence should be addressed. E-mail: gruffydd{at}psych.stanford.edu.