COLLOQUIUM PAPERS
Mixed-membership models of scientific publications


*Department of Statistics, School of Social Work, and Center for Statistics and the Social Sciences, University of Washington, Seattle, WA 98195; and Department of Statistics, ¶Computer Science Department, and Center for Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA 15213
| Abstract |
|---|
PNAS is one of the world's most cited multidisciplinary scientific journals. The PNAS official classification structure of subjects is reflected in topic labels submitted by the authors of articles, largely related to traditionally established disciplines. These include broad field classifications into physical sciences, biological sciences, social sciences, and further subtopic classifications within the fields. Focusing on biological sciences, we explore an internal soft-classification structure of articles based only on semantic decompositions of abstracts and bibliographies and compare it with the formal discipline classifications. Our model assumes that there is a fixed number of internal categories, each characterized by multinomial distributions over words (in abstracts) and references (in bibliographies). Soft classification for each article is based on proportions of the article's content coming from each category. We discuss the appropriateness of the model for the PNAS database as well as other features of the data relevant to soft classification.
The Proceedings is there to help bring new ideas promptly into play. New ideas may not always be right, but their prominent presence can lead to correction. We must be careful not to censor even those ideas which seem to be off beat.
Saunders MacLane (1)
Are there internal categories of articles in PNAS that we can obtain empirically with statistical data-mining tools based only on semantic decompositions of words and references used? Can we identify MacLane's "off-beat" but potentially path-breaking PNAS articles by using these internal categories? Do these empirically defined categories correspond in some natural way to the classification by field used to organize the articles for publication, or does PNAS publish substantial numbers of interdisciplinary articles that transcend these disciplinary boundaries? These are examples of questions that our contribution to the mapping of knowledge domains represented by PNAS explores.
Mathematical and statistical techniques have been developed for analyzing complex data in ways that could reveal underlying data patterns through some form of classification. Computational advances have made some of these techniques extremely popular in recent years. For example, 2 of the 10 most cited articles from 1997-2001 PNAS publications are on applications of clustering for gene-expression patterns (2, 3). The traditional assumption in most methods that aim to discover knowledge in underlying data patterns has been that each subject (object or individual) from the population of interest inherently belongs to only one of the underlying subpopulations (clusters, classes, aspects, or pure type categories). This implies that a subject shares all its attributes, usually with some degree of uncertainty, with the subpopulation to which it belongs. Given that a relatively small number of subpopulations is often necessary for a meaningful interpretation of the underlying patterns, many data collections do not conform with the traditional assumption. Subjects in such populations may combine attributes from several subpopulations simultaneously. In other words, they may have a mixed collection of attributes originating from more than one subpopulation.
Several different disciplines have developed approaches that have a common statistical structure that we refer to as mixed membership. In genetics, mixed-membership models can account for the fact that individual genotypes may come from different subpopulations according to (unknown) proportions of an individual's ancestry. Rosenberg et al. (4) use such a model to analyze genetic samples from 52 human populations around the globe, identifying major genetic clusters without using the geographic information about the origins of individuals. In the social sciences, such models are natural, because members of a society can exhibit mixed membership with respect to the underlying social or health groups for a particular problem being studied. Hence, individual responses to a series of questions may have mixed origins. Woodbury et al. (5) use this idea to develop medical classification. In text analysis and information retrieval, mixed-membership models have been used to account for different topical aspects of individual documents.
In the next section, we describe a class of mixed-membership models that unifies existing special cases (6). We then explain how this class of models can be adapted to analyze both the semantic content of a document and its citations of other publications. We fit this document-oriented mixed-membership model to a subcollection of the PNAS database supplied to the participants in the Arthur M. Sackler Colloquium Mapping Knowledge Domains. We focus in our analysis on a high-level description of the fields in biological sciences in terms of a small number of extreme or basis categories. Griffiths and Steyvers (7) use a related version of the model for abstracts only and attempt a finer level of description.
| Mixed-Membership Models |
|---|
Population Level. Assume there are K original or basis subpopulations in the populations of interest. For each subpopulation k, denote by $f(x_j|\theta_{kj})$ the probability distribution for response variable j, where $\theta_{kj}$ is a vector of parameters. Assume that, within a subpopulation, responses to observed variables are independent.
Subject Level. For each subject, membership vector $\lambda = (\lambda_1, \ldots, \lambda_K)$ provides the degrees of a subject's membership in each of the subpopulations. The probability distribution of observed responses xj for each subject is defined fully by the conditional probability $\Pr(x_j|\lambda) = \sum_{k} \lambda_k f(x_j|\theta_{kj})$ and the assumption that response variables xj are independent, conditional on membership scores. In addition, given the membership scores, observed responses from different subjects are independent.
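Latent-Variable Level. With respect to the latent variables, one could assume that they are either fixed unknown constants or random realizations from some underlying distribution.
If the membership scores $\lambda$ are fixed but unknown, the conditional probability of observing xj, given the parameters $\theta$ and membership scores, is
$$\Pr(x_j|\lambda; \theta) = \sum_{k=1}^{K} \lambda_k f(x_j|\theta_{kj}). \tag{1}$$
If the membership scores $\lambda$ are realizations of latent variables from some distribution $D_\alpha$, parameterized by vector $\alpha$, then the probability of observing xj, given the parameters, is
$$\Pr(x_j|\alpha, \theta) = \int \left( \sum_{k=1}^{K} \lambda_k f(x_j|\theta_{kj}) \right) dD_\alpha(\lambda). \tag{2}$$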
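Sampling Scheme. Suppose R independent replications of J distinct characteristics are observed for one subject, $\{x_1^{(r)}, \ldots, x_J^{(r)}\}_{r=1}^{R}$. Then, if the membership scores are treated as realizations from distribution $D_\alpha$, the conditional probability is
$$\Pr\left(\{x_1^{(r)}, \ldots, x_J^{(r)}\}_{r=1}^{R} \,\middle|\, \alpha, \theta\right) = \int \left( \prod_{j=1}^{J} \prod_{r=1}^{R} \sum_{k=1}^{K} \lambda_k f(x_j^{(r)}|\theta_{kj}) \right) dD_\alpha(\lambda). \tag{3}$$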
When the latent variables are treated as unknown constants, the conditional probability for observing R replications of J variables can be derived analogously. In general, the number of observed characteristics J does not need to be the same across subjects, and the number of replications R does not need to be the same across observed characteristics.
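As a rough illustration of the fixed-membership case, the following Python sketch evaluates the convex-combination probability for a single categorical response and multiplies it over variables and replications. The function names, the nested-list layout of `theta`, and the toy numbers are assumptions made for this example only; they do not come from the article.

```python
from math import prod

def response_prob(x, lam, theta_j):
    """Pr(x_j = x | lambda) = sum_k lambda_k * f(x | theta_kj) for a categorical response.

    theta_j[k][x] is the probability of outcome x under subpopulation k.
    """
    return sum(l * theta_k[x] for l, theta_k in zip(lam, theta_j))

def joint_prob(responses, lam, theta):
    """Probability of all replications of all variables, given fixed lambda.

    responses[j] lists the observed replications of variable j, so the number
    of replications may differ across variables; theta[j][k][x] = f(x | theta_kj).
    """
    return prod(
        response_prob(x, lam, theta[j])
        for j, reps in enumerate(responses)
        for x in reps
    )

# Hypothetical toy setting: K = 2 subpopulations, J = 2 binary variables.
theta = [
    [[0.9, 0.1], [0.2, 0.8]],  # variable 0: outcome distributions for k = 0, 1
    [[0.5, 0.5], [0.1, 0.9]],  # variable 1
]
lam = [0.25, 0.75]             # membership scores, summing to 1
responses = [[0, 1], [1]]      # two replications of variable 0, one of variable 1

p_first = response_prob(0, lam, theta[0])
p_joint = joint_prob(responses, lam, theta)
```

Treating the membership scores as random would additionally require integrating `joint_prob` over their distribution, which this sketch does not attempt.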
One can derive examples of mixed-membership models from this general setup by specifying different choices of J and R and different latent-variable assumptions. Thus, the "grade-of-membership" model of Manton et al. (8) assumes that polytomous responses are observed to J survey questions without replications and uses the fixed-effects assumption for the membership scores. Potthoff et al. (9) use a variation of the grade-of-membership model by treating the membership scores as Dirichlet random variables; the authors refer to the resulting model as "Dirichlet generalization of latent class models." Erosheva (6) provides a formal latent-class representation for the grade-of-membership model approach. In genetics, Pritchard et al. (10) use a clustering model with admixture. For diploid individuals, the clustering model assumes that R = 2 replications (genotypes) are observed at J distinct locations (loci), treating the proportions of a subject's genome that originated from each of the basis subpopulations as random Dirichlet realizations. Variations of mixed-membership models for text documents called "probabilistic latent semantic analysis" (11) and "latent Dirichlet allocation" (12) both assume that a single characteristic (word) is observed a number of times for each document, but the former model considers the membership scores as fixed unknown constants, whereas the latter treats them as random Dirichlet realizations.
The mixed-membership model framework presented above unifies several specialized models that have been developed independently in the social sciences, genetics, and text-mining applications. In the text-mining area, initial work by Hofmann (11) on probabilistic latent semantic analysis was followed by the work of Blei et al. (12), who proposed a Dirichlet generating distribution for the membership scores and the use of variational methods to estimate the latent Dirichlet allocation model parameters. Minka and Lafferty (13) developed a more accurate approximation method for this model.
A natural extension of the original analyses in the text-mining area that have been based on a single source is to combine information from multiple sources. Cohn and Hofmann (14) propose a probabilistic model of document content and hypertext connectivity for text documents by considering links (or references) in addition to words, thus essentially combining two distinct characteristics; they treat the membership scores as fixed. Following Cohn and Hofmann, we adopt a mixed-membership model for words and references in journal publications but treat the membership scores as random Dirichlet realizations. Barnard et al. (15) develop similar and alternative approaches for combining different sources of information.
| Mixed-Membership Models for Documents |
|---|
We represent a document as $d = \left(\{x_1^{(r_1)}\}_{r_1=1}^{R_1}, \{x_2^{(r_2)}\}_{r_2=1}^{R_2}\right)$, where $x_1^{(r_1)}$ is a word (w) in the abstract and $x_2^{(r_2)}$ is a reference (r) in the bibliography, rj = 1,..., Rj. By adopting the "bag-of-words" assumption, we treat the words in each abstract as independent replications of the first observed characteristic (word). Similarly, under the assumption of a "bag of references," we treat references as independent replications of the second observed characteristic (reference). Thus, the representation of a document consists of word counts n(w, d) (the number of times word w appears in document d) and reference counts n(r, d) (1 if the bibliography of d contains a reference to r, and 0 otherwise). In this context, subpopulations refer to topical aspects.
The parameters of our model are: Dirichlet (hyper)parameters $\alpha_1, \ldots, \alpha_K$ for the generating distribution of the membership scores and aspect multinomial probabilities for words $\theta_{1k}(w) = p(w|k)$ and references $\theta_{2k}(r) = q(r|k)$, k = 1, 2,..., K.
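In the generative model, documents d are sampled according to the following sequence,
$$\lambda \sim \text{Dirichlet}(\alpha), \tag{4}$$
$$x_1^{(r_1)} \sim \text{Multinomial}(p_\lambda), \quad \text{where } p_\lambda = \sum_{k=1}^{K} \lambda_k\, p(\cdot|k), \tag{5}$$
$$x_2^{(r_2)} \sim \text{Multinomial}(q_\lambda), \quad \text{where } q_\lambda = \sum_{k=1}^{K} \lambda_k\, q(\cdot|k), \tag{6}$$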
where $\sum_w \theta_{1k}(w) = 1$ and $\sum_r \theta_{2k}(r) = 1$, k = 1,..., K. Because distributions of words and references in a document are convex combinations of the distributions of the aspects, the aspects can be thought of as extreme or basis categories for a collection of documents. The sampling of words and references in the model can be interpreted also as a latent classification process in which an aspect of origin is drawn first for each word and for each reference in a document, according to a multinomial distribution parameterized by the document-specific membership scores $\lambda$, and words and references then are generated from corresponding distributions of the aspects of origin (6). Rather than a mixture of K latent classes, the model can be thought of as a "simplicial mixture" (13) because the word and reference probabilities range over a simplex with corners $\theta_{1k}$ and $\theta_{2k}$, respectively.
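The generative sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the toy aspect distributions `p` and `q`, and the Dirichlet parameters are hypothetical, and references are drawn with replacement for simplicity, whereas the reference counts in the article are 0/1 indicators.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw lambda ~ Dirichlet(alpha) by normalizing independent gamma draws."""
    while True:
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        if total > 0.0:  # redraw in the rare case that every draw underflows to 0
            return [x / total for x in g]

def mix(lam, aspect_dists):
    """Convex combination sum_k lam[k] * aspect_dists[k] over a shared index set."""
    size = len(aspect_dists[0])
    return [sum(l * dist[i] for l, dist in zip(lam, aspect_dists)) for i in range(size)]

def sample_document(alpha, p, q, n_words, n_refs, rng):
    """Sample membership scores, then word and reference indices (Eqs. 4-6)."""
    lam = sample_dirichlet(alpha, rng)
    p_lam = mix(lam, p)  # document-specific word distribution
    q_lam = mix(lam, q)  # document-specific reference distribution
    words = rng.choices(range(len(p_lam)), weights=p_lam, k=n_words)
    refs = rng.choices(range(len(q_lam)), weights=q_lam, k=n_refs)
    return lam, words, refs

# Hypothetical toy values: K = 2 aspects, 3 words, 2 references.
alpha = [0.5, 0.5]
p = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
q = [[0.9, 0.1], [0.2, 0.8]]
rng = random.Random(0)
lam, words, refs = sample_document(alpha, p, q, n_words=6, n_refs=3, rng=rng)
```

Because `p_lam` and `q_lam` are convex combinations of the aspect distributions, they always lie inside the simplex whose corners are the aspects, which is the "simplicial mixture" view described above.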
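The likelihood function is thus
$$p(d|\alpha, \theta) = \int \prod_{r_1=1}^{R_1} \left( \sum_{k=1}^{K} \lambda_k\, p(x_1^{(r_1)}|k) \right) \prod_{r_2=1}^{R_2} \left( \sum_{k=1}^{K} \lambda_k\, q(x_2^{(r_2)}|k) \right) d\,\text{Dir}(\lambda; \alpha) \tag{7}$$
$$= \int \prod_{w} \left( \sum_{k=1}^{K} \lambda_k\, p(w|k) \right)^{n(w,d)} \prod_{r} \left( \sum_{k=1}^{K} \lambda_k\, q(r|k) \right)^{n(r,d)} d\,\text{Dir}(\lambda; \alpha), \tag{8}$$
where integrals are over the (K - 1) simplex.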
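To make the structure of the likelihood concrete, the sketch below evaluates the integrand in count form for a given membership vector and then averages it over Dirichlet draws. Plain Monte Carlo averaging is used here purely for illustration; it is an assumption of this example and not the estimation method used in the article, and all names and toy values are hypothetical.

```python
import math
import random

def log_lik_given_lambda(lam, word_counts, ref_counts, p, q):
    """Log of prod_w (sum_k lam_k p(w|k))^n(w,d) * prod_r (sum_k lam_k q(r|k))^n(r,d)."""
    total = 0.0
    for w, n in word_counts.items():
        total += n * math.log(sum(l * p_k[w] for l, p_k in zip(lam, p)))
    for r, n in ref_counts.items():
        total += n * math.log(sum(l * q_k[r] for l, q_k in zip(lam, q)))
    return total

def sample_dirichlet(alpha, rng):
    """Draw lambda ~ Dirichlet(alpha) by normalizing independent gamma draws."""
    while True:
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        if total > 0.0:
            return [x / total for x in g]

def mc_likelihood(word_counts, ref_counts, alpha, p, q, n_draws, rng):
    """Monte Carlo average of the conditional likelihood over Dirichlet draws."""
    acc = 0.0
    for _ in range(n_draws):
        lam = sample_dirichlet(alpha, rng)
        acc += math.exp(log_lik_given_lambda(lam, word_counts, ref_counts, p, q))
    return acc / n_draws

# Hypothetical document: word 0 twice, word 1 once, reference 0 cited.
word_counts = {0: 2, 1: 1}
ref_counts = {0: 1}
# With identical aspects the mixture does not depend on lambda, so the
# integral reduces to 0.5 ** 4 whatever the membership scores are.
p_same = [[0.5, 0.5], [0.5, 0.5]]
q_same = [[0.5, 0.5], [0.5, 0.5]]
estimate = mc_likelihood(word_counts, ref_counts, [0.5, 0.5], p_same, q_same,
                         n_draws=200, rng=random.Random(1))
```

For realistic aspect distributions the naive average is noisy, which is one motivation for approximating the integral in a more structured way.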
It is important to note that the assumption of exchangeability among words and references (conditional independence given the membership scores) does not imply joint independence among the observed characteristics. Instead, the assumption of exchangeability means that dependencies among words and references can be explained fully by the membership scores of the documents. For an extended discussion on exchangeability in this context, see ref. 16.
| Alternative Model for References |
|---|
Suppose an article focuses on a sufficiently narrow scientific area. In this case, the authors may have essentially perfect knowledge of the literature, and thus they would pay separate attention to each article in their pool of references as they consider whether to include it in the bibliography. Under these circumstances, given that the pool of references contains R articles, we assume that a document is represented as $d = \left(\{x_1^{(r_1)}\}_{r_1=1}^{R_1}, x_2, \ldots, x_{R+1}\right)$, where $x_1^{(r_1)}$ is a word in the abstract, R is the number of references, and x2,..., xR+1 are all references in the pool. Reference counts do not change: they are given by n(r, d) = 1 if the bibliography of d contains a reference to r and by n(r, d) = 0 if otherwise.
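Then our model for generating documents would be to sample $\lambda$ and $\{x_1^{(r_1)}\}$ according to Eqs. 4 and 5, and sample xj, j = 2,..., R + 1, according to
$$x_j \sim \text{Bernoulli}(q_\lambda(j)), \quad \text{where } q_\lambda(j) = \sum_{k=1}^{K} \lambda_k\, q(j|k). \tag{9}$$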
The likelihood function based on this alternative model would not only take into account which documents contain which references, but it also would incorporate the information about which references documents do not contain.
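The point about absent references can be illustrated with a short Python sketch. It assumes that each reference in the pool is included independently with a Bernoulli probability that mixes aspect-specific citation probabilities; that functional form, the name `theta`, and the toy numbers are assumptions of this example rather than details confirmed by the text.

```python
import math

def bernoulli_ref_log_lik(lam, cited, theta):
    """Log Pr(reference indicators | lambda) under independent Bernoulli inclusion.

    cited[j] is 1 if reference j of the pool is in the bibliography, else 0;
    theta[k][j] is the (hypothetical) probability that aspect k cites reference j.
    """
    total = 0.0
    for j, x in enumerate(cited):
        prob = sum(l * theta_k[j] for l, theta_k in zip(lam, theta))
        # Cited references contribute log(prob); uncited ones contribute log(1 - prob).
        total += math.log(prob) if x else math.log(1.0 - prob)
    return total

# Hypothetical pool of two references and K = 2 aspects.
lam = [0.5, 0.5]
theta = [[0.8, 0.2], [0.4, 0.6]]
cited = [1, 0]  # reference 0 is cited, reference 1 is not
ll = bernoulli_ref_log_lik(lam, cited, theta)
```

Unlike the multinomial model, every uncited reference in the pool changes the value of `ll`, so the cost of evaluating the likelihood grows with the size of the pool rather than with the length of the bibliography.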
Both the basic model for references and any alternatives still would need to reflect the time ordering on publications and include in the pool of possible references only those that have been published already, perhaps even with a short time lag. However, even such changes are unlikely to produce a "correct" model for citation practices.
| Estimating the Model |
|---|
In the variational approach (12), the mixture terms $\sum_k \lambda_k\, p(w|k)$ and $\sum_k \lambda_k\, q(r|k)$ are bounded from below in a product form that leads to a tractable integral; the lower bound is then maximized. A related approach, called expectation-propagation (13), also approximates each mixture term in a product form but chooses the parameters of the factors by matching first and second moments. Either of these approximations to the integral (Eq. 7) can be used in an approximate expectation-maximization (EM) algorithm to estimate the parameters of the models. It is shown in ref. 13 that expectation-propagation in general leads to better approximations than the simple variational method for mixed-membership models, although we obtained comparable results with both approaches on the PNAS collection. The results reported below use the variational approximation.
| The PNAS Database |
|---|
PNAS is one of the world's most cited multidisciplinary scientific journals. Historically, when submitting a research paper to PNAS, authors have to select a major category from physical, biological, or social sciences and a minor category from the list of topics. PNAS permits dual classifications between major categories and, in exceptional cases, within a major category. The lists of topics change over time to reflect changes in the National Academy of Sciences sections. PNAS, in its information for authors (revised in June 2002), states that it classifies publications in biological sciences according to 19 topics; the numbers of published articles and numbers of dual-classified articles in each topic are shown in Table 1.
| Results |
|---|
To determine whether there are certain contexts that correspond to the aspects, we examine the most common words in the estimated multinomial distributions. In Table 2, we report the first 15 of the high-probability words for each aspect, filtering out so-called stop words, words that are generally common in English. An alternative way would be to discard the words from the "stop list" before fitting the model. If the distribution of stop words is not uniform across the internal categories, this alternative approach may potentially produce different results.
As for words, multinomial distributions are estimated for the references that are present in our collection. For estimation, we only need unique indicators for each referenced article. After the model is fitted, attributes of high-probability references for each aspect provide additional information about its contextual interpretation. Table 3 provides attributes of 15 high-probability references for each aspect that were available in the database together with PNAS citation counts (number of times cited by PNAS articles in the database). Notice that, because the model draws from the contextual decomposition, having a high citation count is not necessary for having high aspect probability. In Table 3, high-probability references for aspect 1 are dominated by publications in Nature; references in aspect 7 are mostly Nature, Cell, and Science publications from the mid-1990s.
Among frequent references for the eight aspects, there are seven PNAS articles that share a special feature: they were all either coauthored or contributed by a distinguished member of the National Academy of Sciences. In fact, one article was coauthored by a Nobel prize winner, and two were contributed by other Nobelists. Although these articles do not have the highest counts in the database, they are notable for various reasons; e.g., one is on clustering and gene expression (2), and it is also one of the two highly cited PNAS articles on clustering that we mentioned in the Introduction. These seven articles may not necessarily be off-beat, but they may be among those that fulfill MacLane's petition regarding the special nature of PNAS.
From our analysis of high-probability words, it is difficult to determine whether the majority of aspects correspond to a single topic from the official classifications in PNAS biological science publications. To investigate whether there is a correspondence between the estimated aspects and the given topics, we examine aspect loadings (means of posterior membership scores) for each article. Given estimated parameters of the model, the distribution of each article's loadings can be obtained by means of Bayes' theorem. The variational and expectation-propagation procedures provide Dirichlet approximations to the posterior distribution $p(\lambda|d, \theta)$ for each document d. We use the mean of this Dirichlet as an estimate of the weight of the document on each aspect. Histograms of these loadings are provided in Fig. 1 for articles in evolution and genetics. Relatively high histogram bars near zero correspond to the majority of articles having small posterior membership scores for the given aspect. Among the articles published in genetics, some can be considered as full members in aspects 2, 3, 4, and 6, but many have mixed membership in these and other aspects. Articles published in evolution, on the other hand, show a somewhat different behavior: the majority of these articles comes fully from aspect 2.
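The loading computation itself is simple once an approximate posterior Dirichlet is available for each document. A minimal Python sketch follows; the parameter name `gamma` for the posterior Dirichlet parameters, the bin layout, and the toy values are assumptions for illustration.

```python
def posterior_mean_loadings(gamma):
    """Mean of a Dirichlet(gamma) distribution: gamma_k / sum(gamma)."""
    total = sum(gamma)
    return [g / total for g in gamma]

def histogram(values, n_bins=10):
    """Counts of values in equal-width bins on [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts

# Hypothetical approximate posterior for one document over K = 3 aspects.
loadings = posterior_mean_loadings([8.0, 1.0, 1.0])

# Hypothetical loadings of four articles on a single aspect.
bars = histogram([0.05, 0.02, 0.95, 0.55])
```

A tall first bar in such a histogram corresponds to many articles with loadings near zero on that aspect, and a tall last bar corresponds to articles that are close to full members.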
The estimated Dirichlet parameters are $\alpha_1 = 0.0195$, $\alpha_2 = 0.0203$, $\alpha_3 = 0.0569$, $\alpha_4 = 0.0346$, $\alpha_5 = 0.0317$, $\alpha_6 = 0.0363$, $\alpha_7 = 0.0411$, and $\alpha_8 = 0.0255$. The estimated Dirichlet, which is the generative distribution of membership scores, is "bathtub-shaped" on the simplex; as a result, articles tend to have relatively high membership scores in only a few aspects. To summarize the aspect distributions for each topic, we provide mean loadings and the graphical representation of these values in Table 4 Upper. Larger values correspond to darker colors, and the values below some threshold are not shown (white) for clarity. As an example, the mean loading of 0.2883 for pharmacology in the first aspect is the average of the posterior means of the membership scores for this aspect over all pharmacology publications in the database. Note that this percentage is based on the assumption of mixed membership and can be interpreted as indicating that 29% of the words in pharmacology articles originate from aspect 1, according to our model.
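One way to see what "bathtub-shaped" means in practice is to simulate membership scores from a Dirichlet with the eight reported parameter values. The sketch below does this with the Python standard library; the sample size, the seed, and the summary statistic are arbitrary choices made for illustration.

```python
import random

# Dirichlet parameters as reported in the text.
ALPHA = [0.0195, 0.0203, 0.0569, 0.0346, 0.0317, 0.0363, 0.0411, 0.0255]

def sample_dirichlet(alpha, rng):
    """Draw lambda ~ Dirichlet(alpha) by normalizing independent gamma draws."""
    while True:
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        if total > 0.0:  # very small shapes can underflow; redraw if all are 0
            return [x / total for x in g]

rng = random.Random(0)
draws = [sample_dirichlet(ALPHA, rng) for _ in range(2000)]

# Average size of the largest membership score in a draw.
mean_max = sum(max(d) for d in draws) / len(draws)

# Expected membership scores under the Dirichlet: alpha_k / sum(alpha).
prior_mean = [a / sum(ALPHA) for a in ALPHA]
```

Because every parameter is far below 1, simulated vectors tend to sit near the corners and edges of the simplex, so `mean_max` comes out well above the value of 1/8 that perfectly even membership would give.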
Finally, we compare the loadings (posterior means of the membership scores) of dual-classified articles to those that are singly classified. We consider two articles as similar if their loadings are equal for the first significant digit for all aspects. One might interpret singly classified articles that are similar to dual-classified as articles that should have had dual classification but did not. We find that, for 11% of the singly classified articles, there is at least one similar dual-classified article. For example, three biophysics dual-classified articles with loadings 0.9 for the second and 0.1 for the third aspect turned out to be similar to 86 singly classified articles from biophysics, biochemistry, cell biology, developmental biology, evolution, genetics, immunology, medical sciences, and microbiology.
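The similarity rule can be sketched as a comparison of rounded loading vectors. The code below interprets "equal for the first significant digit" as agreement after rounding each loading to one decimal place, which is an assumption of this example; the article names and loadings are hypothetical.

```python
def loading_key(loadings):
    """Signature of a loading vector: each loading rounded to one decimal place."""
    return tuple(round(x * 10) for x in loadings)

def singles_similar_to_duals(single, dual):
    """Names of singly classified articles whose signature matches a dual-classified one."""
    dual_keys = {loading_key(v) for v in dual.values()}
    return [name for name, v in single.items() if loading_key(v) in dual_keys]

# Hypothetical loadings over two aspects.
dual = {"dual_a": [0.9, 0.1]}
single = {
    "single_a": [0.88, 0.12],
    "single_b": [0.5, 0.5],
    "single_c": [0.93, 0.07],
}
similar = singles_similar_to_duals(single, dual)
share = len(similar) / len(single)  # fraction of singles with a similar dual
```

Collecting the dual-classified signatures in a set keeps the comparison linear in the number of articles instead of comparing every pair.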
| Concluding Remarks |
|---|
In an often-quoted statement, Box remarked: "all models are wrong" (17). In our case, the assumption of a bag of words and references in the mixed-membership model clearly oversimplifies reality; the model does not account for the general structure of the language, nor does it capture the compositional structure of bibliographies. Many interesting extensions of the basic model we have explored are possible, from hierarchical models of topics to more detailed models of citations and dynamic models of the evolution of scientific fields over time. Nevertheless, as Box notes, even wrong models may be useful. Our results indicate that mixed-membership models can be useful for analyzing the implicit structure of scientific publications.
| Footnotes |
|---|
To whom correspondence should be addressed. E-mail: elena{at}stat.washington.edu.