# Reconceptualizing the classification of PNAS articles

^{a}Department of Statistics and Faculty of Arts and Sciences Center for Systems Biology, Harvard University, Cambridge, MA 02138;^{b}Department of Statistics, School of Social Work, and the Center for Statistics and the Social Sciences, University of Washington, Seattle, WA 98195;^{c}Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213;^{d}Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213;^{e}Département de Mathématiques, Université de Montpellier, Place Eugène Bataillon, 34095 Montpellier Cedex 5, France; and^{f}Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642


Contributed by Stephen E. Fienberg, October 8, 2010 (sent for review August 6, 2009)

## Abstract

PNAS article classification is rooted in long-standing disciplinary divisions that do not necessarily reflect the structure of modern scientific research. We reevaluate that structure using latent pattern models from statistical machine learning, also known as mixed-membership models, which identify semantic structure in the co-occurrence of words in abstracts and of references. Our findings suggest that the latent dimensionality of patterns underlying PNAS research articles in the Biological Sciences is only slightly larger than the number of categories currently in use, although the content of the inferred categories differs substantially from that of the current ones. Further, the number of articles listed under multiple categories is only a small fraction of what our analyses suggest it could be. These findings, together with the sensitivity analyses, suggest ways to reconceptualize the organization of papers published in PNAS.

The *Proceedings of the National Academy of Sciences* (PNAS) is indexed by Physical, Biological, and Social Sciences categories, and, within these, by subclassifications that correspond to traditional disciplinary topics. When submitting a paper, authors classify it by selecting a major and a minor category. Although authors *may* opt to have dual or even triple indexing, only a small fraction of published PNAS papers do so. How well does the current classification scheme capture modern interdisciplinary research? Could some alternative structure better serve PNAS in fostering publication and visibility of the best interdisciplinary research? These questions may be thought of as falling under the broad umbrella of “knowledge mapping.”

A special 2004 supplement of PNAS, based on the *Arthur M. Sackler Colloquium on Mapping Knowledge Domains*, presented a number of articles that applied various knowledge mapping techniques to the contents of PNAS itself (1). What was striking about the issue is that two articles, by Erosheva et al. (2, henceforth EFL) and Griffiths and Steyvers (3, henceforth GS), based on similar statistical machine learning models, made statements about the number of inferred categories needed to describe semantic patterns in PNAS articles that differed by more than an order of magnitude (10 versus 300). Here we revisit these earlier analyses in light of a new one and attempt (*i*) to understand the differences between them and (*ii*) to estimate the minimal number of latent categories necessary to describe modern scientific research, often interdisciplinary, as reported in PNAS.

To set the stage, we provide a brief overview of the relevant models and summarize the similarities and differences between the two approaches and corresponding analyses presented in refs. 2 and 3. Using the same database as in EFL (2), we explore a wide range of analytic and modeling choices in our attempt to reconcile the differences in prior analyses. We approach the choice of the number of “latent categories,” which are inferred from data, with multiple strategies including one similar to that used by GS (3). Our findings suggest that 20 to 40 latent categories suffice to describe PNAS Biological Sciences publications, 1997–2001. Thus a reconceptualization of the indexing for PNAS Biological Sciences articles would require at most doubling the 19 traditional disciplinary categories. Because the true number of underlying semantic patterns is unknown and unknowable, we also report on a simulation study that confirms that, were there as few as 20 topics, our methodology would come close to estimating this number in a reasonable way. We also suggest some implications of our reconceptualization for the multiple indexing of interdisciplinary research in PNAS and elsewhere.

## Overview of the Earlier Analyses

EFL (2) and GS (3) both analyzed data extracted from PNAS articles from an overlapping time period using versions of mixed-membership models (4). A distinctive feature of mixed-membership models for documents is the assumption that articles may combine words (plus any other attributes, such as references) from several latent categories, according to the proportions of the article’s membership in each category. The latent categories are not observable; they are typically estimated from data together with the proportions. The latent categories need not correspond to existing PNAS disciplinary classifications. Rather, each category can be thought of as a probability distribution over document-specific attributes that specifies which words and references, say, co-occur frequently. A latent category can thus be read as a quantitative signature of the concepts and semantic patterns used more heavily in one disciplinary area than in others.

A mixed-membership structure allows for a parsimonious representation of interdisciplinary research without the need to create separate categories to accommodate both existing disciplinary links and new forms of collaborative research. Mixed-membership models achieve this through specifying article-level membership parameter vectors. In general, formulating mixed-membership models requires a combination of assumptions at the population level (e.g., PNAS Biological Sciences), subject level (individual articles), latent variable level (article’s membership vector), and the sampling scheme for generating subject’s attributes (article’s words and/or references). Variations of these assumptions can easily produce different mixed-membership models, and the models used by EFL and GS are special cases of the general mixed-membership model framework presented by EFL.

We summarize other aspects of analytic choices, model fitting, and model selection strategies by EFL and GS in Table 1. We believe that analytic decisions, such as working with the Biological Sciences articles* versus all PNAS articles, including commentaries and reviews in the database, or excluding rare words from the analysis, cannot account for the order-of-magnitude difference in the most likely number of latent categories inferred from similar data. Given that the models were so similar, we questioned the discrepancy between the 8 to 10 latent categories used by EFL and the 300 likely latent categories reported by GS. Why was there such a large difference in this key feature, around which all other results revolved? More importantly, in light of this issue, can this type of statistical model support a substantive reconceptualization of the classification scheme in use by PNAS?

Below, we report on new analyses and results for the PNAS data and offer evidence in support of the utility of mixed-membership analysis for grounding considerations about a useful reconceptualization of PNAS categories.

## Main Analysis

### Mixed-Membership Models.

We attempted to reconcile the differences in the original analyses of EFL and GS as follows: First, we used a common database for all models considered in this paper. Second, we varied data sources and hyperparameter estimation strategies to closely match those of the original analyses. Third, we remedied the absence of a dimensionality selection strategy in EFL by allowing the number of latent categories, *K*, to vary between 2 and 1,000 and comparing goodness of fit for different values of *K*.

Table 2 summarizes the resulting four mixed-membership models in a 2 × 2 layout. Model 3 is the closest to EFL’s model except that we now employ a symmetric Dirichlet distribution (*α*_{k} = *α* for all *k*) that matches GS’s assumption. Model 2 uses the same data source and hyperparameter estimation strategy as in GS. We include models 1 and 4 to complement the other two by balancing the choice of data and estimation strategies.

Let *x*_{1} be the observed words in the article’s abstract and *x*_{2} be the observed references in the bibliography. We assume that words and references come from finite discrete sets (vocabularies) of sizes *V*_{1} and *V*_{2}, respectively. For simplicity, we assume that the vocabulary sets are common to all articles, independent of the publication time. We assume that the distribution of words and references in an article is driven by an article’s membership in each of *K* latent categories, *λ* = (*λ*_{1},…,*λ*_{K}), representing proportions of attributes that arise from a given latent pattern; *λ*_{k} ≥ 0 for *k* = 1,2,…,*K* and ∑_{k=1}^{K} *λ*_{k} = 1. We denote the probabilities of the *V*_{1} words and the *V*_{2} references in the *k*th pattern by *θ*_{k1} and *θ*_{k2}, for *k* = 1,2,…,*K*. These vectors of probabilities define multinomial^{†} distributions over the two vocabularies of words and references for each latent category. We assume that article-specific (latent) vectors of mixed-membership scores are realizations from a symmetric Dirichlet^{‡} distribution. For an article with *R*_{1} words in the abstract and *R*_{2} references in the bibliography, the generative sampling process for the mixed-membership model is as follows:

### Mixed-Membership Models: Generative Process.

1. Sample *λ* ∼ Dirichlet(*α*_{1},*α*_{2},…,*α*_{K}), where *α*_{k} = *α* for all *k*.
2. Sample *x*_{1} ∼ Multinomial(*p*_{1λ},*R*_{1}), where *p*_{1λ} = ∑_{k} *λ*_{k} *θ*_{k1}.
3. Sample *x*_{2} ∼ Multinomial(*p*_{2λ},*R*_{2}), where *p*_{2λ} = ∑_{k} *λ*_{k} *θ*_{k2}.

This process corresponds to models 3 and 4 in Table 2. The process for models 1 and 2 relies on steps 1 and 2, where only words in abstracts, *x*_{1}, are sampled. The conditional probability of words and references in an article is then proportional to ∏_{v=1}^{V₁} (∑_{k} *λ*_{k} *θ*_{k1v})^{x_{1v}} × ∏_{v=1}^{V₂} (∑_{k} *λ*_{k} *θ*_{k2v})^{x_{2v}}.
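For concreteness, the three sampling steps above can be simulated directly. The sketch below (a toy illustration with made-up dimensions, not the authors' code) draws one article's membership vector, word counts, and reference counts:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V1, V2 = 4, 50, 30       # latent categories; word and reference vocabularies (toy sizes)
alpha = 0.5                 # symmetric Dirichlet hyperparameter
theta1 = rng.dirichlet(np.ones(V1), size=K)   # per-category word distributions
theta2 = rng.dirichlet(np.ones(V2), size=K)   # per-category reference distributions

def generate_article(R1=100, R2=20):
    """Sample one article: membership vector, word counts, reference counts."""
    lam = rng.dirichlet(alpha * np.ones(K))   # step 1: membership scores
    p1 = lam @ theta1                         # mixture over the word vocabulary
    p2 = lam @ theta2                         # mixture over the reference vocabulary
    x1 = rng.multinomial(R1, p1)              # step 2: words in the abstract
    x2 = rng.multinomial(R2, p2)              # step 3: references in the bibliography
    return lam, x1, x2

lam, x1, x2 = generate_article()
```

Dropping step 3 yields the words-only process of models 1 and 2.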

### Estimation and Posterior Inference.

Given a collection of articles, we treat pattern-specific distributions of words and references, {*θ*_{k1}} and {*θ*_{k2}}, as constant quantities to be estimated, and article-specific proportions of membership *λ*_{k} as incidental parameters whose posterior distributions we compute. We assume that the hyperparameter α is unknown and estimate it from the data in models 1 and 3; we fix the value of α at 50/*K*, following GS’s heuristic, in models 2 and 4. We carry out estimation and inference using the variational expectation-maximization algorithm (5, 6). Variational methods provide an approximation to a joint posterior distribution when the likelihood is intractable. When we fix the hyperparameter α, as in models 2 and 4, we can use a Gibbs sampler to sample from the exact joint posterior distribution, as implemented by GS in their original analysis. When we estimate α, however, we rely on variational approximations for estimation and inference. Simulation studies for the Grade of Membership model have shown that results obtained from both estimation methods are similar (7). We give full details in *SI Text*.
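For readers who wish to experiment, scikit-learn's `LatentDirichletAllocation` implements batch variational EM for the closely related latent Dirichlet allocation model with a fixed document-topic prior, analogous to the fixed-α setting of models 2 and 4 (this is an illustrative substitute, not the authors' implementation, and it handles words only, not references):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 100))   # toy document-term count matrix

K = 10
lda = LatentDirichletAllocation(
    n_components=K,
    doc_topic_prior=50.0 / K,   # GS's heuristic alpha = 50/K, held fixed
    learning_method="batch",    # batch variational EM
    max_iter=20,
    random_state=0,
)
doc_topic = lda.fit_transform(X)   # rows approximate posterior mean membership vectors
# normalize pseudo-counts into per-category word distributions (the theta's)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```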

### Dimensionality Selection.

Each time we fit a mixed-membership model to data, we must specify the number of latent categories, *K*, in the model. The goal of dimensionality selection is to identify a number of latent categories *K*^{∗} that is *optimal* in some sense. We identify the number of latent categories that leads to an optimal model-based summary of the database of scientific articles in a predictive sense, by means of a battery of out-of-sample experiments involving a form of cross-validation. We use 5-fold cross-validation, common in the machine learning literature, e.g., ref. 8, and explain the rationale for this choice in *SI Text*. Each out-of-sample experiment consists of five model fits for a given value of *K*. First, we split the *N* articles into five batches. Then, in turn, we estimate the model parameters using the articles in four batches, and we compute the likelihood of the articles in the fifth held-out batch. This leads to mean and variability estimates of quantities that summarize the goodness of fit of the model for a given *K*, on a batch of articles not included in the estimation. We consider a grid of values for *K* that range from a small to a large number of latent categories; namely, *K* = 2,…,5,10,…,45,50,75,100,200,…,900,1,000.
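A minimal version of this out-of-sample experiment, using scikit-learn's `KFold` and the variational lower bound returned by `score` as the held-out criterion, might look as follows (a sketch on synthetic counts; the toy grid stands in for the paper's much larger one):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 60))    # toy document-term matrix

def held_out_score(X, K, n_splits=5):
    """Average held-out log-likelihood bound over the folds, for one K."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=0).split(X):
        lda = LatentDirichletAllocation(n_components=K, max_iter=10, random_state=0)
        lda.fit(X[train_idx])                 # estimate on four batches
        scores.append(lda.score(X[test_idx])) # bound on held-out log p(x)
    return float(np.mean(scores))

grid = [2, 5, 10]                             # the paper's grid runs from 2 to 1,000
results = {K: held_out_score(X, K) for K in grid}
best_K = max(results, key=results.get)        # K with the best predictive fit
```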

## Sensitivity Analyses

Fitting the four mixed-membership models from Table 2 to the PNAS dataset allows us to examine the impact of using references and of estimating the hyperparameter α in a 2 × 2 design. We examine the sensitivity of empirical PNAS results obtained with mixed-membership models by considering the impact on model fit and selection of (*i*) our key assumption of mixed membership and our simple bag-of-references model. In addition, (*ii*) we use a simulation study to investigate the methodological issue of the potential impact on dimensionality selection of fixing the hyperparameter α, following the strategy of GS. Finally, to address interpretation issues, we study (*iii*) the distributions of shared memberships for different values of *K* and investigate (*iv*) whether increases in model dimension *K* beyond some optimal value change the macrostructure of the latent categories.

### (i) Alternative Models.

To study sensitivity of our latent dimensionality results to the key assumption of mixed membership, we implement another mixture model assuming that research reports belong to only one of the latent categories. This full-membership model can be thought of as a special case of the mixed-membership model where, for each article, all but one of the membership scores are restricted to be zero. As opposed to traditional finite mixture models that are formulated conditional on the number of latent categories *K*, this model variant allows the joint estimation of the latent categories, θ, and of the model dimension *K*.

We assume an infinite number of categories and implement this assumption through a Dirichlet process prior, *D*_{α}, for *λ*; e.g., see refs. 9 and 10. The distribution *D*_{α} models the prior probabilities of latent pattern assignment for the collection of documents. In particular, for the *n*th article, given the set of assignments for the remaining articles, *λ*_{-n}, this prior puts a probability mass on the *k*th pattern (out of *K*_{-n} distinct patterns observed in the collection of documents excluding the *n*th one), which is proportional to the number of documents associated with it. The prior distribution also puts a probability mass on a new, (*K*_{-n} + 1)th latent semantic pattern, which is distinct from the patterns (1,…,*K*_{-n}) observed in *λ*_{-n}. That is, *D*_{α} entails prior probabilities for each component of *λ*_{n} as follows: Pr(*λ*_{n[k]} = 1 | *λ*_{-n}) = *m*(-*n*,*k*)/(*N* − 1 + *α*) for *k* = 1,…,*K*_{-n}, and Pr(*λ*_{n[K_{-n}+1]} = 1 | *λ*_{-n}) = *α*/(*N* − 1 + *α*), where *m*(-*n*,*k*) is the number of documents that are associated with the *k*th latent pattern, excluding the *n*th document, i.e., *m*(-*n*,*k*) = ∑_{n′≠n} *λ*_{n′[k]}.
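These two prior masses, one on each existing pattern and one on a fresh pattern, are the standard Chinese-restaurant-process view of the Dirichlet process. A small sketch (illustrative only; `crp_prior_probs` is a hypothetical helper) computes them for a toy set of assignments:

```python
import numpy as np

def crp_prior_probs(assignments, alpha):
    """Prior probabilities of assigning the next document to each existing
    latent pattern, or to a brand-new one, under a Dirichlet process prior."""
    n = len(assignments)
    patterns, counts = np.unique(assignments, return_counts=True)
    probs = counts / (n + alpha)        # existing pattern k: m_k / (n + alpha)
    p_new = alpha / (n + alpha)         # new, previously unseen pattern
    return dict(zip(patterns.tolist(), probs.tolist())), p_new

# six documents spread over three patterns; concentration alpha = 1
existing, p_new = crp_prior_probs([0, 0, 1, 2, 2, 2], alpha=1.0)
```

Larger α makes new patterns more likely a priori, so α effectively controls how many distinct patterns the model discovers.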

The generative sampling process for this full-membership model is as follows:

1. Sample *λ*_{n} ∼ Dirichlet Process(*α*).
2. Sample *x*_{1} ∼ Multinomial(*θ*_{k1},*R*_{1}), where *λ*_{n[k]} = 1.
3. Sample *x*_{2} ∼ Multinomial(*θ*_{k2},*R*_{2}), where *λ*_{n[k]} = 1.

As with mixed-membership models, we considered two versions of the data: words from the abstract and references from the bibliography of the collection of articles. Model 5 corresponds to this process with steps 1 and 2, where we sample only words, *x*_{1}. Model 6 corresponds to this process with steps 1–3, where we sample words and references, *x*_{1} and *x*_{2}. We provide full details about estimation and inference via Markov chain Monte Carlo methods in *SI Text*.

Additionally, we fit a mixed-membership model with a time-dependent bag of references. This confirmed that giving up the time resolution of the articles in our database has a negligible impact on model selection results.

### (ii) Simulation Study.

We simulated data from a mixed-membership model with *K*^{*} = 20 latent categories to obtain a corpus of documents we could use as ground truth. We used a vocabulary of 1,000 words and simulated 5,000 documents. We sampled the length of each document from a Poisson distribution with a mean of 100 words. We set the hyperparameter controlling mixed membership equal to *α* = 0.01, whereas we sampled the banks of multinomial parameters corresponding to latent patterns from a symmetric Dirichlet with hyperparameter 0.01. We then treated *K*^{*} as unknown and approached model estimation in two ways: (*i*) by estimating the hyperparameter α as in our main analysis and (*ii*) by setting the hyperparameter *α* = 50/*K*, for a given *K*, according to the ad hoc strategy implemented by GS.

For model selection purposes, we considered a grid for *K* as follows: increments of 4 for 10 ≤ *K* ≤ 50, increments of 10 for 60 ≤ *K* ≤ 100, and increments of 50 for 150 ≤ *K* ≤ 500. Thus, we fit the model 25 times for each of 24 values of *K*.

### (iii) Shared Memberships.

Assume that a document is associated with latent category *k*∈{1,…,*K*} if and only if its membership score for this category is greater than *sd* + 1/*K*, where *sd* is the posterior standard deviation of the membership scores. For each value of *K* in our grid, we computed the number of documents associated with exactly *k*∈{1,…,*K*} latent categories. We then examined these distributions of the shared membership for a range of models with up to *K* = 300.
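Under this rule, the distribution of shared memberships can be tabulated directly. The sketch below is an illustration that reads *sd* as the overall standard deviation of the membership scores (one possible interpretation of the paper's posterior quantity):

```python
import numpy as np

def shared_membership_counts(memberships):
    """For each document, count latent categories whose membership score
    exceeds sd + 1/K, then return the frequency of each count."""
    N, K = memberships.shape
    threshold = memberships.std() + 1.0 / K   # sd of the scores plus the uniform share
    per_doc = (memberships > threshold).sum(axis=1)
    values, freq = np.unique(per_doc, return_counts=True)
    return dict(zip(values.tolist(), freq.tolist()))

rng = np.random.default_rng(0)
# sparse toy membership vectors for 500 documents, K = 20
lam = rng.dirichlet(0.1 * np.ones(20), size=500)
dist = shared_membership_counts(lam)
```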

### (iv) Macrostructure.

Attempting to interpret the latent categories manually for all values of *K* in our grid is unreasonable. Hence, we analyzed computationally whether increases in model dimension *K* destroy the macrostructure and reorganize the latent categories by comparing multinomial probabilities for the latent patterns from the model with a smaller dimension *K*^{*} with the closest-matching ones of the model with a larger dimension.
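One way to carry out this comparison, sketched below under the assumption that the per-category probability vectors are available as matrix rows, is to correlate each category of the smaller model with its best-matching category in the larger model:

```python
import numpy as np

def best_match_correlations(theta_small, theta_large):
    """For each category of the smaller model, the correlation with the
    closest-matching category of the larger model."""
    K1 = theta_small.shape[0]
    # correlation matrix over all rows of both models, then the cross block
    c = np.corrcoef(np.vstack([theta_small, theta_large]))
    cross = c[:K1, K1:]            # K1 x K2 cross-correlations
    return cross.max(axis=1)       # best match for each small-model category

rng = np.random.default_rng(0)
theta_small = rng.dirichlet(np.ones(100), size=5)
# toy "larger" model that happens to contain the smaller model's categories
theta_large = np.vstack([theta_small, rng.dirichlet(np.ones(100), size=10)])
best = best_match_correlations(theta_small, theta_large)
```

Declining best-match correlations as the larger dimension grows would indicate, as reported below, that the macrostructure is not preserved.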

We provide details for sensitivity analyses (iii) and (iv) in *SI Text*.

## Main Results

### Dimensionality.

Our primary goal is to assess, qualitatively and quantitatively, a reasonable range for the number of latent categories underlying the PNAS database. Our analysis offers some insights into the impact on the results from differences in the models and the inference strategies. The simulation study also investigates the impact of such differences on model fit and dimension selection in a controlled setting.

### Dimensionality for Mixed-Membership Models.

To provide a quantitative assessment of model fit in terms of the number of latent categories *K*, we relied on their predictive performance with out-of-sample experiments, as described above. Recall that for mixed-membership analysis with models 1–4, we assume that *K* is an unknown constant. We split the articles into five batches to be used for all values of *K*. We considered values of *K* on a grid, spanning a range between 2 and 1,000. To summarize goodness of fit of the model in a predictive sense, we examine the held-out probability, that is, the probability computed on the held-out batch of articles.^{§}

For each value of *K* on the grid, we computed the average held-out log-probability value over the five model fits. Fig. 1 summarizes predictive performance of the mixed-membership models 1–4, for values of *K* = 2,…,100 (the average log-probability values continued to decline gradually for *K* greater than 100). The goodness of fit improves when we estimate the hyperparameter α (solid lines); however, all plots suggest that an optimal choice of *K* falls in the range of 20–40, independent of the estimation strategy for α and of whether references are included. Values of *K* that maximize the held-out log probability are somewhat greater when the database includes references. We obtained similar dimensionality results using the Bayesian information criterion (11).

### Dimensionality for Full-Membership Models.

Although we base the choice of *K* for mixed-membership models 1–4 on their predictive performance, semiparametric full-membership models 5 and 6 allow us to examine the posterior distribution of *K*.

Fig. 2 shows the posterior distributions of *K* (density on the *Y* axis versus values of *K* on the *X* axis) obtained by fitting the semiparametric models with words only (model 5, solid line) and with words and references (model 6, dashed line). The maximum a posteriori estimate of *K* is smaller for the model including references than for the model with words only. Further, the posterior range of *K* is smaller for the model including references. Thus, adding references to the models reduces the posterior uncertainty about *K*.

### Dimensionality: Overall.

Our simulation showed that setting the hyperparameter α as a function of *K*, as GS did, had the greatest impact on estimates of the document-specific mixed-membership vectors, leading to a modest upward bias in the choice of an optimal *K*, but it did not result in an order-of-magnitude difference. We provide more detailed results on our simulation study in *SI Text*.

Overall, for all six models, values of *K* in the range of 20–40 are plausible choices for the number of latent categories in PNAS Biological Sciences research reports, 1997–2001.

### Qualitative and Quantitative Analysis of Inferred Categories.

For illustrative purposes, we consider *K*^{*} = 20 for the mixed-membership model with words and references. We obtain qualitative descriptions of the latent categories using two approaches: via examining high probability words and references in each category and via comparing the model-based inferred article categories with the original PNAS classifications.

Studying the lists of words and references that are most likely to occur according to the distribution of each latent category, we see some interesting patterns that are distinct from current PNAS classifications. For example, category 5 focuses on the process of apoptosis and genetic nuclear activity in general. Category 12 concerns peptides. Several categories relate to protein studies including pattern 8 that deals with protein structure and binding. We offer an interpretation of *all* the topics in *SI Text* in an effort to demonstrate what a reasonable model fit should look like.

To examine the relationship between the 20 inferred categories and the 19 original PNAS categories in the Biological Sciences, we plot in Fig. 3 the average membership of the set of documents in the *i*th PNAS class (row) in the *k*th latent category (column).^{¶} We threshold the average membership scores so that small values (less than 10%) would not distract from the visual pattern.

Fig. 3 (*Left* to *Right*) details results for models 1 and 2 (words only) and models 3 and 4 (words and references). The results reveal the impact of expanding the database to include references and of setting the hyperparameter at *α* = 50/*K* (models 2 and 4). When we include the references, the relationship of the estimated latent categories to the designated PNAS classifications becomes more composite for each estimation method. When we estimate the hyperparameter α, we observe better agreement between the estimated latent categories and the original PNAS classifications. A greater number of darker color blocks points to more articles with substantial estimated membership in just a few latent categories for the α-estimated models. Lighter blocks for the constrained-α models may reflect either more spread-out membership (small membership values for all articles) or an apparent disagreement of estimated membership vectors among articles from the same original PNAS classification. Either explanation leads us to conclude that estimating hyperparameters gives us a model that connects better to the original PNAS classification.

From an inspection of the estimated categories, we see that small subclassifications such as Anthropology do not result in separate categories and broad ones such as Microbiology and Pharmacology have distinct subpatterns within them. Nearly all of the PNAS classifications are represented by several word-and-reference co-occurrence patterns, consistently across models.

Fig. 4 shows the distributions of shared memberships for varying values of *K* based on model 3. Overall, no matter what the dimensionality of the model, most articles tend to be associated with about five or fewer latent categories. For *K*^{*} = 20, 37% of articles are associated with two latent categories and 2% with three categories, the theoretical upper bound on the number of associations in this case. *SI Text* provides further details.

When we investigated the impact of increases in dimensionality *K* on interpretation, we found substantial reorganization among distributions of words and references in the latent categories. We compared estimated multinomial distributions for words and references between categories from pairs of models of dimensions *K*_{1} and *K*_{2}, where *K*_{1} < *K*_{2}, by computing correlations between all pairs of vectors. We found that correlations between the *K*_{1} vectors in the smaller model and *K*_{1} best-matching vectors from the larger model tend to diminish as *K*_{2} increases, indicating that the macrostructure is not preserved. As expected, we also found that correlations between the *K*_{1} vectors in the smaller model and additional vectors from the larger model were small.

### Predictions.

Recall that our database includes 11,988 articles, classified by the authors into 19 subcategories of the Biological Sciences section. Of these, 181 were identified by their authors as having dual classifications. Here, we identify publications with membership vectors similar to those of dual-classified articles, i.e., single-classified articles that might also have been cross-submitted. Table 3 summarizes these results. The parametric models 1–4 predict that, respectively, 554, 114, 1,008, and 538 additional articles were *similar* to the author-identified dual-classified articles. By similar, we mean that their mixed-membership vectors in the 20 latent semantic patterns match the membership vector of a dual-classified article to the first significant digit. Of particular interest is the large proportion of Biochemistry, Neurobiology, Biophysics, and Evolution articles that our analyses suggest as potential dual-classified articles.
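This matching rule can be sketched as follows; `one_sig_digit` and `dual_candidates` are hypothetical helpers implementing one reading of "match to the first significant digit" (rounding each membership score to one significant digit before comparing vectors):

```python
import numpy as np

def one_sig_digit(v, eps=1e-12):
    """Round each membership score to one significant digit."""
    v = np.asarray(v, dtype=float)
    exp = np.floor(np.log10(np.maximum(np.abs(v), eps)))
    return np.round(v / 10**exp) * 10**exp

def dual_candidates(memberships, dual_idx):
    """Indices of single-classified articles whose rounded membership vector
    matches that of some author-identified dual-classified article."""
    rounded = np.array([one_sig_digit(m) for m in memberships])
    dual_set = {tuple(r) for r in rounded[dual_idx]}
    return [i for i, r in enumerate(rounded)
            if i not in set(dual_idx) and tuple(r) in dual_set]

# toy example: article 0 is dual-classified; article 1 rounds to the same vector
candidates = dual_candidates(
    np.array([[0.52, 0.48], [0.49, 0.51], [0.9, 0.1]]), dual_idx=[0])
```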

## Discussion

We have focused on alternative specifications for mixed-membership models to explore ways to classify papers published in PNAS that capture, in a more salient fashion, the interdisciplinary nature of modern science. Through the analysis of 5 y of PNAS Biological Sciences articles, we have demonstrated that a small number of classification topics does an adequate job of capturing the semantic structure of the published articles. These topics also provide a reasonable correspondence to the current PNAS classification structure.

The machine learning literature contains many variants of mixed-membership models for classification and clustering problems. For example, Blei and Lafferty (12) describe a dynamic topic model and apply it to data from 125 y of *Science*. A different approach to references might exploit the network structure of authors with the mixed-membership stochastic block model of ref. 13 or the author–topic model of ref. 14; see also a review of such models in the psychological literature (15). The selection of an appropriate number of latent categories, *K*, is often hidden behind the scenes in applications, with some exceptions, such as models that place a probability distribution over the number of dimensions, including the Dirichlet process (16) and its many variants (17–20).

Here we provide an extended analysis of dimensionality in a database of PNAS publications, contrasting our findings with earlier published ones (2, 3). The consistency of our results across multiple variants of mixed-membership models indicates that this type of statistical analysis, when done carefully, could support a substantive reconceptualization of the classification scheme used by PNAS. A more in-depth study of semantic patterns, inferred from data extracted from papers published in PNAS using tools such as those described here, would also assist in the review process and in the indexing of published papers, better reflecting modern, overlapping, and interdisciplinary scientific publications. Finally, instead of relying solely on citations, such tools could automatically suggest related work by identifying articles with the most similar semantic patterns.

## Footnotes

Author contributions: E.M.A., E.A.E., S.E.F., C.J., and T.L. designed research; E.M.A., E.A.E., S.E.F., C.J., T.L., and S.S. performed research; E.M.A., E.A.E., S.E.F., C.J., T.L., and S.S. contributed new reagents/analytic tools; E.M.A., E.A.E., S.E.F., C.J., T.L., and S.S. analyzed data; and E.M.A., E.A.E., S.E.F., C.J., and T.L. wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1013452107/-/DCSupplemental.

^{*}Of 13,008 research articles published during this five-year period, 12,036 (92.53%) were in the Biological Sciences.

^{†}A multinomial distribution quantifies the intuition that words (or references) occur at each position in an abstract (or a bibliography) with different probabilities. The data suggest which words and references are most popular in articles that express each latent category.

^{‡}A symmetric Dirichlet distribution quantifies the intuition that an article tends to belong to a few latent categories when *α* < 1. As *α* grows above 1, an article belongs to more and more latent categories. The data suggest that *α* < 1 for articles in the Biological Sciences, implying that each research article covers only a few scientific areas.

^{§}Technically, the held-out probability is a variational lower bound on the likelihood of the held-out documents, as we detail in *SI Text*.

^{¶}We show only the 13 most frequently used disciplinary categories here, but we provide the complete figure in *SI Text*.

## References

- Shiffrin RM, Börner K
- Erosheva EA, Fienberg SE, Lafferty J
- Griffiths TL, Steyvers M
- Weihs C, Gaul W
- Erosheva EA, Fienberg SE
- Hastie T, Tibshirani R, Friedman JH
- Cohen WW, Moore A
- Blei DM, Lafferty JD
- Airoldi EM, Blei DM, Fienberg SE, Xing EP
- Rosen-Zvi M, Chemudugunta C, Griffiths TL, Smyth P, Steyvers M
- Griffiths TL, Ghahramani Z
- Duan JA, Guindani M, Gelfand AE

## Article Classifications

- Physical Sciences: Statistics
- Social Sciences