Temporal patterns of genes in scientific publications
- †Program for Evolutionary Dynamics, Harvard University, One Brattle Square, Cambridge, MA 02138; and
- §Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139
Edited by Robert May, University of Oxford, Oxford, United Kingdom, and approved June 7, 2007 (received for review February 14, 2007)
Abstract
Publications in scientific journals contain a considerable fraction of our scientific knowledge. Analyzing data from publication databases helps us understand how this knowledge is obtained and how it changes over time. In this study, we present a mathematical model for the temporal dynamics of data on the scientific content of publications. Our data set consists of references to thousands of genes in the >15 million publications listed in PubMed. We show that the observed dynamics may result from a simple process: Researchers predominantly publish on genes that already appear in many publications. This might be a rewarding strategy for researchers, because there is a positive correlation between the frequency of a gene in scientific publications and the journal impact of the publications. By comparing the empirical data with model predictions, we are able to detect unusual publication patterns that often correspond to major achievements in the field. We identify interactions between yeast genes from PubMed and show that the frequency differences of genes in publications lead to a biased picture of the resulting interaction network.
Cultural information, including scientific knowledge, changes over time. It has been argued that the dynamics of this change resemble an evolutionary process (1–7). Philosophers have developed “evolutionary epistemologies” to describe the dynamics of changes in scientific theories (2), whereas biologists such as Dawkins introduced the concept of “memes” to highlight the analogies in the evolution of biological and cultural information (3, 6, 7). Although the concept of “memes” has been criticized for being overly simplistic (8–10), and it is unclear how far-reaching the analogies between cultural and biological information are, an evolutionary framework for the dynamics of cultural information remains, in our view, highly plausible.
A quantitative analysis of empirical data is an essential step toward a more detailed understanding of the processes that change cultural information. For scientific knowledge, excellent time-resolved data are available from publication databases such as PubMed (www.ncbi.nlm.nih.gov/entrez). Several studies use these data mainly to analyze properties of citation and collaboration networks (11–17). Here, we present an analysis of the dynamics of content-related terms in the literature. We analyze data on the temporal patterns of thousands of genes in the titles and abstracts of scientific publications. This approach allows us to follow the dynamics of scientific progress in genetics and related research fields based on a large set of well defined and content-related terms.
To identify gene names, including synonyms and names of gene products in abstracts and titles of scientific publications listed in PubMed, we use the iHOP text-mining system (18, 19). PubMed is the major public database for publications in the life sciences. Depending on the species, the average recall of gene quotations in iHOP is ≈80%, and average precision is ≈95% (19). For some species such as yeast (Saccharomyces cerevisiae), these values reach 83% and 99%, respectively. A more detailed description has been given earlier and is also available on the iHOP web site (www.ihop-net.org/UniPub/iHOP/info/gene_index/index.html). For yeast, there are >4,000 genes that in total appear ≈80,000 times in ≈35,000 publications during the time span covered by PubMed (1975–2006). The total number of references to yeast genes in the literature shows an approximately exponential growth phase until the mid-1990s, followed by a phase of saturation (Fig. 1). The frequency distribution of genes in scientific publications resembles a power law (20). A few highly popular genes dominate the literature. Similar distributions have been observed in other types of bibliometric data sets, such as the frequency distribution of citations of scientific publications, where a few top papers attract most of the references (11–13).
Fig. 1. Temporal patterns of publications on yeast genes. (A) Number of publications per year of the most popular gene (ACT1) and a typical gene (PET54). There is a high level of stochasticity in the temporal pattern, and there are large differences in the publication dynamics between typical and popular genes. Data are shown with solid lines. Results from a sample simulation with parameters estimated from the data are shown with dashed lines. (B) The total number of gene names in titles and abstracts of scientific publications (dots) shows exponential growth until about 1994, followed by a phase of saturation. The sample simulation (crosses) shows an excellent fit with this overall growth. (C) Double-logarithmic plot of the relative frequency at which genes appear in publications vs. the number of genes that appear at that frequency. The distribution shows a long tail, i.e., the literature is dominated by a few highly popular genes. Although the most popular gene, ACT1, appears in >1,400 publications, there are ≈650 genes that appear in only one publication. The frequency distribution emerging in the simulations (crosses) is similar to the observed frequency distribution (dots).
Throughout this manuscript, we refer to the frequency of a gene in PubMed titles and abstracts as its popularity. The most popular yeast gene, ACT1 (coding for actin), appears in >1,400 publications, whereas there are ≈650 genes that appear only once and ≈2,000 genes that never appear in the title or abstract of a publication. These remarkable frequency differences may reflect differences in the importance of genes. In line with this reasoning, the most-studied human genes, such as CD4 and p53, are related to human diseases and thus are of high societal relevance. For yeast genes, however, it is less clear how societal relevance can be determined, and so far no relation has been observed between the popularity of a gene and its importance to cellular processes (20). This indicates that the emergence of highly popular genes is not necessarily driven by importance alone but also by other mechanisms in the praxis of scientific research, such as conventions and trends.
In the following, we develop a stochastic mathematical model to describe the dynamics of references to genes in the literature. To analyze whether the emergence of highly popular genes might in principle be driven by research trends rather than differences in the importance of genes, we assume that all genes of a species have the same properties in their publication dynamics. Two processes are assumed to contribute to the growth of literature on the genes of a species. First, publications on a specific gene generate additional publications on this gene. Second, publications on one gene generate publications on other genes. We assume linear kinetics for both processes. A third process is required to initiate growth. We assume that publications may appear “spontaneously” at a constant rate that does not depend on previous research. The expected number of novel publications $\langle \Delta P_{i,t+1} \rangle = \langle P_{i,t+1} - P_{i,t} \rangle$ on gene $i$ at time point (year) $t + 1$ is given by:

$$\langle \Delta P_{i,t+1} \rangle = k_1 P^*_t + k_2 P_{i,t} + k_3 \tag{1}$$

$P_{i,t}$ and $P^*_t$ denote the total number of publications on gene $i$ and the average number of publications over all genes of a species, appearing until time point $t$. Parameters $k_1$ and $k_2$ describe the rates at which a publication on a gene gives rise to further publications on other genes and the same gene, respectively. Parameter $k_3$ describes the rate of initial spontaneous publications. The parameters $k_1$, $k_2$, and $k_3$ are assumed to be the same for all genes of a species. Thus a model with only three parameters is used to describe the dynamics of publications on several thousands of genes of a species. The model is related to “preferential attachment” and similar processes (21, 22). After initial linear growth at rate $k_3$, the average number of publications per gene grows approximately exponentially at rate $k_1 + k_2$. Depending on parameters $k_1$ and $k_2$, the resulting frequency distribution is characterized by few ($k_2 \ll k_1$) or many ($k_1 \ll k_2$) highly popular genes.

To describe the saturation phase of the last 10 years, we introduce the term $1/[1 + (P^*_t/P_S)^{\alpha}]$. The parameters $P_S$ and $\alpha$ determine after how many publications per gene and how abruptly saturation takes effect. The resulting dynamics is described by

$$\langle \Delta P_{i,t+1} \rangle = \frac{k_1 P^*_t + k_2 P_{i,t} + k_3}{1 + (P^*_t/P_S)^{\alpha}} \tag{2}$$
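As a rough illustration, the saturating growth rule can be written as a small function. This is a minimal sketch, assuming that the expected number of new publications on gene i has the form (k1·P*_t + k2·P_{i,t} + k3) / [1 + (P*_t/PS)^α], which is what the model description suggests. The function name, argument names, and the example counts are hypothetical and are not taken from the original analysis.

```python
def expected_new_publications(totals, k1, k2, k3, p_s, alpha):
    """Expected number of new publications per gene in the next year.

    totals : cumulative publication counts P_{i,t}, one entry per gene.
    Assumed form: (k1 * P*_t + k2 * P_{i,t} + k3) / (1 + (P*_t / p_s) ** alpha),
    where P*_t is the mean of `totals` over all genes.
    """
    mean_total = sum(totals) / len(totals)
    # Saturation factor: close to 1 while the mean is far below p_s,
    # and shrinking toward 0 as the mean exceeds p_s.
    saturation = 1.0 / (1.0 + (mean_total / p_s) ** alpha)
    return [(k1 * mean_total + k2 * p + k3) * saturation for p in totals]


# Illustrative call with the yeast estimates reported in Results
# (k1 = 0.028, k2 = 0.2, k3 = 0.005, PS = 8.2, alpha = 1.2) and made-up counts.
rates = expected_new_publications([0, 1, 50], 0.028, 0.2, 0.005, 8.2, 1.2)
```

Because the k2 term scales with a gene's own publication count, the gene with 50 publications receives by far the largest expected increment, which is the "rich get richer" behavior the model is meant to capture.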
Results
We use maximum-likelihood estimation to determine the parameters $k_1$, $k_2$, $k_3$, $P_S$, and $\alpha$ from the data and perform a bootstrap analysis to determine 96% confidence intervals. Details are given in Materials and Methods. The estimated rates for exponential growth are $k_1$ = 0.028 (0.024 < $k_1$ < 0.034), $k_2$ = 0.2 (0.18 < $k_2$ < 0.22), and $k_3$ = 0.005 (0.003 < $k_3$ < 0.007). Saturation is described by $P_S$ = 8.2 (6.4 < $P_S$ < 9.8) and $\alpha$ = 1.2 (1.0 < $\alpha$ < 1.6). The original data and a sample set of simulated data (see Materials and Methods) using the estimated parameters are highly similar (Fig. 1). An initial growth of $k_3$ = 0.005 indicates that ≈30 genes per year appear spontaneously in the literature. Before saturation takes effect, the average number of gene names in titles and abstracts of publications grows exponentially at a rate of $k_1 + k_2$ ≈ 0.23. Most of this growth (≈87%) is driven by the mechanism that publications on a specific gene promote further publications on the same gene. Only a small proportion of growth (≈13%) is driven by the alternative mechanism that publications on genes promote further publications on different genes. The estimated rates $k_1$, $k_2$, and $k_3$ are similar to those estimated for the model of exponential growth (Eq. 1) for 1975–1994 [see supporting information (SI) Fig. 4]. The model captures not only the overall growth of publications on yeast genes but also the evolution of the frequency distribution (SI Fig. 4C). Thus for a time range of 20 years, a very simple model with only three parameters gives an excellent description for the temporal patterns of >6,000 gene names in the literature.
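The figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that Eq. 1 has the linear form suggested by the model description, with rate k1 acting on the average P*_t, rate k2 acting on the gene's own count P_{i,t}, and a constant k3, and that N = 6,200 yeast genes are considered (the number given in Materials and Methods). It uses the rounded point estimates, so the shares agree with the quoted percentages only to within about one percentage point.

```latex
\begin{align*}
\langle \Delta P^*_{t+1} \rangle
  &= \frac{1}{N}\sum_{i=1}^{N}\bigl(k_1 P^*_t + k_2 P_{i,t} + k_3\bigr)
   = (k_1 + k_2)\,P^*_t + k_3, \\
k_1 + k_2 &= 0.028 + 0.2 = 0.228 \approx 0.23, \\
\frac{k_2}{k_1 + k_2} &= \frac{0.2}{0.228} \approx 0.88, \qquad
\frac{k_1}{k_1 + k_2} = \frac{0.028}{0.228} \approx 0.12, \\
k_3 N &= 0.005 \times 6200 = 31 \approx 30 \ \text{per year}.
\end{align*}
```

The first line shows why the average grows at rate k1 + k2 once it is large compared with k3: averaging the per-gene rule over all genes turns the gene-specific term into the average itself.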
Our results indicate that a frequency distribution with highly popular genes may, in principle, emerge, even if there are no differences in the importance of genes. However, they do not imply there are no differences in importance. Models that account for differences in importance may describe the dynamics similarly well. In practice, it is often impossible to determine whether importance plays a role. In our model, for example, the popularity of a gene depends on the time of its first appearance in the literature. The earlier a gene appears, the higher the expected frequency at a future point in time. However, it is difficult to determine whether the order at which genes appear in the literature is random, as in our model, or whether it is driven by importance. Furthermore, besides importance, there might be other factors that favor publications on specific genes. Some genes may be more easily accessible for scientific studies for methodological or experimental reasons. Particularly in early studies, some genes might be favored, because they are consistently expressed at a high level, or because mutations result in a distinctive phenotype. Prior knowledge, for example from studies in other species, may additionally increase the attractiveness of some genes for scientific research. Furthermore, even if trends and conventions play a role in the dynamics of science, this is not necessarily negative. It might be convenient and scientifically justified to always use the same genes or gene products as controls in assays or to use specific selective markers.
The strength of our model is its simplicity. Although scientific research is a highly complex process, our results show that a very simple model can predict frequency patterns of content-related terms, such as genes, in scientific publications. Our model does not rely on quantities such as importance, which are difficult or impossible to quantify in an objective way. Irrespective of the role of importance, our results indicate that the temporal evolution of publications on yeast genes follows a very simple dynamics: New publications are about genes that have been studied frequently in the past. Researchers predominantly publish on popular genes, although they may not necessarily be aware of this.
There is a highly significant positive correlation between the frequency at which a gene appears in the literature and the current impact of the journals in which it appears (Kendall's τ = 0.13, P < 2.2 × 10⁻¹⁶, n = 4,095). In other words, publications on popular genes appear in journals with higher impact (Fig. 2 A). However, the temporal patterns of impact (Fig. 2 B) reveal that, as the field progresses toward saturation, the reward for publishing on popular yeast genes decreases, whereas the reward for publishing on genes that have rarely or never been studied increases. By the end of the 1990s, the impact difference between publications on popular and unpopular genes disappeared. A potential explanation for this finding is that publications on popular genes face increasing competition for high-impact journals, whereas unstudied genes receive increasing interest. The positive correlation between popularity and journal impact might result from two different effects: If the first publications on a gene appear in a high-impact journal, it may become more attractive and thus more popular in the future. On the other hand, current popularity may have an influence on the impact of future publications. The latter mechanism would imply that at least until the mid-1990s, publishing on popular genes was a rewarding strategy for researchers.
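For readers unfamiliar with the statistic, a rank correlation of this kind can be computed as follows. This is a minimal sketch of Kendall's τ in its tie-corrected "tau-b" form; the text does not say which variant or software was used, so that choice is an assumption, and the function name is hypothetical. The quadratic pair count is adequate for a few thousand genes.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pair count)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: enters neither factor
            if dx == 0:
                ties_x += 1          # tied only in x
            elif dy == 0:
                ties_y += 1          # tied only in y
            elif dx * dy > 0:
                concordant += 1      # both variables order the pair alike
            else:
                discordant += 1      # the variables order the pair oppositely
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")
```

Here `x` would hold the number of publications per gene and `y` the average journal impact of those publications. In practice a library routine such as `scipy.stats.kendalltau`, which also reports a P value, would be the more convenient choice.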
Fig. 2. Impact vs. publication frequencies. (A) There is a positive correlation between the frequency at which a gene appears in the literature and the average impact of the journals in which articles on a gene are published. More popular genes are published in journals with higher impact. Genes are binned according to popularity into exponentially increasing intervals. The dots indicate the average journal impact of publications on genes in a bin. The gray lines indicate the standard deviation of the distribution. (B) The difference in the expected impact of the three most popular genes (dots) and of unpopular genes (lower 50%, crosses) decreases in the 1990s and eventually vanishes. Data are averaged over time intervals of 4 years.
To study whether our results described above are specific to S. cerevisiae, or whether they also apply to the publication dynamics of genes of other species, we performed a similar analysis for Drosophila melanogaster, Caenorhabditis elegans, and Homo sapiens. Results are shown in SI Fig. 5. For all species, we observe that $k_2$ is much larger than $k_1$, indicating that the growth of the research fields is mainly driven by the growth of research on the popular genes. The publication dynamics of Drosophila and C. elegans genes is very similar to that of yeast genes. Human genes follow a different dynamics. Most importantly, for human genes, our model cannot fully recapture the frequencies of the most popular genes. The assumption that all genes are of equal importance appears to be particularly unjustified for human genes: As mentioned above, the most frequent human genes are disease-related and therefore of high societal relevance. For the most popular genes, such differences in importance translate into an even higher popularity than can be explained by preferential attachment alone. As for S. cerevisiae, we observe a positive correlation between journal impact and popularity for C. elegans and Drosophila genes. Again, human genes differ from genes of other species, in that there is a negative correlation between journal impact and popularity. Given that there are many more publications that contain human genes in their titles or abstracts than there are for other species, it seems plausible that competition for limited space in high-impact journals plays a much larger role.
The model presented above (Eq. 2) is based on the assumption that all genes are equivalent in terms of their publication dynamics. For S. cerevisiae, Drosophila, and C. elegans, this allows the generation of patterns similar to the observed data. However, as discussed above, this does not necessarily imply there are no differences in the importance of genes in these species. Given the good fit, we can use discrepancies between model and data to detect such differences. More specifically, we can test whether a gene appeared at a specific time point in significantly more publications than would be expected based on the model. The event with the most significant deviation from expectation is the appearance of ACT1 in seven publications in 1980. This unexpected burst of publications is related to the sequencing of the yeast actin gene. A list of additional significant publication events is given in SI Table 1. Details on the methods are given in Materials and Methods.
The iHOP text-mining system also allows us to identify interactions among genes described in titles or abstracts from the PubMed database (19). More specifically, iHOP recognizes sentences of the pattern “gene/protein–verb–gene/protein,” where the verb indicates a physical interaction, such as “bind” or “interact.” For yeast, the resulting interaction network contains ≈6,500 unique interactions. The connectivity distribution of this network is dominated by a few highly connected genes (Fig. 3 A). This distribution mainly results from the differences in the popularity of genes; there is a highly significant correlation between the frequency of a gene in the literature and the number of unique interactions obtained from the literature (Kendall's τ = 0.6, P < 2.2 × 10⁻¹⁶, n = 4,171). Although identifying a large number of interaction partners requires that a gene be well studied, it is, in our view, surprising that over 3 orders of magnitude, the frequency at which a gene appears in the literature is a strong predictor of the number of unique interactions reported for this gene.
Fig. 3. The impact of popularity on the gene interaction network as obtained from PubMed data. (A) The double-logarithmic plot of the connectivity distribution indicates the presence of “hubs” with a large number of interaction partners reported in titles or abstracts of publications. (B) Popularity vs. connectivity. There is a strong correlation between the frequency at which a gene appears in the literature and the number of interactions reported in the literature. Thus, the bias in the connectivity distribution shown in A is driven by popularity rather than the presence of hubs in the underlying physical interaction network. (C) Interactions per publication. For genes with at least one interaction in the literature network, the number of interactions normalized by the number of publications follows approximately a log-normal distribution. The genes with the highest number of interactions per publication typically appear in only a single abstract that describes several interactions, i.e., the strongest deviation in distribution arises for genes that are not very well “sampled.” Genes that appear at a higher frequency show a smaller variance in the number of interactions per publication. This is consistent with a rather homogeneous connectivity distribution of the underlying network.
There are three hypotheses to explain this observation. First, genes with many interaction partners tend to become more popular. Second, most genes have a large number of interaction partners, and the more research is done, the more interaction partners are identified. Third, the more popular a gene is, the more false positives are published. All three hypotheses have interesting consequences. The first hypothesis, in our view, is the most implausible; when a gene initially attracts attention in the research community, it is not known how many interaction partners it has. However, early identification of many interaction partners might make a gene more interesting for the research community and may lead to additional publications. The second hypothesis is more plausible. More research on a gene is expected to lead to the identification of more interaction partners. Given that the correlation between popularity and number of interaction partners holds over several orders of magnitude, the second hypothesis implies that for any gene, an arbitrarily high number of interaction partners can be identified, if enough research is performed. This is in contrast to the prevailing view that the number of interactions of relevance for the functioning of a biological system is not arbitrarily high. The third hypothesis implies that interactions of more popular genes tend to be less reliable, i.e., the fraction of false findings among published interactions increases with increasing popularity. This is in line with theoretical predictions on the reliability of published research as outlined in a recent controversial study by J. P. A. Ioannidis (23). In contrast to the first hypothesis, the second and third hypotheses imply that hubs in the literature are not necessarily hubs in the underlying biochemical networks. Thus, they question the common view that biochemical networks are “scale-free”, which implies the presence of hubs. 
At present, we cannot distinguish among the three hypotheses. Given the importance of this point for the research community, this remains a highly interesting question for future studies.
Hubs have been observed in interaction networks derived from high-throughput methods (24–27). Although in contrast to the literature network, these networks are not affected by a “popularity bias,” they also do not give an unbiased picture of the underlying biochemical network. Factors such as expression level or function generate a bias on the number of observed interaction partners (28). Given that the overlap between different high-throughput methods is relatively small (28), and the correlations between the connectivity of a gene as obtained by one vs. another unrelated method are weak (29), it is questionable whether high-throughput methods give a reliable picture for the presence of hubs in the underlying biochemical network. Our findings allow correction of the literature network for the “popularity bias” and illustrate that, when studying statistical properties of networks, it is essential to correct for biases that may arise from the methods used to generate the network. Results from high-throughput mass spectroscopy or tandem affinity purification, for example, should be corrected for expression levels. This is not always done; research appears to be influenced by the high popularity of “small-world” networks.
Discussion
It has been recognized that sociological processes such as trends and conventions play a role in the progression of science (1). Our approach of using a large-scale data set on the dynamics of content-related terms in the scientific literature is a first step to quantitatively describing the mechanisms involved in this process. We show that researchers predominately publish on genes that already appear at high frequency in the literature. This process leads to a frequency distribution of genes in scientific publications that resembles a power law. It has been argued that a similar process contributes to the frequency distribution of citations of papers: researchers predominately cite papers cited by other papers, partially because they search the literature recursively, and because they copy references from other papers (14, 16, 30). As for the genes in our analysis, the popularity of research papers is driven not only by importance but also by social processes.
Given that journal impact is often used to evaluate researchers, the positive correlation between the popularity of a gene and the journal impact of its publications may indicate that publishing on more popular genes is a rewarding strategy. However, there may also be strategic disadvantages associated with performing research on popular genes. The chance that competitors perform research on the same question can be expected to be greater for popular than for unpopular genes. Competition for the limited space in high-impact journals might be stronger, because for more popular genes, a larger number of publications on similar questions might be submitted. Furthermore, it might be more difficult to convince reviewers that a contribution on a popular gene adds sufficient novel findings to the existing body of knowledge. In contrast, research on novel or unpopular genes represents a strategy with higher risks but also higher potential outcome. If successful, research on novel genes might be perceived as important pioneer work. The optimal strategy of a researcher may depend on his/her career stage. It might be a safe strategy for a young researcher to work with established scientists on established topics. On the other hand, at some point, it is important for a researcher to be perceived as independent and to associate his/her name with a novel research topic. Furthermore, our results indicate that the stage of the research field also influences the success of a research strategy. Novel research topics seem particularly advantageous in the phase of saturation.
It is unclear whether researchers are able to determine optimal research strategies, and whether they indeed choose their research topics accordingly. Even if researchers use strategies that are optimized under the costs and benefits described above, it is questionable whether their behavior optimizes the way knowledge is established. As illustrated for the literature interaction network, differences in popularity may translate into potentially problematic biases in the research field. Furthermore, researchers may be under pressure to popularize their findings at least within the research community, which may facilitate overinterpretation of results. It therefore remains a very challenging task to make sure that the interests of individual researchers are not at odds with the interests of the research community.
Materials and Methods
In the following, we describe the procedures for estimating parameters from the data, simulating the process, and determining unexpected publication events.
The expected number of publications on gene $i$ at time $t$, $\langle \Delta p_{i,t} \rangle$, is given by Eq. 2. The observed number of publications on gene $i$ at time $t$ is denoted by $\Delta p_{i,t}$. We assume that the number of observed publications follows a Poisson distribution given by $f(\lambda; n) = e^{-\lambda} \lambda^{n} / n!$. The likelihood $L(k_1, k_2, k_3, P_S, \alpha)$ of the data is given by the product $L(k_1, k_2, k_3, P_S, \alpha) = \prod f(\lambda = \langle \Delta p_{i,t} \rangle; n = \Delta p_{i,t})$ over all genes $i = 0 \ldots N$ and all time points $t = t_0 \ldots t_{\max}$. We assume there are 6,200 yeast genes and thereby account for ≈2,000 genes that have never appeared in a title or an abstract. We use annual data as obtained from PubMed. Because the entire publication history is required, in principle, for calculating the expected number of publications (see Eq. 2), we use the first 4 years (1975–1978) as an approximation of the initial publication history and maximize the likelihood $L$ over the time span from $t_0$ = 1979 to $t_{\max}$ = 2005. We furthermore exclude all genes that appear >20 times in the first 4 years, because these genes likely have a considerable pre-1975 history. (For yeast, this applies to five genes: PGK1, ADH1, COB, HXK1, and PFK1. All these genes code for major enzymes in yeast metabolism, a topic that has a considerably longer history than yeast genetics.) To determine the maximum-likelihood estimators of the parameters, we numerically maximize $\log(L(k_1, k_2, k_3, P_S, \alpha))$ using the R function optim. To estimate confidence intervals, we generate 250 data sets by sampling (with replacement) a number of genes from the original data set and reestimate the parameters as described above.
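The estimation procedure can be sketched in a few lines. The code below is an illustrative Python sketch, not the R code used for the analysis: it assumes the saturating form (k1·P*_t + k2·P_{i,t} + k3) / [1 + (P*_t/PS)^α] for the expected counts, assumes that each bootstrap sample contains as many genes as the original data set (the text does not state the number), and omits the exclusion of genes with a long pre-1975 history. All function and variable names are hypothetical.

```python
import math
import random


def poisson_log_likelihood(counts, params, n_init=4):
    """Log-likelihood of annual publication counts under the saturating model.

    counts : list of per-gene lists; counts[i][t] is the number of new
             publications on gene i in year t (never-mentioned genes are
             all-zero rows).
    params : (k1, k2, k3, p_s, alpha); k3 > 0 keeps every rate positive.
    n_init : leading years used only as publication history (4 in the text).
    """
    k1, k2, k3, p_s, alpha = params
    n_genes = len(counts)
    totals = [sum(row[:n_init]) for row in counts]  # cumulative counts so far
    log_l = 0.0
    for t in range(n_init, len(counts[0])):
        mean_total = sum(totals) / n_genes
        saturation = 1.0 / (1.0 + (mean_total / p_s) ** alpha)
        for i, row in enumerate(counts):
            lam = (k1 * mean_total + k2 * totals[i] + k3) * saturation
            n = row[t]
            # log of the Poisson probability e^-lam * lam^n / n!
            log_l += n * math.log(lam) - lam - math.lgamma(n + 1)
        # Update the history only after all genes of year t are scored.
        for i, row in enumerate(counts):
            totals[i] += row[t]
    return log_l


def bootstrap_resample(counts, rng):
    """Draw as many genes as in `counts`, with replacement (an assumption)."""
    return rng.choices(counts, k=len(counts))


# Example: one bootstrap replicate of a tiny made-up data set.
rng = random.Random(0)
toy_counts = [[0, 0, 1, 0, 2, 3], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0]]
replicate = bootstrap_resample(toy_counts, rng)
```

To obtain estimates, the negative of this log-likelihood would be handed to a numerical optimizer (the text uses R's `optim`; `scipy.optimize.minimize` would be one Python counterpart), once for the original data and once for each of the 250 resampled data sets.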
We simulate the data set by successively calculating the expected number of publications $\langle \Delta p_{i,t} \rangle$ for each gene in each year (Eq. 2) and then generating $\Delta p_{i,t}$ by drawing a random number from a Poisson distribution with $\lambda = \langle \Delta p_{i,t} \rangle$. Again, we use the first 4 years as input and simulate the process for the time span from 1979 to 2005.
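The simulation loop described here could look roughly as follows. This is a hedged sketch that assumes the saturating form (k1·P*_t + k2·P_{i,t} + k3) / [1 + (P*_t/PS)^α] for the expected counts and uses Knuth's multiplication method as a simple stand-in Poisson sampler, which is adequate for the small yearly means that occur here. The names `poisson_draw` and `simulate` are hypothetical.

```python
import math
import random


def poisson_draw(lam, rng):
    """Sample from a Poisson distribution with mean lam (Knuth's method)."""
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    # Multiply uniforms until the product drops to e^-lam or below.
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate(initial_totals, n_years, params, rng):
    """Return cumulative publication counts per gene after each year."""
    k1, k2, k3, p_s, alpha = params
    totals = list(initial_totals)
    history = [list(totals)]
    for _ in range(n_years):
        mean_total = sum(totals) / len(totals)
        saturation = 1.0 / (1.0 + (mean_total / p_s) ** alpha)
        new = [poisson_draw((k1 * mean_total + k2 * p + k3) * saturation, rng)
               for p in totals]
        totals = [p + d for p, d in zip(totals, new)]
        history.append(list(totals))
    return history


# 27 simulated years (1979-2005) for 50 genes that start without publications,
# using the yeast estimates reported in Results.
run = simulate([0] * 50, 27, (0.028, 0.2, 0.005, 8.2, 1.2), random.Random(0))
```

In the analysis itself the starting totals would be the observed counts from 1975–1978 rather than zeros.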
To test whether a gene in a specific year appears at a significantly higher frequency than expected based on the model (Eq. 2), we calculate the probability P that for an expected number of publications $\langle p_{i,t} \rangle$, the number of publications is equal to or larger than the observed number of publications $p_{i,t}$. The list of all events with P < 0.05/$n_T$ is given in SI Table 1. We perform tests only for the time span from 1979 to 2005. ($n_T$ denotes the number of tests we perform and is 167,400 = 27 years × 6,200 genes. The P value given above corresponds to a conservative Bonferroni correction.) Calculations for the three-parameter model (SI Fig. 4) are done analogously, using Eq. 1 instead of Eq. 2.
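The test amounts to an upper-tail Poisson probability compared against a Bonferroni-corrected threshold. A minimal sketch, with hypothetical function names and with the expected values in the example chosen purely for illustration (they are not model output):

```python
import math


def poisson_upper_tail(n_obs, lam):
    """P(X >= n_obs) for X ~ Poisson(lam), summed directly over the tail."""
    if lam <= 0.0:
        return 1.0 if n_obs <= 0 else 0.0
    total = 0.0
    k = max(n_obs, 0)
    while True:
        # Poisson probability of exactly k events, computed in log space.
        term = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
        total += term
        # Past the mode the terms shrink; stop once they no longer matter.
        if k > lam and term <= total * 1e-17:
            break
        k += 1
    return total


N_TESTS = 27 * 6200           # years x genes, as stated in the text
THRESHOLD = 0.05 / N_TESTS    # Bonferroni-corrected significance level


def is_unexpected(n_obs, lam):
    """True if n_obs publications are significantly more than expected."""
    return poisson_upper_tail(n_obs, lam) < THRESHOLD


# Seven publications in one year against a hypothetical expectation of 0.3.
flag = is_unexpected(7, 0.3)
```

Summing the tail directly, rather than computing one minus the cumulative probability, avoids cancellation errors for the very small P values that the corrected threshold (about 3 × 10⁻⁷) requires.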
Acknowledgments
We thank C. T. Bergstrom, B. Kerr, J. West, F. Taddei, and R. May for inspiring discussions and helpful comments. We gratefully acknowledge support from Society in Science/The Branco Weiss Fellowship.
Footnotes
- ‡To whom correspondence should be addressed. E-mail: pfeiffer{at}fas.harvard.edu
- Author contributions: T.P. designed research; T.P. and R.H. performed research; T.P. and R.H. analyzed data; and T.P. wrote the paper.
- The authors declare no conflict of interest.
- This article is a PNAS Direct Submission.
- This article contains supporting information online at www.pnas.org/cgi/content/full/0701315104/DC1.
- © 2007 by The National Academy of Sciences of the USA








