# Three dimensions of scientific impact

^{a}Faculty of Physics, Warsaw University of Technology, 00-662 Warsaw, Poland;^{b}Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland;^{c}Faculty of Mathematics and Information Science, Warsaw University of Technology, 00-662 Warsaw, Poland;^{d}School of Information Technology, Deakin University, Geelong, VIC 3220, Australia

Edited by Anthony Van Raan, Leiden University, Leiden, The Netherlands, and accepted by Editorial Board Member Adrian E. Raftery May 11, 2020 (received for review January 18, 2020)

## Significance

What are the mechanisms behind one’s research success as measured by one’s papers’ citability? By acknowledging that perceived esteem might be a consequence not only of how valuable one’s works are but also of pure luck, we arrived at a model that can accurately recreate a citation record based on just three parameters: the number of publications, the total number of citations, and the degree of randomness in the citation patterns. As a by-product, we show that a single index will never be able to capture the complex reality of scientific impact. However, three of them can already provide us with a reliable summary.

## Abstract

The growing popularity of bibliometric indexes (whose most famous example is the *h* index by J. E. Hirsch [J. E. Hirsch, *Proc. Natl. Acad. Sci. U.S.A.* 102, 16569–16572 (2005)]) is opposed by those claiming that one’s scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative description. We argue that neither of the two controversial extremes is true. By assuming that some citations are distributed according to the rich get richer rule (success breeds success, preferential attachment) while some others are assigned totally at random (all in all, a paper needs a bibliography), we have crafted a model that accurately summarizes citation records with merely three easily interpretable parameters: productivity, total impact, and how lucky an author has been so far.

Ever since Garfield’s (1) impact factor for journals and Hirsch’s (2) h index for individual researchers, the popularity of bibliometric impact measures has been growing rapidly. The fact that they summarize one’s scientific performance with just a single number is appealing to many. However, some argue (3) that the nature of scientific activities is too multidimensional for such a simple description to be possible and a few quantitative metrics will never be sufficient to capture this complex reality in its entirety.

In this paper we address this issue from the perspective of the increasingly popular science of science (Sci-Sci) (4, 5) approach, which can be dated back to the classical book by de Solla Price, *Little Science, Big Science* (6). The modern Sci-Sci utilizes complex systems methodology and can be considered a fusion of agent-based modeling and big data analysis.

We have developed a model of an author’s research activity that is based on two simple assumptions: 1) In each time step one new paper is added into the simulation. 2) Each newly added paper cites the existing publications according to a combination of a) the preferential attachment rule—highly cited papers are more likely to attract even more citations [compare the rich get richer mechanism (7), the success breeds success phenomenon (8), and the effect of a scientist’s reputation (9)]—and b) sheer chance—papers might be discovered by the citing authors by accident or be included in the bibliography completely at random.

While the importance of the rich get richer rule (7) in bibliometrics is unquestionable [first part of Merton’s (10) Matthew effect, referred to as the cumulative advantage process by de Solla Price (8) or success-breeds-success phenomenon (6, 11), confirmed experimentally (12)], we argue here that a purely preferential model is incapable of explaining our reality well enough and the accidental component is necessary (13, 14).

Furthermore, in our case we adopt different levels of analysis [as known from social sciences (15)] (Fig. 1) for generated bibliometric data. Agent-based models are formulated at the microlevel—from the perspective of an individual paper. The Sci-Sci perspective usually investigates the structure of the citation network in its entirety, for instance to describe general citation patterns across the whole scientific discipline (macrolevel). Here we are mainly focusing on the rarely considered mesolevel (Table 1), which is the perspective of a single scientist, i.e., a small-sample one. As such, the above publication–citation process can be thought of as an extension of the iterative procedure known as the Ionescu–Chopard model (16, 17) (*Materials and Methods*, *Model Description*).

## Model Derivation

Assume

Our model, on the other hand, not only has a clear interpretation (recall the two simple assumptions above), but also provides high-accuracy approximations of citation records of individuals. Due to this, we are able to describe this complex reality with merely three self-explanatory parameters: the number of papers N; the total number of citations C; and ρ, the ratio of the number of preferential citations to the total number of citations.

For the derivation of the model please refer to *Materials and Methods*, *Model Description*. The citation process proposed above, after all of the N papers have been published and all of the citations have been distributed, yields an analytic formula for the estimated number of citations of the kth most cited paper (*Materials and Methods*, *Exact Solution of the Model*).
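As a hedged illustration, the sketch below evaluates one closed form that is consistent with the process described in *Materials and Methods*. It rests on assumptions that the surviving text does not spell out: every step hands out C/N citations, the accidental share (1 − ρ) of them is split equally among the t existing papers, and the preferential share ρ is split in proportion to the counts reached after the accidental part. Under those assumptions the kth most cited paper ends up with X_k = (1 − ρ)/ρ · C/N · (∏_{i=k}^{N} i/(i − ρ) − 1) citations. The function name `citation_curve` is hypothetical.

```python
from typing import List


def citation_curve(n_papers: int, total_citations: float, rho: float) -> List[float]:
    """Estimated citations of the k-th most cited paper for k = 1..n_papers.

    Assumed closed form (a reconstruction under the assumptions in the lead-in):
        X_k = (1 - rho) / rho * (C / N) * (prod_{i=k}^{N} i / (i - rho) - 1)
    """
    if n_papers < 1:
        raise ValueError("n_papers must be positive")
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")

    scale = (1.0 - rho) / rho * (total_citations / n_papers)
    curve = [0.0] * n_papers
    running_product = 1.0
    # Walk from k = N down to k = 1 so each factor i / (i - rho) is reused.
    for k in range(n_papers, 0, -1):
        running_product *= k / (k - rho)
        curve[k - 1] = scale * (running_product - 1.0)
    return curve
```

For instance, `citation_curve(50, 1000.0, 0.7)` returns 50 strictly decreasing values whose sum equals 1000 up to rounding, so the three parameters fix both the total and the shape of the record.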

## Dataset Description

To demonstrate the usefulness of the model, we study the DBLP Computer Science Bibliography (47) dataset of computer science papers; see *Materials and Methods*, *Data Availability* for a description. We consider citation records of all 123,621 scholars whose h index is at least 5. To determine the three model parameters characterizing each author, we omit the papers with no citations (as overfitting to a tail composed of zeros cannot lead to a good overall description). Then we compute the author’s N (number of papers that were cited at least once) and C (the total number of citations) and then estimate ρ using the least-squares fit with respect to the Cauchy loss.
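The estimation step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the Cauchy loss takes the common form ln(1 + r²) applied to raw residuals r between the modeled and observed citation counts, it assumes the closed-form curve X_k = (1 − ρ)/ρ · C/N · (∏_{i=k}^{N} i/(i − ρ) − 1), and it replaces a numerical optimizer with a grid search followed by a local ternary search. The names `model_curve`, `cauchy_loss`, and `fit_rho` are hypothetical.

```python
import math
from typing import List, Sequence


def model_curve(n_papers: int, total_citations: float, rho: float) -> List[float]:
    """Assumed closed-form citation curve for k = 1..n_papers (see lead-in)."""
    scale = (1.0 - rho) / rho * (total_citations / n_papers)
    curve = [0.0] * n_papers
    running_product = 1.0
    for k in range(n_papers, 0, -1):
        running_product *= k / (k - rho)
        curve[k - 1] = scale * (running_product - 1.0)
    return curve


def cauchy_loss(observed: Sequence[float], rho: float) -> float:
    """Sum of ln(1 + r^2) over residuals; N and C are taken from the data."""
    predicted = model_curve(len(observed), sum(observed), rho)
    return sum(math.log1p((p - o) ** 2) for p, o in zip(predicted, observed))


def fit_rho(citations: Sequence[float], grid_size: int = 999,
            refine_steps: int = 40) -> float:
    """Estimate rho in (0, 1) for one author's citation record."""
    # Drop uncited papers and sort from the most to the least cited paper.
    observed = sorted((c for c in citations if c > 0), reverse=True)
    if not observed:
        raise ValueError("at least one cited paper is required")

    # Coarse grid search over the open interval (0, 1).
    step = 1.0 / (grid_size + 1)
    grid = [(i + 1) * step for i in range(grid_size)]
    best = min(grid, key=lambda r: cauchy_loss(observed, r))

    # Local ternary search around the best grid point.
    lo, hi = max(best - step, 1e-9), min(best + step, 1.0 - 1e-9)
    for _ in range(refine_steps):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if cauchy_loss(observed, m1) < cauchy_loss(observed, m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

Because N and C are read directly off the record, only ρ is searched for. The Cauchy loss grows logarithmically in the residual, so a few heavily cited papers influence the fit less than they would under a plain squared loss.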

Once we obtain an author’s N, C, and ρ, we can reproduce the author’s citation record quite accurately (Fig. 2). The high variance of ρ for each fixed N and C (Fig. 3) indicates that this parameter is necessary for a precise description of data. This suggests that indeed the modeled reality might be three-dimensional (3D), which roughly agrees with the estimates in ref. 48.

## Results and Discussion

It turns out that ca.

By indicating that the citation record space is 3D, we have proved that any single citation measure, including the h index and the author’s ranking it generates, necessarily yields an oversimplified projection of a more complex space (3). In other words, whenever one chooses a single citation index, some information must inherently be lost; we will never be able to see the whole picture through the lenses of any single measure.

The proposed model emphasizes the use of multiple indexes in the evaluation of scientific work. We have indicated that merely three parameters are sufficient to provide an accurate description of our reality. In the near future, we plan to perform a broad study of bibliometric indexes to come up with an intuitive and insightful classification for which of the three dimensions each index focuses on the most. This will allow policy makers to make better-informed decisions when choosing particular evaluation tools. The questions of how to best combine N, C, and ρ to cause the least information loss and how well popular citation indexes perform with regard to the quality of data approximation will also be explored.

## Materials and Methods

### Model Description.

Let us introduce the proposed model in a formal manner. For the description of the citation dynamics we use the following parameters: the total number of papers N, the total number of citations C that will be distributed among all papers, and the ratio ρ of the number of preferential citations to the total number of citations.

Due to the assumed boundary conditions in Eq. **3**, we disallow both

The stages of the model’s simulation are strictly connected to the scientific activity of the considered author. Each of the N steps corresponds to the publication of one of the author’s papers. At the tth step, the t articles already in existence are to receive

Note that both

The rate equation for the number of citations of the kth most cited paper at the tth stage of the simulation is given by Eq. **2**. In the preferential part of that equation, we assume that accidental citations are distributed first to avoid singularities with the very natural boundary conditions of the form given by Eq. **3**. This explains the occurrence of the accidental term inside the preferential part.
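The iterative procedure described above can be sketched as a deterministic, expected-value simulation. The details below are assumptions: each of the N steps distributes C/N citations, the accidental share (1 − ρ) is split equally among the t papers in existence, and the preferential share ρ is then split in proportion to the updated counts. The function name `simulate_citations` is hypothetical.

```python
from typing import List


def simulate_citations(n_papers: int, total_citations: float, rho: float) -> List[float]:
    """Expected citation counts after n_papers steps, oldest paper first."""
    if n_papers < 1:
        raise ValueError("n_papers must be positive")
    if not 0.0 <= rho < 1.0:
        # rho = 1 would divide by zero: with no accidental citations,
        # the first paper never receives anything to attach to.
        raise ValueError("rho must satisfy 0 <= rho < 1")

    per_step = total_citations / n_papers
    accidental = (1.0 - rho) * per_step
    preferential = rho * per_step

    counts: List[float] = []
    for t in range(1, n_papers + 1):
        counts.append(0.0)  # the t-th paper enters with no citations
        # Accidental citations are distributed first, equally among t papers.
        counts = [x + accidental / t for x in counts]
        # Preferential citations follow, in proportion to current counts.
        current_total = sum(counts)
        counts = [x + preferential * x / current_total for x in counts]
    return counts
```

Distributing the accidental part first guarantees that every paper, including the newest one, has a nonzero count before the proportional split, which is why the sketch rejects ρ = 1. Since older papers never fall behind newer ones in this expected-value version, the returned list is already ordered from the most to the least cited paper.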

### Exact Solution of the Model.

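Below we derive the exact formula for the number of citations of the kth most cited paper after all N papers have been published. The rate equation, Eq. **2**, can first be simplified, and the simplified form can then be applied to itself recursively through the intermediate steps in Eqs. **4**, **5**, **7**, and **8**. We can stop the nesting procedure by using the boundary conditions given by Eq. **3**. The final formula is given by Eq. **11** and is expressed through gamma functions. Due to Eq. **6**, we can substitute the gamma functions with a product, and applying this substitution to Eq. **11** yields the formula referred to in *Model Derivation*.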
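One way to carry out such a derivation is sketched below. It is a reconstruction under assumptions, not necessarily the authors' exact sequence of equations. It assumes that each step distributes $m = C/N$ citations, that the accidental part $(1-\rho)m$ is shared equally by the $t$ existing papers, and that the preferential part $\rho m$ is shared in proportion to the counts after the accidental part. The symbol $X_k(t)$, denoting the citations of the $k$th paper after step $t$, and the boundary condition $X_k(k-1) = 0$ are notation chosen for this sketch.

```latex
\begin{align*}
% Total citations just before the preferential split at step t:
% (t-1)m + (1-\rho)m = (t-\rho)m.
X_k(t) &= \left(X_k(t-1) + \frac{(1-\rho)m}{t}\right)
          \left(1 + \frac{\rho m}{(t-\rho)m}\right)
        = \left(X_k(t-1) + \frac{(1-\rho)m}{t}\right)\frac{t}{t-\rho}, \\
% First step for paper k, using X_k(k-1) = 0:
X_k(k) &= \frac{(1-\rho)m}{k-\rho}
        = \frac{(1-\rho)m}{\rho}\left(\frac{k}{k-\rho} - 1\right), \\
% Induction on t gives the state after the last step:
X_k(N) &= \frac{1-\rho}{\rho}\,\frac{C}{N}
          \left(\prod_{i=k}^{N}\frac{i}{i-\rho} - 1\right), \\
% Gamma-function form, from \Gamma(z+1) = z\,\Gamma(z):
\prod_{i=k}^{N}\frac{i}{i-\rho}
       &= \frac{\Gamma(N+1)\,\Gamma(k-\rho)}{\Gamma(k)\,\Gamma(N+1-\rho)}.
\end{align*}
```

The induction step uses the identity $\frac{t}{t-\rho} - \frac{\rho}{t-\rho} = 1$: if $X_k(t-1) = \frac{(1-\rho)m}{\rho}(P_{t-1} - 1)$ with $P_{t-1} = \prod_{i=k}^{t-1} \frac{i}{i-\rho}$, then multiplying by $\frac{t}{t-\rho}$ and adding the accidental contribution gives the same expression with $P_t$ in place of $P_{t-1}$.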

### Data Availability.

The empirical data analysis in this paper is based on the DBLP V10 bibliography database (47) (https://aminer.org/citation), consisting of 3,079,007 papers and 25,166,994 citation relationships. DBLP includes most of the journals related to computer science. It also tracks numerous conference proceedings papers from the field.

We have extracted citation records of 1,762,044 authors. Most of them have published a small number of papers or have received very few citations. Therefore, we restricted the analysis to the subset of researchers with an h index of at least 5. This gave 123,621 citation records. Moreover, papers with 0 citations have been omitted from the analysis, as they are problematic when performing computations on the log scale. Note that most impact indexes, including the h index, ignore zeros anyway.
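The selection rule above can be sketched in a few lines. This is an illustration rather than the code from the repository: `h_index` follows the standard definition (the largest h such that h papers have at least h citations each), and `prepare_record` is a hypothetical helper that applies the h index threshold and drops uncited papers.

```python
from typing import List, Optional, Sequence


def h_index(citations: Sequence[int]) -> int:
    """Largest h such that at least h papers have h or more citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def prepare_record(citations: Sequence[int], min_h: int = 5) -> Optional[List[int]]:
    """Return the sorted record without uncited papers, or None if h < min_h."""
    if h_index(citations) < min_h:
        return None
    return sorted((c for c in citations if c > 0), reverse=True)


record = prepare_record([9, 7, 6, 5, 5, 0, 0])
if record is not None:
    n_papers = len(record)        # N: papers cited at least once
    total_citations = sum(record) # C: total number of citations
```

Dropping zeros does not change the h index, because an uncited paper can never satisfy the condition of having at least as many citations as its rank.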

The raw citation sequences, estimated parameters, and source code used to perform the data analysis can be accessed at the GitHub repository: https://github.com/gagolews/three_dimensions_of_scientific_impact (51).

## Acknowledgments

We thank Maciej J. Mrowiński, Tessa Koumoundouros, and the reviewers for valuable feedback and constructive remarks.

## Footnotes

^{1}To whom correspondence may be addressed. Email: grzegorz.siudem@pw.edu.pl.

Author contributions: G.S., B.Ż.-S., A.C., and M.G. designed research; G.S., B.Ż.-S., A.C., and M.G. performed research; B.Ż.-S., A.C., and M.G. analyzed data; and G.S., B.Ż.-S., A.C., and M.G. wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission. A.V.R. is a guest editor invited by the Editorial Board.

Published under the PNAS license.

## References

- E. Garfield
- J. E. Hirsch
- M. Gagolewski
- A. Clauset, D. B. Larremore, R. Sinatra
- S. Fortunato et al.
- D. J. de Solla Price
- A. M. Petersen et al.
- R. K. Merton
- A. van de Rijt, S. M. Kang, M. Restivo, A. Patil
- H. M. Blalock
- G. Ionescu, B. Chopard
- B. Żogała-Siudem, G. Siudem, A. Cena, M. Gagolewski
- L. Egghe
- K. Sangwal
- M. Thelwall
- F. Radicchi, S. Fortunato, C. Castellano
- S. Redner
- M. L. Wallace, V. Larivière, Y. Gingras
- M. Brzezinski
- T. Fenner, M. Levene, G. Loizou
- M. Thelwall
- M. Thelwall, P. Wilson
- M. Thelwall
- A. L. Barabási
- S. Thurner, F. Kyriakopoulos, C. Tsallis
- E. A. Leicht, G. Clarkson, K. Shedden, M. E. Newman
- A. Barabási et al.
- A. L. Barabási, R. Albert, H. Jeong
- Z. G. Shao, X. W. Zou, Z. J. Tan, Z. Z. Jin
- Z. G. Shao, T. Chen, B.-q. Ai
- M. L. Goldstein, S. A. Morris, G. G. Yen
- Z. X. Wu, P. Holme
- Z. Xie, Z. Ouyang, P. Zhang, D. Yi, D. Kong
- L. Zalányi et al.
- M. Golosovsky, S. Solomon
- J. Tang et al.
- J. R. Clough, T. S. Evans
- R. Heesen
- F. W. J. Olver et al.
- G. Siudem, B. Żogała-Siudem, A. Cena, M. Gagolewski

## Article Classifications

- Physical Sciences
- Applied Mathematics

- Social Sciences
- Social Sciences