Research Article

Patterns of text reuse in a scientific corpus

Daniel T. Citron and Paul Ginsparg
Departments of Physics and Information Science, Cornell University, Ithaca, NY 14853

PNAS first published December 8, 2014; https://doi.org/10.1073/pnas.1415135111
Edited* by William H. Press, University of Texas at Austin, Austin, TX, and approved November 6, 2014 (received for review August 7, 2014)

Significance

In the modern electronic format it is both easier to reuse text and easier to detect reused text. This is the first comprehensive study of patterns of text reuse within the full texts of an important large scientific corpus, covering a 20-year timeframe. It provides an important baseline for what is regarded as standard practice within the affected research communities, a standard somewhat more lenient than that currently applied to journalists, popular authors, and public figures.

Abstract

We consider the incidence of text “reuse” by researchers via a systematic pairwise comparison of the text content of all articles deposited to arXiv.org from 1991 to 2012. We measure the global frequencies of three classes of text reuse and measure how chronic text reuse is distributed among authors in the dataset. We infer a baseline for accepted practice, perhaps surprisingly permissive compared with other societal contexts, and a clearly delineated set of aberrant authors. We find a negative correlation between the amount of reused text in an article and its influence, as measured by subsequent citations. Finally, we consider the distribution of countries of origin of articles containing large amounts of reused text.

  • arXiv
  • plagiarism
  • text mining
  • n-grams

Footnotes

  • ¹To whom correspondence should be addressed. Email: ginsparg@cornell.edu.
  • Author contributions: P.G. designed research; D.T.C. and P.G. performed research; D.T.C. and P.G. analyzed data; and D.T.C. and P.G. wrote the paper.

  • The authors declare no conflict of interest.

  • *This Direct Submission article had a prearranged editor.

  • †Commercial resources, such as iThenticate, use a much larger dataset. See in particular CrossCheck (13), which implements iThenticate for research publications and is used by member publishers to screen journal submissions (14, 15). That coverage is still far less comprehensive than that available via commercial search engines, as assessed by comparing to results from the Google custom search API.

  • ‡The number of article pairs with at least 10 7-grams in common is of order 600,000, about 2 per million of the total possible (757,000)^2/2 ≈ 278 billion article pairs.

  • §Recall that the vast majority of arXiv submissions appear in the conventional peer-reviewed literature, with the primary exceptions being theses, conference proceedings, lectures, and other “review-type” materials discussed earlier (and excluded from subsequent analysis).

  • ¶Review articles pose an additional challenge, because standard software used to include PDF figures from other articles sometimes carries along “hidden” text surrounding the figure from its original context, invisible to the author and reader in the new context but nonetheless seen by the PDF-to-text converter and flagged as a large text overlap.

  • ‖This happened historically when users inadvertently created a submission with a new identifier rather than using the replace function to create a new version of an existing submission, with the same identifier. This problem has been largely eliminated by the daily overlap screening, with submitters now instructed to replace an existing submission if excessive overlap is detected.

  • ††As discussed earlier, there is no systematic scan for text copied from sources outside of arXiv, and no attempt to detect “plagiarism” as more generally defined, as unattributed use of ideas independent of copied text. The exceptions described earlier for review articles, theses, conference proceedings, book contributions, multipart articles, and so on, are respected, so that common-authored overlaps are not flagged in cases that seem to be accepted as common practice.

  • ‡‡After the completion of this work, we discovered significantly higher rates of text reuse specifically in computer science articles published in predatory open-access journals (articles largely received after the mid-2012 timeframe of the dataset analyzed here). We defer to any later work a more discipline-specific assessment of the issues.

  • This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1415135111/-/DCSupplemental.
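
The footnotes above measure overlap between two articles as the number of 7-grams they have in common. A minimal sketch of that count follows, assuming a 7-gram means a sequence of seven consecutive words and assuming a simple lowercase alphanumeric tokenization. The function names `word_ngrams` and `shared_ngram_count` are hypothetical, and this is an illustration of the general technique rather than the authors' actual pipeline.

```python
import re
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 7) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-word sequences found in `text`.

    Tokenization is an assumption made for illustration: lowercase the
    text and keep runs of letters and digits as words.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    # A text shorter than n words yields an empty range, hence an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngram_count(text_a: str, text_b: str, n: int = 7) -> int:
    """Count the distinct n-grams that appear in both texts."""
    return len(word_ngrams(text_a, n) & word_ngrams(text_b, n))


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog near the river bank"
    b = "yesterday the quick brown fox jumps over the lazy dog ran away"
    # The two texts share a nine-word run, which contains three 7-grams.
    print(shared_ngram_count(a, b))
```

Counting distinct n-grams with sets keeps the comparison of one pair cheap, but comparing every pair in a corpus this way grows quadratically with the number of articles, so a corpus-scale comparison would likely need an index from n-grams to the articles containing them.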
