Letter

Note on the quadratic penalties in elastic weight consolidation

Ferenc Huszár
Twitter, London W1B 5AG, United Kingdom


PNAS March 13, 2018 115 (11) E2496-E2497; first published February 20, 2018; https://doi.org/10.1073/pnas.1717042115

Catastrophic forgetting is an undesired phenomenon that occurs when neural networks are trained on different tasks sequentially. Elastic weight consolidation (EWC; ref. 1), published in PNAS, is a novel algorithm designed to safeguard against it. For all its satisfying simplicity, EWC is remarkably effective.

Motivated by Bayesian inference, EWC adds quadratic penalties to the loss function when learning a new task. These penalties approximate the loss surfaces of previous tasks. The authors derive the penalty for the two-task case and then extrapolate to multiple tasks. I believe, however, that the penalties for multiple tasks are applied inconsistently.
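To make the setup concrete, here is a minimal sketch of the multi-penalty objective as described in ref. 1, written in NumPy under the assumption that Fisher matrices are diagonal and stored as vectors. The function name and the list-of-penalties representation are illustrative, not taken from any published implementation.

```python
import numpy as np

def ewc_loss(theta, task_loss, penalties, lam=1.0):
    """Current task loss plus one quadratic EWC penalty per previous
    task. `penalties` holds (F, theta_star) pairs, where F is a
    diagonal Fisher information (a vector) and theta_star the
    parameters found after training on that task."""
    total = task_loss(theta)
    for F, theta_star in penalties:
        total += 0.5 * lam * np.sum(F * (theta - theta_star) ** 2)
    return total
```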

In ref. 1 a separate penalty is maintained for each task $T$, centered at $\theta_T^*$, the value of $\theta$ obtained after training on task $T$. When these penalties are combined (assuming $\lambda_T = 1$), the aggregate penalty is anchored at

$$\mu_T = (F_A + F_B + \dots + F_T)^{-1}\,(F_A \theta_A^* + F_B \theta_B^* + \dots + F_T \theta_T^*).$$

From the third task onward this is inconsistent with Bayesian inference. In the Bayesian paradigm the posterior $p(\theta \mid \mathcal{D}_A, \mathcal{D}_B)$ encapsulates the agent's experience of both tasks A and B, thus rendering the previous posterior $p(\theta \mid \mathcal{D}_A)$ irrelevant. Analogously, since $\theta_B^*$ was obtained while incorporating the penalty around $\theta_A^*$, once we have $\theta_B^*$, $\theta_A^*$ is no longer needed.
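The anchor $\mu_T$ above is simply the minimizer of the summed quadratic penalties. A short sketch, under the same diagonal-Fisher assumption as before (names are illustrative):

```python
import numpy as np

def combined_anchor(penalties):
    """Minimizer of 0.5 * sum_T F_T (theta - theta_T*)^2 for diagonal
    F_T stored as vectors:
    mu = (F_A + ... + F_T)^-1 (F_A theta_A* + ... + F_T theta_T*)."""
    F_sum = sum(F for F, _ in penalties)
    weighted = sum(F * theta_star for F, theta_star in penalties)
    return weighted / F_sum  # elementwise; assumes F_sum > 0 everywhere
```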

The correct form of the penalties can be obtained by recursive application of the two-task derivation (2). It turns out that a single penalty suffices: its center is the latest optimum $\theta_T^*$ and its weights are the sum of the diagonal Fisher information matrices from previous tasks, $F_A + F_B + \dots + F_T$. This single-penalty version is akin to Bayesian online learning (3).
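As a rough sketch of how this single-penalty variant might be implemented, the loop below accumulates diagonal Fisher estimates across tasks; `train` (gradient-based minimization of a loss) and `fisher` (a diagonal Fisher estimate at a parameter value) are hypothetical helpers, not part of any published implementation.

```python
import numpy as np

def train_sequentially(task_losses, train, fisher, lam=1.0):
    """Single-penalty EWC: maintain one quadratic penalty whose
    center is the latest optimum and whose weights are the running
    sum of diagonal Fishers from all previous tasks."""
    F_acc, center = None, None
    for task_loss in task_losses:
        if F_acc is None:
            loss = task_loss
        else:
            # Capture the current accumulated penalty by value.
            loss = lambda th, L=task_loss, F=F_acc, c=center: (
                L(th) + 0.5 * lam * np.sum(F * (th - c) ** 2))
        theta_star = train(loss)
        F_new = fisher(theta_star)
        F_acc = F_new if F_acc is None else F_acc + F_new
        center = theta_star
    return center, F_acc
```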

If the agent has to revisit data from past training episodes, multiple penalties should be maintained, similarly to expectation propagation (4, 5). The aggregate penalty should then be centered at $\theta_T^*$ rather than $\mu_T$. The anchor point for task $T$'s penalty should therefore be

$$\tilde{\theta}_T = F_T^{-1}\left((F_A + F_B + \dots + F_T)\,\theta_T^* - F_A \tilde{\theta}_A - \dots - F_S \tilde{\theta}_S\right)$$

rather than $\theta_T^*$, where tasks $A, \dots, S$ precede task $T$ and $\tilde{\theta}_A, \dots, \tilde{\theta}_S$ are the respective penalty centers for those tasks.
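Under the same diagonal-Fisher representation, the corrected anchor is a direct transcription of the formula above (function and argument names are illustrative):

```python
import numpy as np

def corrected_anchor(prev_F, prev_anchors, F_T, theta_T_star):
    """Anchor for task T's penalty so that the *sum* of all per-task
    penalties is centered at theta_T*, the latest optimum. prev_F and
    prev_anchors hold the Fishers and corrected centers for the
    preceding tasks A..S."""
    F_total = F_T + sum(prev_F)
    weighted_prev = sum(F * a for F, a in zip(prev_F, prev_anchors))
    return (F_total * theta_T_star - weighted_prev) / F_T
```

Note the edge case: with no preceding tasks the formula reduces to $\tilde{\theta}_T = \theta_T^*$, matching the two-task derivation.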

To illustrate the behavior and effect of the different penalties, I applied EWC to a sequence of linear regression tasks. The tasks define quadratic losses, each with a diagonal Hessian (Fig. 1, Left). Since all simplifying assumptions of EWC hold exactly, one should expect it to match exact Bayesian inference: reach the global minimum of the combined loss, and do so irrespective of the order in which the tasks were presented. Fig. 1, Center shows that although a quadratic penalty could model task B perfectly, the penalty is placed around the wrong anchor point. Table 1 shows that this leads to suboptimal performance and an unwanted sensitivity to task ordering. By contrast, EWC with the corrected penalties around $\tilde{\theta}_T$ models the losses perfectly (Fig. 1, Right) and achieves optimal performance in an order-agnostic fashion (Table 1).
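The experiment is straightforward to reproduce in a few lines. The sketch below uses randomly generated diagonal quadratic losses rather than the exact tasks behind Fig. 1 and Table 1, so the numbers differ, but it exhibits the same qualitative gap: from the third task onward the original multi-penalty anchors miss the combined optimum, while the corrected single penalty recovers it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2  # parameter dimension

# Three quadratic task losses 0.5*(theta-m)^T diag(h) (theta-m),
# with randomly chosen diagonal Hessians h and minima m.
h = rng.uniform(0.5, 2.0, size=(3, D))
m = rng.normal(size=(3, D))

def argmin_quadratics(hs, ms):
    """Minimizer of a sum of diagonal quadratics (elementwise)."""
    hs, ms = np.asarray(hs), np.asarray(ms)
    return (hs * ms).sum(0) / hs.sum(0)

exact = argmin_quadratics(h, m)  # global minimum of the combined loss

# Sequential training. For these Gaussian-likelihood quadratics the
# Fisher equals the Hessian, so "training on task T with penalties"
# is itself a diagonal quadratic minimization in closed form.
tA = m[0]                                          # task A optimum
tB = argmin_quadratics([h[1], h[0]], [m[1], tA])   # task B + penalty on A

# Original EWC at task C: separate penalties around tA and tB.
tC_orig = argmin_quadratics([h[2], h[0], h[1]], [m[2], tA, tB])

# Corrected EWC: one penalty with weights h_A + h_B, centered at tB.
tC_corr = argmin_quadratics([h[2], h[0] + h[1]], [m[2], tB])

print("exact    :", exact)
print("original :", tC_orig)   # deviates from the exact optimum
print("corrected:", tC_corr)   # matches the exact optimum
```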

Fig. 1.

(Left) Elliptical level sets of the quadratic loss functions for tasks A, B, and C, also used in Table 1. (Center) When learning task C via EWC, the losses for tasks A and B are replaced by quadratic penalties around $\theta_A^*$ and $\theta_B^*$. (Right) The losses are approximated perfectly by the correct quadratic penalties around $\tilde{\theta}_A = \theta_A^*$ and $\tilde{\theta}_B$.

Table 1.

Final performance of a model trained via different versions of EWC on a sequence of three linear regression tasks

I expect the negative impact of using incorrect penalties to be negligible until the network’s capacity begins to saturate. Furthermore, optimizing λ can compensate for the effects to some degree. As a result, it is possible that using incorrect penalties would result in no observable performance degradation in practice. However, since the correct penalties are just as easy to compute, and are clearly superior in some cases, I see no reason against adopting them.

Footnotes

  • Email: fhuszar@twitter.com.
  • Author contributions: F.H. designed research, performed research, and wrote the paper.

  • The author declares no conflict of interest.

Published under the PNAS license.

References

  1. Kirkpatrick J, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA 114:3521–3526.
  2. Huszár F (2017) On quadratic penalties in elastic weight consolidation. arXiv:1712.03847.
  3. Opper M (1998) A Bayesian approach to on-line learning. On-Line Learning in Neural Networks, ed Saad D (Cambridge Univ Press, Cambridge, UK), pp 363–378.
  4. Minka TP (2001) Expectation propagation for approximate Bayesian inference. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, eds Breese J, Koller D (Assoc for Computing Machinery, New York), pp 362–369.
  5. Eskin E, Smola AJ, Vishwanathan S (2004) Laplace propagation. Advances in Neural Information Processing Systems 16, eds Thrun S, Saul LK, Schölkopf B (MIT Press, Cambridge, MA), pp 441–448.
Article Classifications

  • Biological Sciences
  • Neuroscience
  • Physical Sciences
  • Applied Mathematics

This Letter has a Reply and related content. Please see:

  • Overcoming catastrophic forgetting in neural networks - March 14, 2017
  • Relationship between Letter and Reply - February 20, 2018