Letter

Note on the quadratic penalties in elastic weight consolidation

Ferenc Huszár
PNAS March 13, 2018 115 (11) E2496-E2497; first published February 20, 2018; https://doi.org/10.1073/pnas.1717042115
Twitter, London W1B 5AG, United Kingdom

This Letter has a Reply and related content. Please see:

  • Overcoming catastrophic forgetting in neural networks - March 14, 2017
  • Reply to Huszár: The elastic weight consolidation penalty is empirically valid - February 20, 2018

Catastrophic forgetting is an undesired phenomenon which occurs when neural networks are trained on different tasks sequentially. Elastic weight consolidation (EWC; ref. 1), published in PNAS, is a novel algorithm designed to safeguard against this. Despite its satisfying simplicity, EWC is remarkably effective.

Motivated by Bayesian inference, EWC adds quadratic penalties to the loss function when learning a new task. The purpose of penalties is to approximate the loss surface from previous tasks. The authors derive the penalty for the two-task case and then extrapolate to handling multiple tasks. I believe, however, that the penalties for multiple tasks are applied inconsistently.

In ref. 1 a separate penalty is maintained for each task $T$, centered at $\theta_T^*$, the value of $\theta$ obtained after training on task $T$. When these penalties are combined (assuming $\lambda_T = 1$), the aggregate penalty is anchored at

$$\mu_T = (F_A + F_B + \dots + F_T)^{-1}\,(F_A\theta_A^* + F_B\theta_B^* + \dots + F_T\theta_T^*).$$

From the third task onward this is inconsistent with Bayesian inference. In the Bayesian paradigm the posterior $p(\theta \mid D_A, D_B)$ encapsulates the agent's experience in both tasks A and B, thus rendering the previous posterior $p(\theta \mid D_A)$ irrelevant. Analogously, as $\theta_B^*$ was obtained while incorporating the penalty around $\theta_A^*$, once we have $\theta_B^*$, $\theta_A^*$ is not needed anymore.
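
The expression for the aggregate anchor follows from completing the square. The short derivation below is a sketch, assuming each Fisher matrix is symmetric (it is diagonal in EWC) and that all penalty weights equal 1; the shorthand $F_{\Sigma}$ for the summed Fisher matrix is introduced here and does not appear in ref. 1.

$$
\begin{aligned}
\sum_{t=A}^{T} \tfrac{1}{2}(\theta - \theta_t^*)^\top F_t\,(\theta - \theta_t^*)
&= \tfrac{1}{2}\,\theta^\top F_{\Sigma}\,\theta \;-\; \theta^\top \sum_{t=A}^{T} F_t\theta_t^* \;+\; \text{const} \\
&= \tfrac{1}{2}(\theta - \mu_T)^\top F_{\Sigma}\,(\theta - \mu_T) \;+\; \text{const},
\end{aligned}
\qquad
F_{\Sigma} = \sum_{t=A}^{T} F_t,
\quad
\mu_T = F_{\Sigma}^{-1}\sum_{t=A}^{T} F_t\theta_t^*.
$$

Because the optimum after task B was itself found under the penalty around the optimum after task A, the information from task A enters this weighted average more than once from the third task onward.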

The correct form of penalties can be obtained by recursive application of the two-task derivation (2). It turns out that a single penalty is sufficient; its center is the latest optimum $\theta_T^*$ and its weights are given by the sum of diagonal Fisher information matrices from previous tasks, $F_A + F_B + \dots + F_T$. This single-penalty version is akin to Bayesian online learning (3).
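
The single-penalty recursion can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the letter: the three tasks in `TASKS` are hypothetical, each defined by a diagonal curvature `h` and a minimizer `m` so that the task loss is `0.5 * sum(h * (theta - m)**2)` and the Fisher information equals `h` exactly. Under those assumptions each training step has a closed-form solution.

```python
import itertools
import numpy as np

# Hypothetical tasks (not the ones in Table 1): name -> (diagonal curvature h, minimizer m).
# Each task loss is 0.5 * sum(h * (theta - m)**2), so the diagonal Fisher equals h.
TASKS = {
    "A": ([4.0, 1.0], [1.0, 0.0]),
    "B": ([1.0, 4.0], [0.0, 1.0]),
    "C": ([2.0, 2.0], [-1.0, -1.0]),
}

def joint_optimum(names):
    """Minimizer of the summed losses of the named tasks (curvature-weighted mean)."""
    hs = np.array([TASKS[n][0] for n in names], dtype=float)
    ms = np.array([TASKS[n][1] for n in names], dtype=float)
    return (hs * ms).sum(axis=0) / hs.sum(axis=0)

def single_penalty_ewc(order):
    """Train sequentially, keeping one penalty: center = latest optimum, weight = summed Fisher."""
    dim = len(TASKS[order[0]][0])
    fisher_sum = np.zeros(dim)   # F_A + ... + F_S for the tasks seen so far
    theta = np.zeros(dim)        # latest optimum; unused while fisher_sum is zero
    for name in order:
        h, m = (np.asarray(x, dtype=float) for x in TASKS[name])
        # Closed-form minimizer of 0.5*h*(t - m)^2 + 0.5*fisher_sum*(t - theta)^2 per coordinate.
        theta = (h * m + fisher_sum * theta) / (h + fisher_sum)
        fisher_sum = fisher_sum + h
    return theta

# Final parameters for every ordering of the three tasks.
results = {"".join(p): single_penalty_ewc(p) for p in itertools.permutations("ABC")}
```

In this toy setting every entry of `results` coincides with `joint_optimum("ABC")`, which is the order-independent behavior the recursive derivation predicts when the quadratic approximation is exact.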

If the agent has to revisit data from past training episodes, multiple penalties should be maintained similarly to expectation propagation (4, 5). The aggregate penalty should be centered at $\theta_T^*$ rather than $\mu_T$. The anchor point for task $T$'s penalty should therefore be

$$\tilde\theta_T = F_T^{-1}\big((F_A + F_B + \dots + F_T)\,\theta_T^* - F_A\tilde\theta_A - \dots - F_S\tilde\theta_S\big)$$

rather than $\theta_T^*$, where tasks $A \dots S$ precede task $T$, and $\tilde\theta_A \dots \tilde\theta_S$ are the respective penalty centers for these tasks.
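
The corrected anchor points can be computed task by task. The helper below is a sketch under the assumption of diagonal, strictly positive Fisher information stored as vectors; the function name `corrected_anchors` and the example numbers are hypothetical.

```python
import numpy as np

def corrected_anchors(fishers, optima):
    """Return the penalty centers, one row per task.

    fishers: array (n_tasks, dim) of diagonal Fisher entries, in training order.
    optima:  array (n_tasks, dim) of the optima found after training on each task.
    """
    fishers = np.asarray(fishers, dtype=float)
    optima = np.asarray(optima, dtype=float)
    anchors = np.zeros_like(optima)
    fisher_sum = np.zeros(fishers.shape[1])      # F_A + ... + F_T
    weighted_prev = np.zeros(fishers.shape[1])   # F_A*anchor_A + ... + F_S*anchor_S
    for t in range(len(fishers)):
        fisher_sum = fisher_sum + fishers[t]
        # Anchor chosen so that the aggregate penalty is centered at optima[t].
        anchors[t] = (fisher_sum * optima[t] - weighted_prev) / fishers[t]
        weighted_prev = weighted_prev + fishers[t] * anchors[t]
    return anchors

# Two hypothetical quadratic tasks: A has minimizer (1, 0), B has minimizer (0, 1).
fishers = [[4.0, 1.0], [1.0, 4.0]]
optima = [[1.0, 0.0], [0.8, 0.8]]   # optimum after A, then after B with A's penalty
anchors = corrected_anchors(fishers, optima)
```

For these two tasks the anchors come out as the tasks' own minimizers, (1, 0) and (0, 1), so each penalty reproduces its task loss exactly, while the Fisher-weighted mean of the anchors equals the latest optimum (0.8, 0.8).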

To illustrate the behavior and effect of different penalties I applied EWC to a sequence of linear regression tasks. The tasks define quadratic losses each with a diagonal Hessian (Fig. 1, Left). As all simplifying assumptions of EWC hold exactly, one should expect it to match exact Bayesian inference: Reach the global minimum of the combined loss and do so irrespective of the order in which tasks were presented. Fig. 1, Center shows that although a quadratic penalty could model task B perfectly, the penalty is placed around the wrong anchor point. Table 1 shows that this leads to suboptimal performance and an unwanted sensitivity to task ordering. By contrast, EWC using the corrected penalties around $\tilde\theta_T$ models the losses perfectly (Fig. 1, Right) and achieves optimal performance in an order-agnostic fashion (Table 1).
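
A toy version of this comparison is easy to reproduce. The sketch below does not use the tasks or numbers behind Table 1; the three quadratic tasks in `TASKS` are hypothetical, and the penalty weight is fixed at 1. It applies the original scheme, with one penalty per past task centered at that task's optimum, and measures the combined loss against the global minimum.

```python
import itertools
import numpy as np

# Hypothetical tasks: name -> (diagonal curvature h, minimizer m); loss = 0.5*sum(h*(theta-m)**2).
TASKS = {
    "A": ([4.0, 1.0], [1.0, 0.0]),
    "B": ([1.0, 4.0], [0.0, 1.0]),
    "C": ([2.0, 2.0], [-1.0, -1.0]),
}

def total_loss(theta):
    """Combined loss of all three tasks at theta."""
    return sum(0.5 * np.sum(np.asarray(h) * (theta - np.asarray(m)) ** 2)
               for h, m in TASKS.values())

def global_minimum():
    """Curvature-weighted mean of the task minimizers."""
    hs = np.array([h for h, _ in TASKS.values()])
    ms = np.array([m for _, m in TASKS.values()])
    return (hs * ms).sum(axis=0) / hs.sum(axis=0)

def original_ewc(order):
    """One penalty per past task, each centered at the optimum found after that task."""
    pen_weights, pen_centers = [], []
    theta = None
    for name in order:
        h, m = (np.asarray(x, dtype=float) for x in TASKS[name])
        # Closed-form minimizer of the task loss plus all stored quadratic penalties.
        numerator = h * m + sum(f * c for f, c in zip(pen_weights, pen_centers))
        denominator = h + sum(pen_weights)
        theta = numerator / denominator
        pen_weights.append(h)
        pen_centers.append(theta)
    return theta

# Excess combined loss over the global minimum, for every task ordering.
best = total_loss(global_minimum())
excess = {"".join(p): total_loss(original_ewc(p)) - best
          for p in itertools.permutations("ABC")}
```

In this setting every ordering ends with a strictly positive excess loss, and different orderings end at different parameter values, which mirrors the suboptimality and order sensitivity described above.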

Fig. 1.

(Left) Elliptical level sets of quadratic loss functions for tasks A, B, and C also used in Table 1. (Center) When learning task C via EWC, losses for tasks A and B are replaced by quadratic penalties around $\theta_A^*$ and $\theta_B^*$. (Right) Losses are approximated perfectly by the correct quadratic penalties around $\tilde\theta_A = \theta_A^*$ and $\tilde\theta_B$.

Table 1.

Final performance of a model trained via different versions of EWC on a sequence of three linear regression tasks

I expect the negative impact of using incorrect penalties to be negligible until the network’s capacity begins to saturate. Furthermore, optimizing λ can compensate for the effects to some degree. As a result, it is possible that using incorrect penalties would result in no observable performance degradation in practice. However, since the correct penalties are just as easy to compute, and are clearly superior in some cases, I see no reason against adopting them.

Footnotes

  • Email: fhuszar@twitter.com.
  • Author contributions: F.H. designed research, performed research, and wrote the paper.

  • The author declares no conflict of interest.

Published under the PNAS license.


References

  1. Kirkpatrick J, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA 114:3521–3526.
  2. Huszár F (2017) On quadratic penalties in elastic weight consolidation. arXiv:1712.03847.
  3. Opper M (1998) A Bayesian approach to on-line learning. On-Line Learning in Neural Networks, ed Saad D (Cambridge Univ Press, Cambridge, UK), pp 363–378.
  4. Minka TP (2001) Expectation propagation for approximate Bayesian inference. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, eds Breese J, Koller D (Assoc for Computing Machinery, New York), pp 362–369.
  5. Eskin E, Smola AJ, Vishwanathan S (2004) Laplace propagation. Advances in Neural Information Processing Systems 16, eds Thrun S, Saul LK, Schölkopf B (MIT Press, Cambridge, MA), pp 441–448.