Note on the quadratic penalties in elastic weight consolidation
This Letter has a Reply and related content. Please see:
- Overcoming catastrophic forgetting in neural networks - March 14, 2017
- Reply to Huszár: The elastic weight consolidation penalty is empirically valid - February 20, 2018

Catastrophic forgetting is an undesired phenomenon which occurs when neural networks are trained on different tasks sequentially. Elastic weight consolidation (EWC; ref. 1), published in PNAS, is a novel algorithm designed to safeguard against this. Despite its satisfying simplicity, EWC is remarkably effective.
Motivated by Bayesian inference, EWC adds quadratic penalties to the loss function when learning a new task. The purpose of the penalties is to approximate the loss surfaces of previous tasks. The authors derive the penalty for the two-task case and then extrapolate to the case of multiple tasks. I believe, however, that the penalties for multiple tasks are applied inconsistently.
In ref. 1 a separate penalty is maintained for each previous task t, centered at the parameters θ*_t found at the end of training on that task, with precision given by the (diagonal) Fisher information F_t evaluated there. When learning the T-th task, the regularized loss is

\[ \tilde{\ell}_T(\theta) = \ell_T(\theta) + \frac{\lambda}{2} \sum_{t<T} (\theta - \theta^{*}_t)^{\top} F_t\, (\theta - \theta^{*}_t). \]
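For concreteness, a minimal NumPy sketch of this scheme (illustrative only, not code from ref. 1; the variable names are hypothetical) keeps one (anchor, Fisher) pair per completed task and sums the corresponding quadratic penalties when training on the next task:

```python
import numpy as np

# EWC as published: one (anchor, Fisher) pair per completed task.
anchors = []   # theta*_t, parameters at the end of training on task t
fishers = []   # diagonal Fisher information evaluated at theta*_t

def finish_task(theta_star, fisher_diag):
    """Store the optimum and Fisher diagonal of a task that has just been learned."""
    anchors.append(np.asarray(theta_star, dtype=float).copy())
    fishers.append(np.asarray(fisher_diag, dtype=float).copy())

def ewc_penalty_published(theta, lam=1.0):
    """Sum of per-task quadratic penalties, each centered at its own anchor."""
    theta = np.asarray(theta, dtype=float)
    return sum(0.5 * lam * np.sum(F * (theta - a) ** 2)
               for a, F in zip(anchors, fishers))
```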
The correct form of the penalties can be obtained by recursive application of the two-task derivation (2). It turns out that a single penalty is sufficient; its center is the latest optimum θ*_{T-1}, and its precision is the accumulated Fisher information of all previous tasks:

\[ \tilde{\ell}_T(\theta) = \ell_T(\theta) + \frac{\lambda}{2}\, (\theta - \theta^{*}_{T-1})^{\top} \Big(\sum_{t<T} F_t\Big) (\theta - \theta^{*}_{T-1}). \]
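The recursion can be made explicit as follows (a sketch, with λ and the prior precision omitted for clarity): the posterior after the first T-1 tasks factorizes as the likelihood of the latest task times the posterior after the earlier ones, and each Laplace approximation is taken around the latest mode,

\[
\begin{aligned}
\log p(\theta \mid \mathcal{D}_{1:T-1})
  &= \log p(\mathcal{D}_{T-1} \mid \theta) + \log p(\theta \mid \mathcal{D}_{1:T-2}) + \mathrm{const} \\
  &\approx -\ell_{T-1}(\theta) - \tfrac{1}{2}\,(\theta - \theta^{*}_{T-2})^{\top} \Big(\sum_{t<T-1} F_t\Big) (\theta - \theta^{*}_{T-2}) + \mathrm{const} \\
  &\approx -\tfrac{1}{2}\,(\theta - \theta^{*}_{T-1})^{\top} \Big(\sum_{t<T} F_t\Big) (\theta - \theta^{*}_{T-1}) + \mathrm{const},
\end{aligned}
\]

where the last step approximates the Hessian of ℓ_{T-1} at the new mode θ*_{T-1} by the Fisher information F_{T-1} and adds it to the previous precision. The center moves to the latest optimum while the precisions accumulate.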
If the agent has to revisit data from past training episodes, multiple penalties should be maintained, similarly to expectation propagation (4, 5). The aggregate penalty should then be centered not at any single θ*_t but at the precision-weighted average of the maintained centers,

\[ \bar{\theta} = \Big(\sum_{t} F_t\Big)^{-1} \sum_{t} F_t\, \theta^{*}_t, \]

where the sum runs over the maintained per-task factors.
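Since a sum of quadratic penalties is itself a single quadratic, this center is forced; a short NumPy check (with made-up anchors and diagonal Fisher values) confirms that the aggregate penalty is minimized exactly at the precision-weighted average:

```python
import numpy as np

rng = np.random.default_rng(0)
anchors = [rng.normal(size=3) for _ in range(3)]              # theta*_t for three tasks
fishers = [rng.uniform(0.5, 2.0, size=3) for _ in range(3)]   # diagonal precisions F_t

# Center of the aggregate penalty: precision-weighted average of the anchors.
total_precision = np.sum(fishers, axis=0)
weighted_center = np.sum([F * a for F, a in zip(fishers, anchors)], axis=0) / total_precision

# The gradient of sum_t 0.5 (theta - a_t)^T diag(F_t) (theta - a_t) vanishes there.
grad = np.sum([F * (weighted_center - a) for F, a in zip(fishers, anchors)], axis=0)
print(np.allclose(grad, 0.0))  # True
```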
To illustrate the behavior and effect of the different penalties, I applied EWC to a sequence of linear regression tasks. The tasks define quadratic losses, each with a diagonal Hessian (Fig. 1, Left). As all simplifying assumptions of EWC hold exactly, one should expect it to match exact Bayesian inference: reach the global minimum of the combined loss, and do so irrespective of the order in which the tasks are presented. Fig. 1, Center shows that although a quadratic penalty could model task B perfectly, the penalty is placed around the wrong anchor point. Table 1 shows that this leads to suboptimal performance and an unwanted sensitivity to task ordering. By contrast, EWC using the corrected penalty, centered around the latest optimum, reaches the global minimum of the combined loss irrespective of task ordering.
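Because every loss and penalty in this setup is a diagonal quadratic, each training step has a closed form. The following sketch (my reconstruction of such an experiment with randomly generated task parameters and λ = 1, not the exact code behind Fig. 1 and Table 1) contrasts the two penalty schemes on three tasks:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 2

# Three quadratic task losses l_t(theta) = 0.5 (theta - m_t)^T diag(h_t) (theta - m_t),
# i.e. linear-regression-style losses with diagonal Hessians (randomly generated).
means    = [rng.normal(size=dim) for _ in range(3)]              # task optima m_A, m_B, m_C
hessians = [rng.uniform(0.5, 3.0, size=dim) for _ in range(3)]   # diagonal Hessians h_A, h_B, h_C

def argmin_quadratic(centers, precisions):
    """Closed-form minimizer of a sum of diagonal quadratics."""
    centers, precisions = np.array(centers), np.array(precisions)
    return (precisions * centers).sum(axis=0) / precisions.sum(axis=0)

def combined_loss(theta):
    return sum(0.5 * np.sum(h * (theta - m) ** 2) for m, h in zip(means, hessians))

# Exact joint solution: what EWC should match when all of its assumptions hold.
theta_joint = argmin_quadratic(means, hessians)

# Sequential training: task A, then task B with a quadratic penalty around theta*_A.
theta_A = means[0]
theta_B = argmin_quadratic([means[1], theta_A], [hessians[1], hessians[0]])

# EWC as published: separate penalties around theta*_A and theta*_B when learning C.
theta_C_published = argmin_quadratic([means[2], theta_A, theta_B],
                                     [hessians[2], hessians[0], hessians[1]])

# Corrected EWC: a single penalty around the latest optimum theta*_B,
# with precision h_A + h_B.
theta_C_corrected = argmin_quadratic([means[2], theta_B],
                                     [hessians[2], hessians[0] + hessians[1]])

print("combined loss, joint    :", combined_loss(theta_joint))
print("combined loss, published:", combined_loss(theta_C_published))
print("combined loss, corrected:", combined_loss(theta_C_corrected))
print(np.allclose(theta_C_corrected, theta_joint))  # True for exactly quadratic tasks
```

With exactly quadratic losses the corrected single penalty recovers the joint optimum for any task ordering, whereas the published variant anchors task B's penalty at θ*_B, a point that already reflects task A, and so in effect counts task A twice, shifting the final solution.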
Fig. 1. (Left) Elliptical level sets of the quadratic loss functions for tasks A, B, and C, also used in Table 1. (Center) When learning task C via EWC, the losses for tasks A and B are replaced by quadratic penalties centered around the respective task optima θ*_A and θ*_B.
Table 1. Final performance of a model trained via different versions of EWC on a sequence of three linear regression tasks
I expect the negative impact of using incorrect penalties to be negligible until the network’s capacity begins to saturate. Furthermore, optimizing λ can compensate for the effects to some degree. As a result, it is possible that using incorrect penalties would result in no observable performance degradation in practice. However, since the correct penalties are just as easy to compute, and are clearly superior in some cases, I see no reason against adopting them.
Footnotes
- 1 Email: fhuszar@twitter.com.
Author contributions: F.H. designed research, performed research, and wrote the paper.
The author declares no conflict of interest.
Published under the PNAS license.
References
- 1. Kirkpatrick J, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA 114:3521–3526.
- 2. Huszár F (2017) On quadratic penalties in elastic weight consolidation. arXiv:1712.03847.
- 3. Opper M (1998) A Bayesian approach to on-line learning. On-Line Learning in Neural Networks, ed Saad D (Cambridge Univ Press, Cambridge, UK).
- 4. Minka TP (2001) Expectation propagation for approximate Bayesian inference. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, eds Breese J, Koller D (Morgan Kaufmann, San Francisco, CA).
- 5. Eskin E, Smola AJ, Vishwanathan SVN (2004) Laplace propagation. Advances in Neural Information Processing Systems 16, eds Thrun S, Saul LK, Schölkopf B (MIT Press, Cambridge, MA).