# On instabilities of deep learning in image reconstruction and the potential costs of AI

^{a}Department of Mathematics, University of Oslo, 0316 Oslo, Norway; ^{b}Instituto de Telecomunicações, Faculdade de Ciências, Universidade do Porto, Porto 4169-007, Portugal; ^{c}Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, United Kingdom; ^{d}Department of Mathematics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada; ^{e}Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge CB3 0WA, United Kingdom


Edited by David L. Donoho, Stanford University, Stanford, CA, and approved March 12, 2020 (received for review June 4, 2019)

## Abstract

Deep learning, due to its unprecedented success in tasks such as image classification, has emerged as a new tool in image reconstruction with potential to change the field. In this paper, we demonstrate a crucial phenomenon: Deep learning typically yields unstable methods for image reconstruction. The instabilities usually occur in several forms: 1) Certain tiny, almost undetectable perturbations, both in the image and sampling domain, may result in severe artefacts in the reconstruction; 2) a small structural change, for example, a tumor, may not be captured in the reconstructed image; and 3) (a counterintuitive type of instability) more samples may yield poorer performance. Our stability test with algorithms and easy-to-use software detects the instability phenomena. The test is aimed at researchers, to test their networks for instabilities, and for government agencies, such as the Food and Drug Administration (FDA), to secure safe use of deep learning methods.

There are two paradigm changes currently happening: 1) Artificial intelligence (AI) is replacing humans in problem solving; however, 2) AI is also replacing the standard algorithms in computational science and engineering. Since reliable numerical calculations are paramount, algorithms for computational science are traditionally based on two pillars: accuracy and stability. This is, in particular, true of image reconstruction, which is a mainstay of computational science, providing fundamental tools in medical, scientific, and industrial imaging. This paper demonstrates that the stability pillar is typically absent in current deep learning and AI-based algorithms for image reconstruction. This raises two fundamental questions: How reliable are such algorithms when applied in the sciences, and do AI-based algorithms have an unavoidable Achilles heel: instability? This paper introduces a comprehensive testing framework designed to demonstrate, investigate, and, ultimately, answer these foundational questions.

The importance of stable and accurate methods for image reconstruction for inverse problems is hard to overestimate. These techniques form the foundation for essential tools across the physical and life sciences such as MRI, computerized tomography (CT), fluorescence microscopy, electron tomography, NMR, radio interferometry, lensless cameras, etc. Moreover, stability is traditionally considered a necessity in order to secure reliable and trustworthy methods used in, for example, cancer diagnosis. Hence, there is an extensive literature on designing stable methods for image reconstruction in inverse problems (1–4).

AI techniques such as deep learning and neural networks (5) have provided a new paradigm with new techniques in inverse problems (6–15) that may change the field. In particular, the reconstruction algorithms learn how to best do the reconstruction based on training from previous data, and, through this training procedure, aim to optimize the quality of the reconstruction. This is a radical change from the current state of the art (SoA) from an engineering, physical, and mathematical point of view.

AI and deep learning have already changed the field of computer vision and image classification (16–19), where the performance is now referred to as superhuman (20). However, the success comes at a price. Indeed, the methods are highly unstable. It is now well established (21–25) that high-performance deep learning methods for image classification are subject to failure given tiny, almost invisible perturbations of the image. An image of a cat may be classified correctly; however, a tiny change, invisible to the human eye, may cause the algorithm to change its classification label from cat to fire truck, or another label far from the original.

In this paper, we establish the instability phenomenon of deep learning in image reconstruction for inverse problems. A potentially surprising conclusion is that the phenomenon may be independent of the underlying mathematical model. For example, MRI is based on sampling the Fourier transform, whereas CT is based on sampling the Radon transform. These are rather different models, yet the instability phenomena happen for both sampling modalities when using deep learning.

There is, however, a big difference between the instabilities of deep learning for image classification and our results on instabilities of deep learning for image reconstruction. Firstly, in the former case, there is only one thing that could go wrong: A small perturbation results in a wrong classification. In image reconstruction, there are several potential forms of instabilities. In particular, we consider three crucial issues: 1) instabilities with respect to certain tiny perturbations, 2) instabilities with respect to small structural changes (for example a brain image with or without a small tumor), and 3) instabilities with respect to changes in the number of samples. Secondly, the two problems are totally unrelated. Indeed, the former problem is, in its simplest form, a decision problem, and hence the decision function (“Is there a cat in the image?”) to be approximated is necessarily discontinuous. However, the problem of reconstructing an image from Fourier coefficients, as is the problem in MRI, is completely different. In this case, there exist stable and accurate methods that depend continuously on the input. It is therefore paradoxical that deep learning leads to unstable methods for problems that can be solved accurately in a stable way (*SI Appendix*, *Methods*).
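The stability claim for Fourier-based reconstruction can be made concrete: with fully sampled data, the inverse discrete Fourier transform is unitary, so a perturbation of the measurements changes the reconstruction by exactly the same amount. Below is a minimal numerical illustration of this continuity (a 1D toy example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)              # "image" (a 1D signal for simplicity)
y = np.fft.fft(x, norm="ortho")           # fully sampled Fourier measurements
e = 1e-3 * rng.standard_normal(128)       # perturbation of the measurements

x_rec = np.fft.ifft(y + e, norm="ortho")  # reconstruct from perturbed data

# The orthonormal inverse FFT is unitary (Parseval), so the reconstruction
# error equals the measurement error exactly: a Lipschitz constant of 1.
err_image = np.linalg.norm(x_rec - x)
err_data = np.linalg.norm(e)
```

This continuity of the underlying map is what makes the instability of learned reconstructions paradoxical: the problem being approximated is perfectly well behaved.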

The networks we have tested are unstable either in the form of category 1 or 2 or both. Moreover, networks that are highly stable in one of the categories tend to be highly unstable in the other. The instability in form of category 3, however, occurs for some networks but not all. The findings raise two fundamental questions:

1) Does AI, as we know it, come at a cost? Is instability a necessary by-product of our current AI techniques?

2) Can reconstruction methods based on deep learning always be safely used in the physical and life sciences? Or, are there cases for which instabilities may lead to, for example, incorrect medical diagnosis if applied in medical imaging?

The scope of this paper is the second question, as the first concerns foundations, and our stability test provides the starting point for answering question 2. However, even if instabilities occur, this should not rule out the use of deep learning methods in inverse problems. In fact, one may be able to show, with large empirical statistical tests, that the artifacts caused by instabilities occur infrequently. As our test reveals, there is a myriad of different artifacts that may occur as a result of the instabilities, suggesting that vast efforts are needed to answer question 2. A detailed account is provided in *Conclusion*.

## The Instability Test

The instability test is based on the three instability issues mentioned above. We consider instabilities with respect to the following.

### Tiny Worst-Case Perturbations.

The tiny perturbation could be in the image domain or in the sampling domain. When considering medical imaging, a perturbation in the image domain could come from a slight movement of the patient, small anatomic differences between people, etc. The perturbation in the sampling domain may be caused by malfunctioning of the equipment or the inevitable noise dictated by the physical model of the scanning machine. Note that a perturbation in the image domain implies a perturbation in the sampling domain. Conversely, in many cases, the mathematical model of the sampling process implies that the sampling operator is surjective, and hence, for any perturbation in the sampling domain, there exists a corresponding perturbation in the image domain. Thus, a combination of all these factors may yield perturbations that, in a worst-case scenario, may be quite specific, hard to model, and hard to protect against, unless one has a completely stable neural network.

The instability test includes algorithms that do the following. Given an image and a neural network, designed for image reconstruction from samples provided by a specific sampling modality, the algorithm searches for a perturbation of the image that makes the most severe change in the output of the network while still keeping the perturbation small. In a simple mathematical form, this can be described as follows. Given an image x, a sampling operator A, and a network f that reconstructs x from the measurements y = Ax, the algorithm seeks a perturbation r with small norm ||r|| that makes the reconstruction error ||f(A(x + r)) − f(Ax)|| as large as possible; see *Methods* for details. However, the perturbation could, of course, be put on the measurement vector y instead.
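In code, such a search amounts to projected gradient ascent on the perturbation. The sketch below uses a toy linear sampling operator and a stand-in one-layer "network"; all names, sizes, and step-size choices are illustrative assumptions, and a real test would differentiate through the actual trained network rather than use finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 12, 32
A = rng.standard_normal((m, n)) / np.sqrt(m)  # toy sampling operator
W = rng.standard_normal((n, m)) * 0.1         # weights of a toy "network"

def f(y):
    # Stand-in for a trained reconstruction network: one nonlinear layer.
    return np.tanh(W @ y)

def worst_case_perturbation(x, eps=0.05, steps=60, lr=0.5):
    """Projected gradient ascent on r to maximize ||f(A(x+r)) - f(Ax)||^2
    subject to ||r|| <= eps, using finite-difference gradients."""
    base = f(A @ x)
    obj = lambda r: np.linalg.norm(f(A @ (x + r)) - base) ** 2
    r = 1e-3 * rng.standard_normal(n)         # tiny random start (grad at 0 is 0)
    h = 1e-6
    for _ in range(steps):
        o0 = obj(r)
        g = np.zeros(n)
        for i in range(n):                    # finite-difference gradient
            e = np.zeros(n); e[i] = h
            g[i] = (obj(r + e) - o0) / h
        r = r + lr * g                        # ascent step
        nrm = np.linalg.norm(r)
        if nrm > eps:
            r *= eps / nrm                    # project back onto the eps-ball
    return r

x = rng.standard_normal(n)
r_adv = worst_case_perturbation(x)
```

For a real network one would use automatic differentiation, and apply the perturbation to the measurement vector y = Ax when testing the sampling domain instead.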

### Small Structural Changes in the Image.

By structural change, we mean a change in the image domain that may not be tiny, and typically is significant and clearly visible, but is still small (for example, a small tumor). The purpose is to check whether the network can recover important details that are crucial in, for example, medical assessments. In particular, given the image x, we add a clearly visible structural detail to obtain a perturbed image, and compare the reconstruction produced by the network with that produced by a stable SoA method.
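As a concrete stand-in for such a test, one can stamp a small but clearly visible feature into a reference image before running both reconstructions. The cross shape and sizes below are arbitrary illustrations, not the details used in the paper:

```python
import numpy as np

def add_structural_detail(img, value=1.0):
    """Return a copy of `img` with a small cross stamped at the center,
    a stand-in for a small but clearly visible structural change
    (e.g., a tumor-like feature or inserted text)."""
    out = img.copy()
    r, c = img.shape[0] // 2, img.shape[1] // 2
    out[r - 3:r + 4, c] = value   # vertical bar of the cross
    out[r, c - 3:c + 4] = value   # horizontal bar of the cross
    return out
```

One then reconstructs both the original and the perturbed image, with the network and with the SoA benchmark, and checks whether the detail survives each pipeline.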

An important note is that, when testing stability, both with respect to tiny perturbations and with respect to small structural changes, the test is always done in comparison with a stable SoA method in order to check that any instabilities produced by the neural network are due to the network itself and not because of ill-conditioning of the inverse problem. The SoA methods used are based on compressed sensing and sparse regularization (26–28). These methods often come with mathematical stability guarantees (29), and are hence suitable as benchmarks (see *Methods* for details).
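The sparse-regularization benchmarks referenced here solve convex problems of the form min_x ½||Ax − y||² + λ||x||₁. A minimal solver of this type is iterative soft-thresholding (ISTA); the sketch below is a generic textbook version, an assumption for illustration, not the specific SoA implementations used in the paper:

```python
import numpy as np

def ista(A, y, lam, steps=3000):
    """Iterative soft-thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        z = x - A.T @ (A @ x - y) / L    # gradient step on the data term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrinkage
    return x
```

Solvers of this family come with recovery and stability guarantees under standard compressed-sensing assumptions, which is what makes them suitable reference points for the instability test.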

### Changing the Number of Samples in the Sampling Device (Such as the MRI or CT Scanner).

Typical SoA methods share a common quality: More samples imply better quality of the reconstruction. Given that deep learning neural networks in inverse problems are trained with a specific sampling pattern, the question is, How robust is the trained network with respect to changes in the sampling? The test checks whether the quality of the reconstruction deteriorates with more samples. This is a crucial question in applications. For example, the recent implementation of compressed sensing on Philips MRI machines allows the user to change the undersampling ratio for every scan. This means that, if a network is trained for one specific subsampling ratio, it may, in practice, be fed data acquired at a different ratio than the one it was trained for.
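For any linear baseline this monotonicity is automatic: with nested measurement sets, the min-norm least-squares reconstruction can only improve as rows are added. The toy check below (illustrative operator and signal, not the paper's setup) captures the property the test probes for in trained networks:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
A_full = rng.standard_normal((n, n))   # rows = all available measurements
x = rng.standard_normal(n)             # ground-truth "image"

def reconstruct(m):
    """Min-norm least-squares reconstruction from the first m measurements."""
    A = A_full[:m]
    return np.linalg.pinv(A) @ (A @ x)

# Nested measurement sets => reconstruction error is non-increasing in m.
errors = [np.linalg.norm(reconstruct(m) - x) for m in (16, 32, 48, 64)]
```

A trained network has no such guarantee: the test simply feeds it data with more samples than it was trained on and records whether the error curve still decreases.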

## Testing the Test

We test six deep learning neural networks selected based on their strong performance, wide range in architectures, and difference in sampling patterns and subsampling ratios, as well as their difference in training data. The specific details about the architecture and the training data of the tested networks can be found in *SI Appendix*.

An important note is that the tests performed are not designed to compare deep learning against SoA in terms of performance on specific images. The test is designed to detect the instability phenomenon. Hence, the comparison with SoA is only to verify that the instabilities are specific to the trained neural networks, and not due to an ill-conditioning of the problem itself. Moreover, as is clear from the images, in the unperturbed cases, the best performance varies between the neural networks and SoA. The list of networks is as follows.

AUTOMAP (6) is a neural network for low-resolution single-coil MRI with 60% subsampling. The training set consists of brain images with white noise added to the Fourier samples.

DAGAN (12) is a network for medium-resolution single-coil MRI with 20% subsampling, and is trained with a variety of brain images.

Deep MRI (11) is a neural network for medium-resolution single-coil MRI with 33% subsampling. It is trained with detailed cardiac MR images.

Ell 50 (9) is a network for CT or any Radon transform-based inverse problem. It is trained on images containing solely ellipses (hence the name Ell 50). The number 50 refers to the number of lines used in the sampling in the sinogram.

Med 50 (9) has exactly the same architecture as Ell 50 and is used for CT; however, it is trained with medical images (hence the name Med 50) from the Mayo Clinic database (13). The number of lines used in the sampling from the sinogram is 50.

MRI-VN (14) is a network for medium- to high-resolution parallel MRI with 15 coil elements and 15% subsampling. The training is done with a variety of knee images.

## Stability with Respect to Tiny Worst-Case Perturbations

Below follows the description of the test applied to some of the networks where we detect instabilities with respect to tiny perturbations.

For the Deep MRI test, we perturb the image x with a sequence of worst-case perturbations of increasing magnitude; the resulting network reconstructions are shown in Fig. 1, *Bottom*. Note that the perturbations are almost invisible to the human eye, as demonstrated in Fig. 1, *Top*. The reconstructions obtained by an SoA method from the same perturbed measurements are shown in Fig. 1, *Lower Middle*. Note also that the instabilities are actually stable: in Fig. 2, we demonstrate how a random Gaussian perturbation added to the worst-case perturbation still yields the same severe artifacts (see *SI Appendix*, *Methods*).

The AUTOMAP experiment is similar to the one above; however, in this case, we add the perturbations directly to the measurements. The perturbed images are shown in Fig. 3, *Top*, the corresponding network reconstructions in Fig. 3, *Middle*, and Fig. 3, *Bottom* contains the reconstructions done by an SoA method. Note that the worst-case perturbations are completely different from the ones failing the Deep MRI network. Hence, the artifacts are also completely different. These perturbations are white noise-like, and the reconstructions from the network provide a similar impression. As this is a standard artifact in MRI, it is, however, not clear how to protect against such potentially harmful tiny noise. Indeed, a detail may be washed out, as shown in the experiment (note the heart inserted with slightly different intensities in the brain image), but the similarity to a standard artifact may make it difficult to judge that this is an untrustworthy image.

In the case of MRI-VN, we add one worst-case perturbation to the image and show the resulting reconstructions in Fig. 4.

For Med 50, we add a worst-case perturbation to the image; the resulting reconstructions are also shown in Fig. 4.

## Stability with Respect to Small Structural Changes

Instabilities with respect to small structural changes are documented below.

The Ell 50 network provides a stark example of instability with respect to structural perturbations. Indeed, none of the details are visible in the reconstruction, as documented in Fig. 5, *Top*. This may not be entirely surprising, given that the network is trained on ellipses.

The DAGAN network is not as unstable as the Ell 50 network with respect to structural changes. However, as seen in Fig. 5, *Upper Middle*, the blurring of the structural details is substantial, and the instability is still critical.

MRI-VN is an example of a moderately unstable network when considering structural changes. Note, however, how the instability coincides with the lack of ability to reconstruct details in general. This is documented in Fig. 5, *Middle*.

Deep MRI demonstrates how stability with respect to small structural changes coincides with the ability to reconstruct details: the network is stable in this category, and the details in the image are well preserved, as seen in Fig. 5, *Lower Middle*. Here we have lowered the subsampling ratio, yet the structural details are still recovered.

## Stability with Respect to More Samples

Certain convolutional neural networks allow for the flexibility of changing the number of samples. In our test cases, all of the networks except AUTOMAP have this feature, and we report on the stability with respect to changes in the number of samples below and in Fig. 5, *Bottom*.

Ell 50 shows the strongest and most striking decay in performance as the subsampling ratio increases. Med 50 is similar, however with a less steep decline in reconstruction quality.

For DAGAN, the reconstruction quality deteriorates with more samples, similar to the Ell 50/Med 50 networks.

The MRI-VN network provides reconstructions where the quality stagnates with more samples, as opposed to the decay in performance witnessed in the other cases.

The Deep MRI network is the only one that behaves in a way aligned with standard SoA methods and provides better reconstructions when more samples are added.

## Conclusion

The new paradigm of learning the reconstruction algorithm for image reconstruction in inverse problems, through deep learning, typically yields unstable methods. Moreover, our test reveals numerous instability phenomena, challenges, and research directions. In particular, we find the following:

1) Certain tiny perturbations lead to a myriad of different artifacts. Different networks yield different artifacts and instabilities, and, as Figs. 1, 3, and 4 reveal, there is no common denominator. Moreover, the artifacts may be difficult to detect as nonphysical. Thus, several key questions emerge: Given a trained neural network, which types of artifacts may the network produce? How is the instability related to the network architecture, training set, and also subsampling patterns?

2) There is great variety in the failure to recover structural changes. The instabilities with respect to structural changes, as demonstrated in Fig. 4, range from complete removal of details to more subtle distortions and blurring of the features. How is this related to the network architecture and training set? Moreover, does the subsampling pattern play a role? It is important, however, to observe (as in Fig. 5, *Lower Middle*, and the first column of Fig. 3) that there are networks that are perfectly stable with respect to structural changes, even when the training set does not contain any images with such details.

3) Networks must be retrained on any change of subsampling pattern. The fact that more samples may cause the quality of reconstruction to either deteriorate or stagnate means that each network has to be retrained for every specific subsampling pattern, subsampling ratio, and image dimension used. Hence, one may, in practice, need hundreds of different networks to facilitate the many different combinations of dimensions, subsampling ratios, and sampling patterns.

4) Instabilities are not necessarily rare events. A key question regarding instabilities with respect to tiny perturbations is whether they may occur in practice. The example in Fig. 2 suggests that there is a ball around a worst-case perturbation in which the severe artifacts are always witnessed. This suggests that the set of "bad" perturbations has Lebesgue measure greater than zero, and, thus, there will typically be a nonzero probability of a "bad" perturbation. Estimating this probability may be highly nontrivial, as the perturbation will typically be the sum of two random variables, where one variable comes from generic noise and one highly nongeneric variable is due to patient movements, anatomic differences, apparatus malfunctions, etc. These predictions can also be theoretically verified, as discussed in *SI Appendix*, *Methods*.

5) The instability phenomenon is not easy to remedy. We deliberately chose quite different networks in this paper to highlight the seeming ubiquity of the instability phenomenon. Theoretical insights [see *SI Appendix*, *Methods*, on the next generation of methods (30–34)] also support the conclusion that this phenomenon is nontrivial to overcome. Finding effective remedies is an extremely important future challenge.
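Point 4 can be probed numerically: around a worst-case direction, Gaussian jitter should leave the output deviation large, while the same jitter around zero does little. The sketch below does this for a toy one-layer map, using the top singular vector of the local Jacobian as the "bad" direction; everything here is an illustrative assumption, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 12, 32
A = rng.standard_normal((m, n)) / np.sqrt(m)   # toy sampling operator
W = rng.standard_normal((n, m)) * 0.1

def f(y):
    return np.tanh(W @ y)                      # toy stand-in for a network

x = rng.standard_normal(n)
base = f(A @ x)

# "Bad" direction: top right-singular vector of the Jacobian of f(A(.)) at x.
u = W @ (A @ x)
J = (1.0 - np.tanh(u) ** 2)[:, None] * (W @ A)
r = 0.05 * np.linalg.svd(J)[2][0]
dev_bad = np.linalg.norm(f(A @ (x + r)) - base)

# Monte Carlo: jitter around r keeps the artifact; jitter around 0 does not.
sigma = 0.001
devs_near = [np.linalg.norm(f(A @ (x + r + sigma * rng.standard_normal(n))) - base)
             for _ in range(200)]
devs_zero = [np.linalg.norm(f(A @ (x + sigma * rng.standard_normal(n))) - base)
             for _ in range(200)]
```

If the smallest deviation near r stays well above the largest deviation near zero, the bad set contains a whole ball, which is the measure-theoretic point made in point 4.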

### Code and Data.

All of the code is available from https://github.com/vegarant/Invfool.

## Acknowledgments

We thank Kristian Monsen Haug for help with *SI Appendix*, Fig. S3. We thank Dr. Cynthia McCollough, the Mayo Clinic, the American Association of Physicists in Medicine, and the National Institute of Biomedical Imaging and Bioengineering for allowing the use of their data in the experiments. F.R. acknowledges support from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie Grant Agreement 655282 and funds through FCT (Fundação para a Ciência e a Tecnologia, I.P.), under the Scientific Employment Stimulus – Individual Call – CEECIND/01970/2017. B.A. acknowledges support from Natural Sciences and Engineering Research Council of Canada (NSERC) Grant 611675. A.C.H. thanks Nvidia for a graphics processing unit (GPU) grant in the form of a Titan X Pascal and acknowledges support from a Royal Society University Research Fellowship, UK Engineering and Physical Sciences Research Council Grant EP/L003457/1, and a Leverhulme Prize 2017.

## Footnotes

^{1}To whom correspondence may be addressed. Email: ach70{at}cam.ac.uk.

Author contributions: B.A. and A.C.H. designed research; V.A., F.R., and C.P. performed research; V.A., F.R., C.P., B.A., and A.C.H. wrote the paper; and V.A., F.R., and C.P. wrote code.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "The Science of Deep Learning," held March 13–14, 2019, at the National Academy of Sciences in Washington, DC. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019 colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler’s husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at http://www.nasonline.org/science-of-deep-learning.

Data deposition: All of the code is available from GitHub at https://github.com/vegarant/Invfool.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1907377117/-/DCSupplemental.

Published under the PNAS license.

## References

2. V. Studer et al.
3. H. W. Engl, M. Hanke, A. Neubauer
4. P. C. Hansen (P. R. Johnston, Ed.)
6. B. Zhu, J. Z. Liu, S. F. Cauley, B. R. Rosen, M. S. Rosen
7. R. Strack
8. M. T. McCann, K. H. Jin, M. Unser
10. M. Mardani et al. (S. Bengio et al., Eds.)
11. J. Schlemper, J. Caballero, J. V. Hajnal, A. Price, D. Rueckert (M. Niethammer et al., Eds.)
13. C. McCollough
15. A. Lucas, M. Iliadis, R. Molina, A. K. Katsaggelos
16. M. Elad
17. R. Girshick, J. Donahue, T. Darrell, J. Malik
18. A. Krizhevsky, I. Sutskever, G. E. Hinton (F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger, Eds.)
19. B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger, Eds.)
20. K. He, X. Zhang, S. Ren, J. Sun
21. C. Kanbak, S.-M. Moosavi-Dezfooli, P. Frossard
22. S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard
23. S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, P. Frossard
24. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, R. Fergus
25. A. Fawzi, S.-M. Moosavi-Dezfooli, P. Frossard
29. B. Adcock, A. C. Hansen, C. Poon, B. Roman
30. H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, M. Unser
31. J. Adler, O. Öktem
32. K. C. Tezcan, C. F. Baumgartner, R. Luechinger, K. P. Pruessmann, E. Konukoglu, 38, 1633–1642 (2019)
33. S. A. Bigdeli, M. Zwicker, P. Favaro, M. Jin (I. Guyon et al., Eds.)
34. J. H. R. Chang, C.-L. Li, B. Poczos, B. V. K. Vijaya Kumar, A. C. Sankaranarayanan
