Linear Recursive Feature Machines provably recover low-rank matrices

Edited by Peter Bickel, University of California Berkeley, Berkeley, CA; received June 6, 2024; accepted February 5, 2025
March 28, 2025
122 (13) e2411325122

Significance

Understanding the mechanism by which neural networks learn features from data is a fundamental problem in machine learning. Our work explicitly connects the mechanism of neural feature learning to a widely used class of algorithms for sparse recovery called iteratively reweighted least squares (IRLS). Furthermore, our results suggest that nonlinear neural network training can be viewed as a nonlinear extension of the IRLS algorithm. We envision that these results will be fruitful both in understanding the effectiveness of the networks used today and in building the next generation of theoretically grounded models.
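For readers less familiar with IRLS, a standard form of the update (see, e.g., refs. 16 and 22 below) replaces a sparsity-promoting objective with a sequence of weighted least-squares problems. The display below is a generic textbook formulation in our own notation, not an equation taken from the paper:

```latex
% IRLS for sparse recovery: minimize \|x\|_p^p (0 < p \le 1) subject to Ax = b.
% Each iteration solves a weighted least-squares problem, then refreshes the weights:
x^{(k+1)} = \operatorname*{arg\,min}_{x \,:\, Ax = b} \; \sum_{i=1}^{d} w_i^{(k)} x_i^2,
\qquad
w_i^{(k)} = \Bigl( \bigl(x_i^{(k)}\bigr)^2 + \varepsilon_k \Bigr)^{p/2 - 1},
\quad \varepsilon_k \downarrow 0.
```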

Abstract

A fundamental problem in machine learning is to understand how neural networks make accurate predictions while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction—a process called feature learning. Recent work [A. Radhakrishnan, D. Beaglehole, P. Pandit, M. Belkin, Science 383, 1461–1467 (2024)] posited that the effects of feature learning can be elicited from a classical statistical estimator called the average gradient outer product (AGOP). The authors proposed Recursive Feature Machines (RFMs) as an algorithm that explicitly performs feature learning by alternating between 1) reweighting the feature vectors by the AGOP and 2) learning the prediction function in the transformed space. In this work, we develop theoretical guarantees for how RFM performs dimensionality reduction by focusing on the class of overparameterized problems arising in sparse linear regression and low-rank matrix recovery. Specifically, we show that RFM restricted to linear models (lin-RFM) reduces to a variant of the well-studied Iteratively Reweighted Least Squares (IRLS) algorithm. Furthermore, our results connect feature learning in neural networks to classical sparse recovery algorithms and shed light on how neural networks recover low-rank structure from data. In addition, we provide an implementation of lin-RFM that scales to matrices with millions of missing entries. Our implementation is faster than standard IRLS algorithms since it avoids forming singular value decompositions. It also outperforms deep linear networks for sparse linear regression and low-rank matrix completion.
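To make the alternation described in the abstract concrete, the sketch below illustrates a lin-RFM-style iteration for sparse linear regression, assuming a linear predictor f(x) = ⟨w, x⟩ so that the AGOP reduces to the outer product w wᵀ. The function name `lin_rfm_sketch`, the diagonal restriction of the reweighting matrix, the square-root power, and the smoothing constant `eps` are illustrative choices rather than the paper's exact specification.

```python
import numpy as np

def lin_rfm_sketch(X, y, n_iters=20, eps=1e-8):
    """Minimal sketch of a lin-RFM-style alternation for sparse linear regression.

    Alternates between (1) fitting a linear predictor on reweighted features
    x -> M x and (2) refreshing M from the gradient outer product of that
    predictor. For a linear model f(x) = <w, x>, the AGOP is w w^T; here M is
    restricted to a smoothed diagonal square root of it (an illustrative choice).
    """
    d = X.shape[1]
    m = np.ones(d)  # diagonal of the reweighting matrix M, initialized to the identity
    for _ in range(n_iters):
        Z = X * m  # step 1: rescale the features by the current M = diag(m)
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]  # minimum-norm least squares
        w = m * beta  # predictor in the original coordinates: f(x) = <w, x>
        # step 2: the diagonal of the AGOP w w^T is w_i^2; a smoothed square
        # root of it serves as the next reweighting
        m = np.sqrt(w**2 + eps)
    return w

# Tiny usage example on a synthetic 3-sparse problem (60 equations, 100 unknowns).
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 100))
w_true = np.zeros(100)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true
w_hat = lin_rfm_sketch(X, y)
print(np.round(w_hat[:5], 3))  # mass should concentrate on the first three coordinates
```

Restricting the reweighting to the diagonal of the AGOP makes the connection to coordinate-wise IRLS transparent: the minimum-norm fit in the rescaled coordinates is a weighted least-squares problem whose weights are determined by the previous iterate.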

Data, Materials, and Software Availability

All data used in the experiments are publicly available in perpetuity at https://github.com/aradha/lin-RFM (57). All other data are included in the article and/or SI Appendix.

Acknowledgments

A.R. is supported by the George F. Carrier Postdoctoral Fellowship in the School of Engineering and Applied Sciences at Harvard University. M.B. acknowledges support from the NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning (https://deepfoundations.ai/) through awards DMS-2031883 and #814639, as well as The Institute for Learning-Enabled Optimization at Scale (NSF CCF-2112665) and the Office of Naval Research (N000142412631). This work used 1) the Extreme Science and Engineering Discovery Environment, which is supported by NSF grant ACI-1548562, and 2) the Advanced Cyberinfrastructure Coordination Ecosystem: Services and Support, which is supported by NSF grants #2138259, #2138286, #2138307, #2137603, and #2138296. Specifically, we used resources from the San Diego Supercomputer Center Expanse Graphics Processing Unit compute nodes and the National Center for Supercomputing Applications Delta system via allocation TG-CIS220009. D.D. was supported by NSF DMS-2306322, NSF CCF-1740551, and AFOSR FA9550-24-1-0092 awards.

Author contributions

A.R., M.B., and D.D. designed research; performed research; and wrote the paper.

Competing interests

The authors declare no competing interest.

Supporting Information

Appendix 01 (PDF)

References

1
OpenAI, GPT-4 turbo: Enhanced version of GPT-4 language model (Online). OpenAI (2023). https://platform.openai.com/docs/models/gpt-4-turbo. Accessed 12 March 2025.
2
K. Tunyasuvunakool et al., Highly accurate protein structure prediction for the human proteome. Nature 596, 1–9 (2021).
3
Anthropic, Claude 2: The next generation language model (Online). claude.ai (2023). https://claude.ai/new. Accessed 12 March 2025.
4
J. Wright, Y. Ma, High-Dimensional Data Analysis with Low-dimensional Models: Principles, Computation, and Applications (Cambridge University Press, 2022).
5
D. L. Donoho, Compressed sensing. IEEE Trans. Inf. Theory 52, 1289–1306 (2006).
6
G. Yang, E. J. Hu, "Tensor Programs IV: Feature learning in infinite-width neural networks" in International Conference on Machine Learning (PMLR, 2021), pp. 11727–11737.
7
Z. Shi, J. Wei, Y. Liang, "A theoretical analysis on feature learning in neural networks: Emergence from inputs and advantage over fixed features" in International Conference on Learning Representations (2022).
8
A. Damian, J. Lee, M. Soltanolkotabi, “Neural networks can learn representations with gradient descent” in Conference on Learning Theory (PMLR, 2022), pp. 5413–5452.
9
S. Parkinson, G. Ongie, R. Willett, Linear neural network layers promote learning single-and multiple-index models. arXiv [Preprint] (2023). https://arxiv.org/abs/2305.15598v1 (Accessed 12 March 2025).
10
B. Woodworth et al., “Kernel and rich regimes in overparametrized models” in Conference on Learning Theory (PMLR, 2020), pp. 3635–3673.
11
S. Arora, N. Cohen, W. Hu, Y. Luo, Implicit regularization in deep matrix factorization. Adv. Neural Inf. Process. Syst. 32, 7413–7424 (2019).
12
A. Radhakrishnan, D. Beaglehole, P. Pandit, M. Belkin, Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science 383, 1461–1467 (2024).
13
D. Beaglehole, A. Radhakrishnan, P. Pandit, M. Belkin, Mechanism of feature learning in convolutional neural networks. arXiv [Preprint] (2023). https://arxiv.org/abs/2309.00570 (Accessed 12 March 2025).
14
S. Gunasekar, J. D. Lee, D. Soudry, N. Srebro, Implicit bias of gradient descent on linear convolutional networks. Adv. Neural Inf. Process. Syst. 31, 00468 (2018).
15
F. Shang et al., A unified scalable equivalent formulation for Schatten quasi-norms. Mathematics 8, 1325 (2020).
16
K. Mohan, M. Fazel, Iterative reweighted algorithms for matrix rank minimization. J. Mach. Learn. Res. 13, 3441–3473 (2012).
17
L. Ding, D. Drusvyatskiy, M. Fazel, Z. Harchaoui, Flat minima generalize for low-rank matrix recovery. Inf. Inference 13, iaae009 (2024).
18
M. S. Nacson, K. Ravichandran, N. Srebro, D. Soudry, “Implicit bias of the step size in linear diagonal neural networks” in International Conference on Machine Learning (PMLR, 2022), pp. 16270–16295.
19
D. L. Donoho, P. B. Stark, Uncertainty principles and signal recovery. SIAM J. Appl. Math. 49, 906–931 (1989).
20
R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288 (1996).
21
J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001).
22
I. Daubechies, R. DeVore, M. Fornasier, C. S. Güntürk, Iteratively reweighted least squares minimization for sparse recovery. Commun. Pure Appl. Math. 63, 1–38 (2010).
23
R. Chartrand, W. Yin, “Iteratively reweighted algorithms for compressive sensing” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2008), pp. 3869–3872.
24
G. Merle, H. Späth, Computational experiences with discrete lp-approximation. Computing 12, 315–321 (1974).
25
Y. Li, A globally convergent method for ℓp problems. SIAM J. Optim. 3, 609–629 (1993).
26
E. Candès, B. Recht, Exact matrix completion via convex optimization. Commun. ACM 55, 111–119 (2012).
27
E. J. Candès, T. Tao, The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theory 56, 2053–2080 (2010).
28
B. Recht, M. Fazel, P. A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52, 471–501 (2010).
29
J. F. Cai, E. J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20, 1956–1982 (2010).
30
M. Fornasier, H. Rauhut, R. Ward, Low-rank matrix recovery via iteratively reweighted least squares minimization. SIAM J. Optim. 21, 1614–1640 (2011).
31
E. J. Candès, T. Tao, Decoding by linear programming. IEEE Trans. Inf. Theory 51, 4203–4215 (2005).
32
S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, N. Srebro, Implicit regularization in matrix factorization. Adv. Neural Inf. Process. Syst. 30, 6151–6159 (2017).
33
Z. Dai, M. Karzand, N. Srebro, Representation costs of linear neural networks: Analysis and design. Adv. Neural Inf. Process. Syst. 34, 26884–26896 (2021).
34
A. Aghajanyan, L. Zettlemoyer, S. Gupta, Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv [Preprint] (2020). https://arxiv.org/abs/2012.13255 (Accessed 5 November 2024).
35
E. J. Hu et al., "LoRA: Low-rank adaptation of large language models" in International Conference on Learning Representations (2022).
36
E. Abbe, E. B. Adsera, T. Misiakiewicz, "The merged-staircase property: A necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks" in Conference on Learning Theory (PMLR, 2022), pp. 4782–4887.
37
A. Mousavi-Hosseini, S. Park, M. Girotti, I. Mitliagkas, M. A. Erdogdu, “Neural networks efficiently learn low-dimensional representations with SGD” in The Eleventh International Conference on Learning Representations (2023).
38
M. K. Warmuth, W. Kotłowski, E. Amid, “A case where a spindly two-layer linear network decisively outperforms any neural network with a fully connected input layer” in Algorithmic Learning Theory (PMLR, 2021), pp. 1214–1236.
39
A. Jacot, Bottleneck structure in learned features: Low-dimension vs. regularity tradeoff. Adv. Neural Inf. Process. Syst. 36, 23607–23629 (2023).
40
A. Jacot, “Implicit bias of large depth networks: A notion of rank for nonlinear functions” in International Conference on Learning Representations (2023).
41
A. S. Lewis, Derivatives of spectral functions. Math. Oper. Res. 21, 576–588 (1996).
42
M. Fazel, H. Hindi, S. P. Boyd, "Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices" in Proceedings of the 2003 American Control Conference (IEEE, 2003), vol. 3, pp. 2156–2162.
43
F. Bach, The "η-trick" or the effectiveness of reweighted least-squares (2019). https://francisbach.com/the-%CE%B7-trick-or-the-effectiveness-of-reweighted-least-squares/. Accessed 4 November 2024.
44
D. LeJeune, H. Javadi, R. Baraniuk, The flip side of the reweighted coin: Duality of adaptive dropout and regularization. Adv. Neural Inf. Process. Syst. 34, 23401–23412 (2021).
45
B. D. Rao, K. Kreutz-Delgado, An affine scaling methodology for best basis selection. IEEE Trans. Signal Process. 47, 187–200 (1999).
46
N. Razin, N. Cohen, Implicit regularization in deep learning may not be explainable by norms. Adv. Neural Inf. Process. Syst. 33, 21174–21187 (2020).
47
S. Arora, N. Cohen, E. Hazan, "On the optimization of deep networks: Implicit acceleration by overparameterization" in International Conference on Machine Learning, J. Dy, A. Krause, Eds. (PMLR, 2018), vol. 80, pp. 244–253.
48
S. Arora, N. Golowich, N. Cohen, W. Hu, “A convergence analysis of gradient descent for deep linear neural networks” in International Conference on Learning Representations (2019).
49
M. Turk, A. Pentland, Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71–86 (1991).
50
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 41, 391–407 (1990).
51
Wikipedia contributors, Nonlinear dimensionality reduction — Wikipedia, the free encyclopedia (2023). https://en.wikipedia.org/w/index.php?title=Nonlinear_dimensionality_reduction&oldid=1179642279. Accessed 16 December 2023.
52
Wikipedia contributors, Compressed sensing – Wikipedia, the free encyclopedia (2023). https://en.wikipedia.org/w/index.php?title=Compressed_sensing&oldid=1187577245. Accessed 16 December 2023.
53
S. Van Der Walt, S. C. Colbert, G. Varoquaux, The NumPy array: A structure for efficient numerical computation. Comput. Sci. Eng. 13, 22 (2011).
54
A. Paszke et al., PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
55
S. Diamond, S. Boyd, CVXPY: A Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. 17, 1–5 (2016).
56
A. Agrawal, R. Verschueren, S. Diamond, S. Boyd, A rewriting system for convex optimization problems. J. Control Decis. 5, 42–60 (2018).
57
A. Radhakrishnan, M. Belkin, D. Drusvyatskiy, lin-RFM. Zenodo. https://doi.org/10.5281/zenodo.15015325. Deposited 12 March 2025.

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences
Vol. 122 | No. 13
April 1, 2025
PubMed: 40153460

Submission history

Received: June 6, 2024
Accepted: February 5, 2025
Published online: March 28, 2025
Published in issue: April 1, 2025

Keywords

  1. neural networks
  2. matrix sensing
  3. feature learning
  4. sparse recovery


Notes

This article is a PNAS Direct Submission.

Authors

Affiliations

Adityanarayanan Radhakrishnan
Applied Math, Harvard University, Cambridge, MA 02138
Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA 02138
Mikhail Belkin1 [email protected]
Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, CA 92093
Dmitriy Drusvyatskiy
Mathematics Department, University of Washington, Seattle, WA 98195

Notes

1
To whom correspondence may be addressed. Email: [email protected].
