Large-scale distributed linear algebra with tensor processing units
Edited by David Weitz, Harvard University, Cambridge, MA; received December 16, 2021; accepted June 29, 2022
Significance
We demonstrate that Google TPUs, novel computer chips optimized for artificial intelligence, can also efficiently perform scientific computation at massive scale. The concrete examples of matrix multiplication, QR factorization, solution of linear systems, and application of matrix functions are benchmarked on matrices of linear size up to N = 2^20 = 1,048,576, achieving throughput around 90% of the single-core value at the largest values of N considered. We believe TPUs and TPU-like architectures will soon play a role in scientific computation analogous to that of the GPU, but at larger scale, and intend this study to catalyze that process.
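To make the benchmarked matrix-multiplication task concrete, the sketch below shows a SUMMA-style, ring-based distributed matmul written in JAX, the framework used in this work. It is a minimal illustration under our own assumptions (row sharding, a 1D device ring, N divisible by the device count, and function names of our choosing), not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of a SUMMA-style distributed
# matmul in JAX. A and B are sharded by rows across a ring of devices;
# row-blocks of B rotate around the ring while each core accumulates its
# rows of C with large local matmuls. Assumes N % jax.device_count() == 0.
import jax
import jax.numpy as jnp

n_dev = jax.device_count()

def ring_matmul(a_blk, b_blk):
    """Device i holds rows [i*m, (i+1)*m) of A and B; returns its rows of C."""
    me = jax.lax.axis_index('ring')
    m = a_blk.shape[0]

    def step(i, carry):
        c_blk, b_rot = carry
        k = (me + i) % n_dev  # index of the B row-block we hold this round
        c_blk = c_blk + jax.lax.dynamic_slice(a_blk, (0, k * m), (m, m)) @ b_rot
        # Pass our current B block one hop around the device ring.
        b_rot = jax.lax.ppermute(
            b_rot, 'ring', perm=[(j, (j - 1) % n_dev) for j in range(n_dev)])
        return c_blk, b_rot

    c0 = jnp.zeros((m, b_blk.shape[1]), a_blk.dtype)
    c_blk, _ = jax.lax.fori_loop(0, n_dev, step, (c0, b_blk))
    return c_blk

N = 1024  # toy size; the paper's benchmarks reach N in the millions
ka, kb = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(ka, (N, N), jnp.float32)
B = jax.random.normal(kb, (N, N), jnp.float32)
C = jax.pmap(ring_matmul, axis_name='ring')(
    A.reshape(n_dev, N // n_dev, N), B.reshape(n_dev, N // n_dev, N))
print(jnp.linalg.norm(C.reshape(N, N) - A @ B) / jnp.linalg.norm(A @ B))
```

Each pass of the loop performs one large local matrix multiplication, which is what allows the matrix-multiply units to dominate the runtime once the local blocks are big enough.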
Abstract
We have repurposed Google tensor processing units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs’ fast intercore interconnects (ICIs), physically two-dimensional network topology, and high-bandwidth memory (HBM) permit distributed matrix multiplication algorithms to rapidly become computationally bound. In this regime, the matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size: Operating in float32 precision, a full 2,048-core pod of third-generation TPUs can multiply two matrices with linear size N = 2^20 = 1,048,576 in about 2 min. Via curated algorithms emphasizing large, single-core matrix multiplications, other tasks in dense linear algebra can similarly scale. As examples, we present 1) QR decomposition; 2) solution of linear systems; and 3) the computation of matrix functions by polynomial iteration, demonstrated by the matrix polar factorization.
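As a concrete (and hedged) illustration of item 3, the sketch below computes the matrix polar factor with the classical Newton-Schulz polynomial iteration, which, like the iterations benchmarked in the paper, reduces entirely to matrix multiplication; it is a simpler relative of the dynamically weighted Nakatsukasa-Higham-type schemes used in practice, not the authors' exact algorithm.

```python
# A hedged sketch of "matrix functions by polynomial iteration": the
# classical Newton-Schulz iteration for the polar factor U in x = U h.
# It consists solely of matrix multiplications (hence MXU-friendly) but is
# a simpler relative of the iterations benchmarked in the paper.
from functools import partial

import jax
import jax.numpy as jnp

@partial(jax.jit, static_argnames='steps')
def polar_newton_schulz(x, steps=60):
    """Iterate x <- x (3I - x^T x) / 2, which pushes every singular value
    of x toward 1, so the iterates converge to the polar factor U."""
    x = x / jnp.linalg.norm(x)  # scale so all singular values are <= 1

    def body(x, _):
        return 1.5 * x - 0.5 * (x @ (x.T @ x)), None

    x, _ = jax.lax.scan(body, x, None, length=steps)
    return x

a = jax.random.normal(jax.random.PRNGKey(0), (256, 256), jnp.float32)
u = polar_newton_schulz(a)
print(jnp.linalg.norm(u.T @ u - jnp.eye(256)))  # near zero once converged
```

Plain Newton-Schulz needs many iterations when the smallest singular value of the scaled input is tiny; dynamically weighted iterations converge in a handful of steps, which is why they are preferred at the scales reported here.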
Data Availability
Some study data are available. (At present, we are not legally able to share the computer code implementing the described functionality. The benchmark data are essentially completely specified in the manuscript, but could be included in tabulated form if desired.)
Acknowledgments
This work would not have been possible without the at-times-heroic support from the Google teams associated with JAX and with Cloud TPUs, including but not limited to Skye Wanderman-Milne, Rasmus Larsen, Peter Hawkins, Adam Paszke, Stephan Hoyer, Sameer Agarwal, Matthew Johnson, Zak Stone, and James Bradbury. We thank Chase Riley Roberts, Jae Yoo, Megan Durney, Stefan Leichenauer, and the entire Sandbox@Alphabet team for early work, discussions, encouragement, and infrastructure support. This research was supported with Cloud TPUs from Google’s TPU Research Cloud. Sandbox is a team within the Alphabet family of companies, which includes Google, Verily, Waymo, X, and others. G.V. is a Canadian Institute for Advanced Research fellow in the Quantum Information Science Program and a Distinguished Visiting Research Chair at Perimeter Institute. Research at Perimeter Institute is supported by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Research, Innovation and Science.
Published in
PNAS 119 (33) e2122762119
Copyright
Copyright © 2022 the Author(s). Published by PNAS. This article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
Submission history
Received: December 16, 2021
Accepted: June 29, 2022
Published online: August 8, 2022
Published in issue: August 16, 2022
Notes
*We consider TPU network topologies with a 2:1 aspect ratio, so that local blocks of square matrices have a corresponding 1:2 aspect ratio.
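A hypothetical numeric illustration of this footnote (the grid shape and matrix size are ours, chosen for round numbers):

```python
# Hypothetical numbers: checkerboarding a square matrix over a 2:1
# processor grid yields local blocks with a 1:2 aspect ratio.
rows, cols = 64, 32          # 2:1 core grid (illustrative)
N = 2**20                    # linear size of a square matrix
blk = (N // rows, N // cols)
print(blk)                   # (16384, 32768), i.e., 1:2 local blocks
```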