Large-scale distributed linear algebra with tensor processing units

Edited by David Weitz, Harvard University, Cambridge, MA; received December 16, 2021; accepted June 29, 2022
August 8, 2022
119 (33) e2122762119

Significance

We demonstrate that Google TPUs, novel computer chips optimized for artificial intelligence, can also efficiently perform scientific computation at massive scale. The concrete examples of matrix multiplication, QR factorization, linear-system solution, and application of matrix functions are benchmarked on matrices of linear size up to N=1,048,576, sustaining per-core throughput of around 90% of the single-core value at the largest values of N considered. We believe TPUs and TPU-like architectures will soon play a role in scientific computation analogous to that of the GPU, but at larger scale, and we intend this study to help catalyze that process.
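
For orientation, here is a rough back-of-the-envelope estimate (our own illustrative arithmetic, based on the N = 2^20 problem size and the roughly 2-min runtime quoted in the Abstract below, not an additional benchmark). A dense N × N matrix multiplication costs about 2N^3 floating-point operations, so:

```latex
% Illustrative estimate only; not a benchmarked figure from this study.
\[
  2N^{3} = 2\,(2^{20})^{3} = 2^{61} \approx 2.3 \times 10^{18}\ \text{FLOPs},
  \qquad
  \frac{2.3 \times 10^{18}\ \text{FLOPs}}{\approx 120\ \text{s}}
  \approx 2 \times 10^{16}\ \text{FLOP/s}.
\]
```

That is, on the order of 20 PFLOP/s sustained across the full pod.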

Abstract

We have repurposed Google tensor processing units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs’ fast intercore interconnects (ICIs), physically two-dimensional network topology, and high-bandwidth memory (HBM) permit distributed matrix multiplication algorithms to rapidly become computationally bound. In this regime, the matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size: Operating in float32 precision, a full 2,048-core pod of third-generation TPUs can multiply two matrices with linear size N = 2^20 = 1,048,576 in about 2 min. Via curated algorithms emphasizing large, single-core matrix multiplications, other tasks in dense linear algebra can similarly scale. As examples, we present 1) QR decomposition; 2) resolution of linear systems; and 3) the computation of matrix functions by polynomial iteration, demonstrated by the matrix polar factorization.
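
As a concrete illustration of the polynomial-iteration approach mentioned above, the following is a minimal, single-device JAX sketch of the classical Newton-Schulz iteration for the polar factor. It is not the distributed implementation described in this paper; the function and parameter names (polar_newton_schulz, num_iters) are ours, and in the distributed setting each matrix product inside the loop would itself be a pod-scale multiplication.

```python
import jax.numpy as jnp

def polar_newton_schulz(a, num_iters=30):
    """Approximate polar decomposition a ≈ u @ p via Newton-Schulz iteration.

    Each step applies the matrix polynomial x <- 0.5 * x @ (3I - x^H x), which
    pushes the singular values of x toward 1 while preserving its singular
    vectors, so x converges to the unitary polar factor u.
    """
    n = a.shape[1]
    eye = jnp.eye(n, dtype=a.dtype)
    # Scale so every singular value starts inside (0, sqrt(3)), the
    # convergence region of the iteration.
    x = a / jnp.linalg.norm(a)
    for _ in range(num_iters):
        x = 0.5 * x @ (3.0 * eye - x.conj().T @ x)
    u = x                  # approximately unitary (orthogonal for real input)
    p = u.conj().T @ a     # Hermitian positive-semidefinite factor
    return u, p

# Usage sketch:
#   import jax
#   a = jax.random.normal(jax.random.PRNGKey(0), (256, 256), dtype=jnp.float32)
#   u, p = polar_newton_schulz(a)
```

The appeal of this family of methods is structural: a short loop whose cost is dominated by large dense matrix multiplications, which is exactly the workload the MXUs handle best.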

Data Availability

Some study data are available. (At present, we are not legally able to share the computer code implementing the described functionality. The benchmark data are essentially completely specified in the manuscript, but could be included in tabulated form if desired.)

Acknowledgments

This work would not have been possible without the at-times-heroic support from the Google teams associated with JAX and with Cloud TPUs, including but not limited to Skye Wanderman-Milne, Rasmus Larsen, Peter Hawkins, Adam Paszke, Stephan Hoyer, Sameer Agarwal, Matthew Johnson, Zak Stone, and James Bradbury. We thank Chase Riley Roberts, Jae Yoo, Megan Durney, Stefan Leichenauer, and the entire Sandbox@Alphabet team for early work, discussions, encouragement, and infrastructure support. This research was supported with Cloud TPUs from Google’s TPU Research Cloud. Sandbox is a team within the Alphabet family of companies, which includes Google, Verily, Waymo, X, and others. G.V. is a Canadian Institute for Advanced Research fellow in the Quantum Information Science Program and a Distinguished Visiting Research Chair at Perimeter Institute. Research at Perimeter Institute is supported by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Research, Innovation and Science.

Information & Authors

Published in

Proceedings of the National Academy of Sciences
Vol. 119 | No. 33
August 16, 2022
PubMed: 35939669

Submission history

Received: December 16, 2021
Accepted: June 29, 2022
Published online: August 8, 2022
Published in issue: August 16, 2022

Keywords

TPUs, scientific computation, linear algebra, distributed computing, ASICs

Notes

*We consider TPU network topologies with a 2:1 aspect ratio, so that local blocks of square matrices have a corresponding 1:2 aspect ratio.
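For instance (illustrative numbers, not a configuration used in the benchmarks): on a 64 × 32 grid of cores with one block per core, a square N × N matrix is stored in local blocks of shape (N/64) × (N/32), i.e., with a 1:2 aspect ratio.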

Authors

Adam G. M. Lewis
Simulation & Optimization Team, Sandbox AQ, Palo Alto, CA 94301; Sandbox Alphabet X, The Moonshot Factory, Mountain View, CA 94043
Jackson Beall
Simulation & Optimization Team, Sandbox AQ, Palo Alto, CA 94301; Sandbox Alphabet X, The Moonshot Factory, Mountain View, CA 94043
Martin Ganahl
Simulation & Optimization Team, Sandbox AQ, Palo Alto, CA 94301; Sandbox Alphabet X, The Moonshot Factory, Mountain View, CA 94043
Markus Hauru
Sandbox Alphabet X, The Moonshot Factory, Mountain View, CA 94043
Shrestha Basu Mallick
Sandbox Alphabet X, The Moonshot Factory, Mountain View, CA 94043
Guifre Vidal
Sandbox Alphabet X, The Moonshot Factory, Mountain View, CA 94043; Google Quantum AI, Google LLC, Santa Barbara, CA 93111

Notes

1. To whom correspondence may be addressed. Email: [email protected].
Author contributions: A.G.M.L., M.G., M.H., S.B.M., and G.V. designed research; A.G.M.L., J.B., M.G., M.H., S.B.M., and G.V. performed research; A.G.M.L. and G.V. analyzed data; A.G.M.L. wrote the paper; and A.G.M.L., J.B., M.G., and M.H. wrote code.
A.G.M.L., J.B., M.G., M.H., S.B.M., and G.V. performed this research while employees of Alphabet Inc., the proprietor of the TPU architecture.
