# Inference and uncertainty quantification for noisy matrix completion

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved October 8, 2019 (received for review June 11, 2019)

## Significance

Matrix completion finds numerous applications in data science, ranging from information retrieval to medical imaging. While substantial progress has been made in designing estimation algorithms, it remains unknown how to perform optimal statistical inference on the unknown matrix given the obtained estimates—a task at the core of modern decision making. We propose procedures to debias the popular convex and nonconvex estimators and derive distributional characterizations for the resulting debiased estimators. This distributional theory enables valid inference on the unknown matrix. Our procedures 1) yield optimal construction of confidence intervals for missing entries and 2) achieve optimal estimation accuracy in a sharp manner.

## Abstract

Noisy matrix completion aims at estimating a low-rank matrix given only partial and corrupted entries. Despite remarkable progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained estimates and how to perform efficient statistical inference on the unknown matrix (e.g., constructing a valid and short confidence interval for an unseen entry). This paper takes a substantial step toward addressing such tasks. We develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting debiased estimators admit nearly precise nonasymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals/regions for, say, the missing entries and the low-rank factors. Our inferential procedures do not require sample splitting, thus avoiding unnecessary loss of data efficiency. As a byproduct, we obtain a sharp characterization of the estimation accuracy of our debiased estimators in both rate and constant. Our debiased estimators are tractable algorithms that provably achieve full statistical efficiency.


Low-rank matrix completion is concerned with recovering a low-rank matrix when only a small fraction of its entries are revealed (1–3). The importance of this problem cannot be overstated, due to its broad applications in, e.g., recommendation systems, sensor network localization, magnetic resonance imaging, computer vision, large covariance estimation, and latent factor learning. Tackling this problem in large-scale applications is computationally challenging, a consequence of the intrinsic nonconvexity incurred by the low-rank structure. To further complicate matters, another inevitable challenge stems from the imperfection of data acquisition mechanisms, wherein the acquired samples are usually contaminated by a certain amount of noise.

Fortunately, if the entries of the unknown matrix are sufficiently delocalized and randomly revealed, this problem may not be as hard as it seems. Substantial progress has been made over the past several years in designing computationally tractable algorithms—including both convex and nonconvex approaches—that allow one to fill in unseen entries faithfully from partial noisy samples (4–13). Nevertheless, modern decision making often requires one step further: It demands not merely a faithful estimate, but also a quantification of the uncertainty or “confidence” of the provided estimate, ideally in a reasonably accurate fashion. For instance, given an estimate returned by the convex approach, how does one use it to identify a short interval that is likely to contain a missing entry?

Conducting effective uncertainty quantification for noisy matrix completion is, however, far from straightforward. For the most part, state-of-the-art matrix completion algorithms require solving highly complex optimization problems that often do not admit closed-form solutions. As a result, it is extremely challenging to pin down the distributions of the estimates returned by these algorithms. The lack of distributional characterizations presents a major roadblock to performing valid, yet efficient, statistical inference on the unknown matrix of interest.

It is worth noting that a number of recent papers have been dedicated to inference and uncertainty quantification for various high-dimensional problems, including Lasso (14–18), generalized linear models (17, 19), and graphical models (20, 21), among others. Very little work, however, has looked into noisy matrix completion along this direction. While nonasymptotic statistical guarantees for noisy matrix completion have been derived in prior theory, the existing estimation error bounds are supplied only at an order-wise level. Such order-wise error bounds either lose a significant factor relative to the optimal guarantees or come with an unspecified (but often enormous) preconstant. Viewed in this light, a confidence region constructed directly based on such results is bound to be overly conservative, resulting in substantial overcoverage.

## A Glimpse of Our Main Contributions

This paper takes a substantial step toward statistically optimal inference and uncertainty quantification for noisy matrix completion. Specifically, we develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting debiased estimators admit nearly accurate nonasymptotic distributional guarantees. Such distributional characterizations in turn allow us to reason about the uncertainty of the obtained estimates vis-à-vis the unknown matrix. For instance, we can construct 1) confidence intervals for each entry—either observed or missing—of the unknown matrix and 2) confidence regions for the low-rank factors of interest (modulo some global ambiguity), both of which are provably optimal. As a byproduct, we characterize the Euclidean estimation errors of the proposed debiased estimators, which match the optimal statistical efficiency precisely, including the preconstant. This theory demonstrates that a computationally feasible algorithm can achieve the best possible statistical accuracy for noisy matrix completion.

## Models and Notation

To cast noisy matrix completion in concrete statistical settings, we adopt a model commonly studied in the literature.

### Ground Truth.

We are interested in estimating an unknown rank-$r$ matrix ${\mathit{M}}^{\star}\in {\mathbb{R}}^{n\times n}$, whose rank-$r$ singular-value decomposition (SVD) is given by ${\mathit{M}}^{\star}={\mathit{U}}^{\star}{\mathbf{\Sigma}}^{\star}{\mathit{V}}^{\star \top}$. For convenience, let ${\mathit{X}}^{\star}\triangleq {\mathit{U}}^{\star}{\mathbf{\Sigma}}^{\star 1/2}\in {\mathbb{R}}^{n\times r}$ and ${\mathit{Y}}^{\star}\triangleq {\mathit{V}}^{\star}{\mathbf{\Sigma}}^{\star 1/2}\in {\mathbb{R}}^{n\times r}$ be the balanced low-rank factors of ${\mathit{M}}^{\star}$, which obey

$${\mathit{X}}^{\star \top}{\mathit{X}}^{\star}={\mathit{Y}}^{\star \top}{\mathit{Y}}^{\star}={\mathbf{\Sigma}}^{\star}\hspace{1em}\text{and}\hspace{1em}{\mathit{M}}^{\star}={\mathit{X}}^{\star}{\mathit{Y}}^{\star \top}.$$

[1]

Denote by ${\sigma}_{i}\left({\mathit{M}}^{\star}\right)$ the $i$th largest singular value of ${\mathit{M}}^{\star}$. Set

$${\sigma}_{max}\triangleq {\sigma}_{1}\left({\mathit{M}}^{\star}\right),\hspace{1em}{\sigma}_{min}\triangleq {\sigma}_{r}\left({\mathit{M}}^{\star}\right),\hspace{1em}\text{and}\hspace{1em}\kappa \triangleq {\sigma}_{max}/{\sigma}_{min}.$$

[2]

### Observation Models.

What we observe is a randomly subsampled and corrupted subset of the entries of ${\mathit{M}}^{\star}$; namely,

$${M}_{ij}={M}_{ij}^{\star}+{E}_{ij},\hspace{1em}{E}_{ij}\stackrel{\mathrm{i}.\mathrm{i}.\mathrm{d}.}{\sim}\mathcal{N}\left(0,{\sigma}^{2}\right),\hspace{1em}\text{for all}\hspace{0.17em}\left(i,j\right)\in \mathrm{\Omega},$$

[3]

where $\mathrm{\Omega}\subseteq \left\{1,\cdots ,n\right\}\times \left\{1,\cdots ,n\right\}$ is a small set of indexes, and ${E}_{ij}$ denotes independently generated noise at location $\left(i,j\right)$. From now on, we assume the random sampling model where each index $\left(i,j\right)$ is included in $\mathrm{\Omega}$ independently with probability $p$ (i.e., data are missing uniformly at random). We use ${\mathcal{P}}_{\mathrm{\Omega}}\left(\cdot \right)$ to represent the orthogonal projection onto the subspace of matrices that vanish outside the index set $\mathrm{\Omega}$.
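As an illustration, the observation model Eq. **3** can be simulated in a few lines of NumPy; the dimensions, rank, sampling rate, and noise level below are purely illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p, sigma = 50, 3, 0.5, 0.1  # illustrative values only

# Ground truth: a rank-r matrix M* = X* Y*^T.
X_star = rng.standard_normal((n, r))
Y_star = rng.standard_normal((n, r))
M_star = X_star @ Y_star.T

# Each index (i, j) enters Omega independently with probability p;
# observed entries are corrupted by i.i.d. N(0, sigma^2) noise (Eq. 3).
Omega = rng.random((n, n)) < p
E = sigma * rng.standard_normal((n, n))
M_obs = np.where(Omega, M_star + E, 0.0)  # P_Omega(M* + E)
```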

### Incoherence Conditions.

Clearly, not all matrices can be reliably estimated from a highly incomplete set of measurements. To address this issue, we impose a standard incoherence condition (2) on the singular subspaces of ${\mathit{M}}^{\star}$ (i.e., ${\mathit{U}}^{\star}$ and ${\mathit{V}}^{\star}$),

$$max\left\{{\u2225{\mathit{U}}^{\star}\u2225}_{2,\infty},{\u2225{\mathit{V}}^{\star}\u2225}_{2,\infty}\right\}\le \sqrt{\mu r/n},$$

[4]

where $\mu$ is termed the incoherence parameter and ${\Vert \mathit{A}\Vert}_{2,\infty}$ denotes the largest ${\ell}_{2}$ norm of all rows in $\mathit{A}$. A small $\mu$ implies that the energies of ${\mathit{U}}^{\star}$ and ${\mathit{V}}^{\star}$ are reasonably spread out across all of their rows.
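To make Eq. **4** concrete, the smallest $\mu$ for which it holds can be read off the rank-$r$ SVD. The helper below (our own naming, not from the paper) returns that value, which always lies between $1$ (perfectly spread out) and $n/r$ (maximally spiky).

```python
import numpy as np

def incoherence(M, r):
    """Smallest mu such that Eq. 4 holds: the largest row l2-norm of the
    rank-r singular subspaces U*, V* equals sqrt(mu * r / n)."""
    n = M.shape[0]
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    row_norm = max(np.linalg.norm(U[:, :r], axis=1).max(),
                   np.linalg.norm(Vt[:r, :], axis=0).max())  # rows of V
    return n * row_norm ** 2 / r

rng = np.random.default_rng(1)
n, r = 100, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mu = incoherence(M, r)  # random Gaussian factors are mildly incoherent
```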

### Asymptotic Notation.

$f\left(n\right)\lesssim h\left(n\right)$ (or $f\left(n\right)=O\left(h\left(n\right)\right)$) means $\left|f\left(n\right)\right|\le {c}_{1}\left|h\left(n\right)\right|$ for some constant ${c}_{1}>0$; $f\left(n\right)\gtrsim h\left(n\right)$ means $\left|f\left(n\right)\right|\ge {c}_{2}\left|h\left(n\right)\right|$ for some constant ${c}_{2}>0$; $f\left(n\right)\asymp h\left(n\right)$ means ${c}_{2}\left|h\left(n\right)\right|\le \left|f\left(n\right)\right|\le {c}_{1}\left|h\left(n\right)\right|$ for some constants ${c}_{1},{c}_{2}>0$; and $f\left(n\right)=o\left(h\left(n\right)\right)$ means $\underset{n\to \infty}{lim}f\left(n\right)/h\left(n\right)=0$.

## Inferential Procedures and Main Results

The proposed inferential procedure builds on 2 of the most popular matrix completion paradigms: convex relaxation and nonconvex optimization. Recognizing the complicated bias of these 2 highly nonlinear estimators and motivated by refs. 14, 15, and 17, we first illustrate how to perform bias correction, followed by a theory that establishes the near-Gaussianity and optimality of the proposed debiased estimators.

**Algorithm 1.** Gradient descent for solving the nonconvex program Eq. **7**.

Suitable initialization: ${\mathbf{X}}^{0}$, ${\mathbf{Y}}^{0}$ (*SI Appendix*)

Gradient updates: for $t=\mathrm{0,1},\dots ,{t}_{0}-1$ do

$${\mathbf{X}}^{t+1}={\mathbf{X}}^{t}-\frac{\eta}{p}\left[{\mathcal{P}}_{\mathrm{\Omega}}\left({\mathbf{X}}^{t}{\mathbf{Y}}^{t\top}-\mathbf{M}\right){\mathbf{Y}}^{t}+\lambda {\mathbf{X}}^{t}\right],$$

[6a]

$${\mathbf{Y}}^{t+1}={\mathbf{Y}}^{t}-\frac{\eta}{p}\left[{\left[{\mathcal{P}}_{\mathrm{\Omega}}\left({\mathbf{X}}^{t}{\mathbf{Y}}^{t\top}-\mathbf{M}\right)\right]}^{\top}{\mathbf{X}}^{t}+\lambda {\mathbf{Y}}^{t}\right],$$

[6b]

where $\eta >0$ determines the step size (i.e., the learning rate).
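A minimal NumPy sketch of these updates follows. The spectral initialization used here is one common choice standing in for the paper's initialization (detailed in *SI Appendix*), and the step size and iteration count are illustrative.

```python
import numpy as np

def nonconvex_gd(M_obs, Omega, r, p, lam, eta, t0):
    """Gradient descent on the nonconvex objective Eq. 7 (updates 6a/6b)."""
    # Spectral initialization: rank-r factors of the rescaled observations.
    U, s, Vt = np.linalg.svd(M_obs / p, full_matrices=False)
    X = U[:, :r] * np.sqrt(s[:r])
    Y = Vt[:r, :].T * np.sqrt(s[:r])
    for _ in range(t0):
        resid = np.where(Omega, X @ Y.T - M_obs, 0.0)  # P_Omega(X Y^T - M)
        X, Y = (X - (eta / p) * (resid @ Y + lam * X),
                Y - (eta / p) * (resid.T @ X + lam * Y))
    return X, Y
```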

### Background: Convex and Nonconvex Estimation Algorithms.

We first review in passing 2 computationally feasible estimation algorithms that are arguably the most widely used in practice. They serve as the starting point for us to design inferential procedures for noisy low-rank matrix completion.

### Convex Relaxation.

Recall that the rank function is highly nonconvex, which often prevents us from computing a rank-constrained estimator in polynomial time. For the sake of computational feasibility, prior works suggest relaxing the rank function into its convex surrogate (22); for example, one can consider the penalized least-squares convex program

$$\underset{\mathit{Z}\in {\mathbb{R}}^{n\times n}}{\mathrm{minimize}}\hspace{1em}\frac{1}{2}{\u2225{\mathcal{P}}_{\mathrm{\Omega}}\left(\mathit{Z}-\mathit{M}\right)\u2225}_{\mathrm{F}}^{2}+\lambda {\Vert \mathit{Z}\Vert}_{*}.$$

[5]
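The convex program Eq. **5** has no closed-form solution. One standard solver (a generic choice, not necessarily the one used in the cited references) is proximal gradient descent, whose proximal step soft-thresholds the singular values; a sketch:

```python
import numpy as np

def svt(B, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def solve_convex(M_obs, Omega, lam, iters=300):
    """Proximal gradient on Eq. 5; unit step size is valid since the smooth
    part Z -> 0.5 * ||P_Omega(Z - M)||_F^2 has Lipschitz constant 1."""
    Z = np.zeros_like(M_obs)
    for _ in range(iters):
        grad = np.where(Omega, Z - M_obs, 0.0)  # gradient of the smooth part
        Z = svt(Z - grad, lam)                  # proximal (soft-threshold) step
    return Z
```

With all entries observed, the iteration converges to the singular-value soft-thresholding of $\mathit{M}$, which makes the shrinkage effect of the nuclear-norm penalty visible.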

In Eq. **5**, ${\Vert \cdot \Vert}_{*}$ is the nuclear norm (the sum of singular values, which is a convex surrogate of the rank function), and $\lambda >0$ is some regularization parameter. Under mild conditions, the solution to the convex program Eq. **5** attains appealing estimation accuracy (in an order-wise sense), provided that a proper regularization parameter $\lambda$ is adopted (4, 13).

### Nonconvex Optimization.

It is recognized that the convex approach, which typically relies on solving a semidefinite program, remains expensive and does not scale to large dimensions. This motivates an alternative route, which represents the matrix variable via 2 low-rank factors $\mathit{X},\mathit{Y}\in {\mathbb{R}}^{n\times r}$ and attempts to solve the following nonconvex program directly:

$$\underset{\mathit{X},\mathit{Y}\in {\mathbb{R}}^{n\times r}}{\mathrm{minimize}}\hspace{1em}\frac{1}{2}{\u2225{\mathcal{P}}_{\mathrm{\Omega}}\left(\mathit{X}{\mathit{Y}}^{\top}-\mathit{M}\right)\u2225}_{\mathrm{F}}^{2}+\frac{\lambda}{2}{\Vert \mathit{X}\Vert}_{\mathrm{F}}^{2}+\frac{\lambda}{2}{\Vert \mathit{Y}\Vert}_{\mathrm{F}}^{2}.$$

[7]

In Eq. **7**, the regularizer $0.5\lambda \left({\Vert \mathit{X}\Vert}_{\mathrm{F}}^{2}+{\Vert \mathit{Y}\Vert}_{\mathrm{F}}^{2}\right)$ is chosen primarily to mimic the nuclear norm $\lambda {\Vert \mathit{Z}\Vert}_{*}$ (23, 24). A variety of optimization algorithms have been proposed to tackle the nonconvex program Eq. **7** or its variants (7, 10, 11, 25); readers are referred to ref. 26 for a recent overview. As a prominent example, a 2-stage algorithm—gradient descent following suitable initialization—provably enjoys fast convergence for a wide range of scenarios (11, 13). The present paper focuses on this simple yet powerful algorithm, as documented in *Algorithm 1* and detailed in *SI Appendix*.

### Intimate Connections between Convex and Nonconvex Estimates.

Denote by ${\mathit{Z}}^{\mathrm{cvx}}$ any minimizer of the convex program Eq. **5** and by $\left({\mathit{X}}^{\hspace{0.17em}\mathrm{ncvx}},{\mathit{Y}}^{\hspace{0.17em}\mathrm{ncvx}}\right)$ the estimate returned by *Algorithm 1* aimed at solving Eq. **7**. As was recently shown in ref. 13, when the regularization parameter $\lambda$ is properly chosen, these 2 estimates provably obey (see *SI Appendix* for a precise statement)

$${\mathit{X}}^{\mathrm{ncvx}}{\mathit{Y}}^{\hspace{0.17em}\mathrm{ncvx}\top}\approx {\mathit{Z}}^{\mathrm{cvx}}.$$

[8]

In truth, the 2 matrices in Eq. **8** are exceedingly close to, if not identical with, each other. This salient feature paves the way for a unified treatment of convex and nonconvex approaches: Most inferential procedures and guarantees developed for the nonconvex estimate can be readily transferred to perform inference for the convex one, and vice versa.

### Constructing Debiased Estimators.

We are now well equipped to describe how to construct estimators based on the convex estimate ${\mathit{Z}}^{\mathrm{cvx}}$ and the nonconvex estimate $\left({\mathit{X}}^{\hspace{0.17em}\mathrm{ncvx}},{\mathit{Y}}^{\hspace{0.17em}\mathrm{ncvx}}\right)$, to enable efficient inference. Motivated by the proximity of the convex and nonconvex estimates and for the sake of conciseness, we abuse notation by using $\mathit{Z},\mathit{X},\mathit{Y}$ for both convex and nonconvex estimates; see Table 1 and *SI Appendix* for precise definitions. This allows us to unify the presentation for both convex and nonconvex estimators.

Table 1.

Notation | Definition |
---|---|
$\mathbf{Z}\in {\mathbb{R}}^{n\times n}$ | Either ${\mathbf{Z}}^{\mathrm{cvx}}$ or ${\mathbf{X}}^{\hspace{0.17em}\mathrm{ncvx}}{\mathbf{Y}}^{\mathrm{ncvx}\top}$ |
$\mathbf{X},\mathbf{Y}\in {\mathbb{R}}^{n\times r}$ | For the nonconvex case, we take $\mathbf{X}={\mathbf{X}}^{\hspace{0.17em}\mathrm{ncvx}}$ and $\mathbf{Y}={\mathbf{Y}}^{\hspace{0.17em}\mathrm{ncvx}}$; for the convex case, let $\mathbf{X}={\mathbf{X}}^{\mathrm{cvx}}$ and $\mathbf{Y}={\mathbf{Y}}^{\mathrm{cvx}}$, which are the balanced low-rank factors of ${\mathbf{Z}}^{\mathrm{cvx},r}$ obeying ${\mathbf{Z}}^{\mathrm{cvx},r}={\mathbf{X}}^{\mathrm{cvx}}{\mathbf{Y}}^{\mathrm{cvx}\top}$ and ${\mathbf{X}}^{\mathrm{cvx}\top}{\mathbf{X}}^{\mathrm{cvx}}={\mathbf{Y}}^{\mathrm{cvx}\top}{\mathbf{Y}}^{\mathrm{cvx}}$ |
${\mathbf{M}}^{\hspace{0.17em}\mathrm{d}}\in {\mathbb{R}}^{n\times n}$ | The proposed debiased estimator as in Eq. **10** |
${\mathbf{X}}^{\hspace{0.17em}\mathrm{d}},{\mathbf{Y}}^{\hspace{0.17em}\mathrm{d}}\in {\mathbb{R}}^{n\times r}$ | The proposed debiased estimator as in Eq. **11** |

Here, ${\mathbf{Z}}^{\mathrm{cvx},r}={\mathcal{P}}_{\text{rank}-r}\left({\mathbf{Z}}^{\mathrm{cvx}}\right)$ is the best rank-$r$ approximation of ${\mathbf{Z}}^{\mathrm{cvx}}$. See *SI Appendix* for a complete summary.

Given that Eqs. **5** and **7** are both regularized least-squares problems, they behave effectively like shrinkage estimators, indicating that the provided estimates necessarily suffer from nonnegligible bias. To enable the desired statistical inference, it is natural to first correct the estimation bias.

#### A debiased estimator for the matrix.

A natural debiasing strategy that immediately comes to mind is the simple linear transformation (recall the notation in Table 1)

$$\begin{array}{ll}\hfill {\mathit{Z}}^{0}& \triangleq \mathit{Z}-{p}^{-1}{\mathcal{P}}_{\mathrm{\Omega}}\left(\mathit{Z}-\mathit{M}\right)\hfill \\ \hfill & =\underset{\text{mean:}\hspace{0.17em}{\mathit{M}}^{\star}}{\underbrace{{p}^{-1}{\mathcal{P}}_{\mathrm{\Omega}}\left({\mathit{M}}^{\star}\right)}}+\underset{\text{mean:}\hspace{0.17em}\mathbf{0}}{\underbrace{{p}^{-1}{\mathcal{P}}_{\mathrm{\Omega}}\left(\mathit{E}\right)}}+\underset{\text{mean:}\hspace{0.17em}\mathbf{0}\left(\text{heuristically}\right)}{\underbrace{\mathit{Z}-{p}^{-1}{\mathcal{P}}_{\mathrm{\Omega}}\left(\mathit{Z}\right)}},\hfill \end{array}$$

[9]

where we identify ${\mathcal{P}}_{\mathrm{\Omega}}\left(\mathit{M}\right)$ with ${\mathcal{P}}_{\mathrm{\Omega}}\left({\mathit{M}}^{\star}\right)+{\mathcal{P}}_{\mathrm{\Omega}}\left(\mathit{E}\right)$. Heuristically, if $\mathrm{\Omega}$ and $\mathit{Z}$ are statistically independent, then ${\mathit{Z}}^{0}$ serves as an unbiased estimator of ${\mathit{M}}^{\star}$, i.e., $\mathbb{E}\left[{\mathit{Z}}^{0}\right]={\mathit{M}}^{\star}$; this arises since the noise $\mathit{E}$ has zero mean and $\mathbb{E}\left[{\mathcal{P}}_{\mathrm{\Omega}}\right]=p\mathcal{I}$ under the uniform random sampling model, with $\mathcal{I}$ the identity operator. Despite its (near) unbiasedness at a heuristic level, however, the matrix ${\mathit{Z}}^{0}$ is typically full rank, with nonnegligible energy spread across its entire spectrum. This results in dramatically increased variability in the estimate, which is undesirable for inferential purposes.

To remedy this issue, we propose to further project ${\mathit{Z}}^{0}$ onto the set of rank-$r$ matrices, leading to the estimator

$${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}\triangleq {\mathcal{P}}_{\text{rank}-r}\left[\mathit{Z}-\frac{1}{p}{\mathcal{P}}_{\mathrm{\Omega}}\left(\mathit{Z}-\mathit{M}\right)\right],$$

[10]
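In NumPy, the two-step construction of Eq. **10** (the linear correction of Eq. **9**, followed by the rank-$r$ projection) can be sketched as:

```python
import numpy as np

def rank_r_approx(B, r):
    """Best rank-r approximation P_rank-r(B), via the truncated SVD."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

def debias(Z, M_obs, Omega, p, r):
    """Debiased estimator Eq. 10: correct the shrinkage bias of Z, then
    project back onto the set of rank-r matrices."""
    Z0 = Z - np.where(Omega, Z - M_obs, 0.0) / p  # the correction of Eq. 9
    return rank_r_approx(Z0, r)
```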

In Eq. **10**, ${\mathcal{P}}_{\text{rank}-r}\left(\mathit{B}\right)\triangleq \mathrm{arg}\underset{\mathit{A}:\text{rank}\left(\mathit{A}\right)\le r}{min}{\Vert \mathit{A}-\mathit{B}\Vert}_{\mathrm{F}}$, and $\mathit{Z}$ can again be found in Table 1. This projection step effectively suppresses the variability outside the $r$-dimensional principal subspace. As we shall demonstrate, the proposed estimator Eq. **10** provably debiases the provided estimate $\mathit{Z}$, while optimally controlling the extent of variability.

#### An equivalent perspective on the low-rank factors.

As it turns out, the debiased estimator Eq. **10** admits another almost equivalent representation that offers further insights. Specifically, we consider the following estimator for the low-rank factors,

$${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}\triangleq \mathit{X}{\left({\mathit{I}}_{r}+{p}^{-1}\lambda \hspace{0.17em}{\left({\mathit{X}}^{\top}\mathit{X}\right)}^{-1}\right)}^{1/2},$$

[11a]

$${\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}\triangleq \mathit{Y}{\left({\mathit{I}}_{r}+{p}^{-1}\lambda \hspace{0.17em}{\left({\mathit{Y}}^{\top}\mathit{Y}\right)}^{-1}\right)}^{1/2},$$

[11b]
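Eq. **11** amounts to inflating the spectrum of each factor; a NumPy sketch, with a small helper for the symmetric matrix square root:

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(A)
    return (Q * np.sqrt(w)) @ Q.T

def deshrink(X, Y, lam, p):
    """Debiased low-rank factors Eq. 11: deshrink the spectra of X and Y
    to undo the shrinkage induced by the lambda-regularization."""
    r = X.shape[1]
    Xd = X @ psd_sqrt(np.eye(r) + (lam / p) * np.linalg.inv(X.T @ X))
    Yd = Y @ psd_sqrt(np.eye(r) + (lam / p) * np.linalg.inv(Y.T @ Y))
    return Xd, Yd
```

In the balanced scenario $\mathit{X}=\mathit{U}{\mathbf{\Sigma}}^{1/2}$ discussed in the text, this reproduces ${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}=\mathit{U}{\left(\mathbf{\Sigma}+\left(\lambda /p\right){\mathit{I}}_{r}\right)}^{1/2}$ exactly.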

In Eq. **11**, we recall the definition of $\mathit{X}$ and $\mathit{Y}$ in Table 1. To develop some intuition, let us look at a simple scenario where $\mathit{U}\mathbf{\Sigma}{\mathit{V}}^{\top}$ is the rank-$r$ SVD of $\mathit{X}{\mathit{Y}}^{\top}$ and $\mathit{X}=\mathit{U}{\mathbf{\Sigma}}^{1/2}$, $\mathit{Y}=\mathit{V}{\mathbf{\Sigma}}^{1/2}$. It is then self-evident that ${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}=\mathit{U}{\left(\mathbf{\Sigma}+\left(\lambda /p\right){\mathit{I}}_{r}\right)}^{1/2}$ and ${\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}=\mathit{V}{\left(\mathbf{\Sigma}+\left(\lambda /p\right){\mathit{I}}_{r}\right)}^{1/2}$. In words, ${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}$ and ${\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}$ are obtained by properly deshrinking the spectra of $\mathit{X}$ and $\mathit{Y}$.

As we formally establish in *SI Appendix*, the estimator Eq. **11** for the low-rank factors is extremely close to the debiased estimator Eq. **10** for the whole matrix, in the sense that

$${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}\approx {\mathit{X}}^{\hspace{0.17em}\mathrm{d}}{\mathit{Y}}^{\mathrm{d}\top}.$$

[12]

### Main Results: Distributional Guarantees.

The proposed estimators admit tractable distributional characterizations in the large-$n$ regime, which facilitates the construction of confidence regions for many quantities of interest. In particular, this paper centers around 2 types of inferential problems:

1) Each entry of the matrix ${\mathit{M}}^{\star}$: The entry can be either missing (i.e., predicting an unseen entry) or observed (i.e., denoising an observed entry). For example, in the problem of sensor localization (27), one wants to infer the distance between any 2 sensors, given partially revealed distances. Mathematically, this seeks to determine the distribution of

$${M}_{ij}^{\mathrm{d}}-{M}_{ij}^{\star},\hspace{1em}\text{for all}1\le i,j\le n.$$

[13]

2) The low-rank factors ${\mathit{X}}^{\star},{\mathit{Y}}^{\star}\in {\mathbb{R}}^{n\times r}$: The low-rank factors often reveal critical information about the applications of interest [e.g., community memberships of each individual in the community detection problem (28), angles between each object and a global reference point in the angular synchronization problem (29), or factor loadings and latent factors in factor analysis (30)]. Recognizing the global rotational ambiguity issue, we aim to pin down the distributions of ${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}$ and ${\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}$ up to global rotation. More precisely, we intend to characterize the distributions of

$${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{X}}^{\star}\hspace{1em}\text{and}\hspace{1em}{\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{Y}}^{\star}$$

[14]

for the global rotation matrix ${\mathit{H}}^{\hspace{0.17em}\mathrm{d}}\in {\mathbb{R}}^{r\times r}$ that best “aligns” $\left({\mathit{X}}^{\hspace{0.17em}\mathrm{d}},{\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}\right)$ and $\left({\mathit{X}}^{\star},{\mathit{Y}}^{\star}\right)$, i.e.,

$${\mathit{H}}^{\hspace{0.17em}\mathrm{d}}\triangleq \mathrm{arg}\underset{\mathit{R}\in {\mathcal{O}}^{r\times r}}{min}{\u2225{\mathit{X}}^{\hspace{0.17em}\mathrm{d}}\mathit{R}-{\mathit{X}}^{\star}\u2225}_{\mathrm{F}}^{2}+{\u2225{\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}\mathit{R}-{\mathit{Y}}^{\star}\u2225}_{\mathrm{F}}^{2}.$$

[15]

Here, ${\mathcal{O}}^{r\times r}$ is the set of orthonormal matrices in ${\mathbb{R}}^{r\times r}$.
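Eq. **15** is an orthogonal Procrustes problem and admits the classical closed-form solution via a single SVD; a sketch:

```python
import numpy as np

def best_rotation(Xd, Yd, X_star, Y_star):
    """Solve Eq. 15: the orthonormal H minimizing
    ||Xd H - X*||_F^2 + ||Yd H - Y*||_F^2 is U V^T, where U S V^T is the
    SVD of Xd^T X* + Yd^T Y* (the orthogonal Procrustes solution)."""
    U, _, Vt = np.linalg.svd(Xd.T @ X_star + Yd.T @ Y_star)
    return U @ Vt
```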

Clearly, the above 2 inferential problems are tightly related: An accurate distributional characterization for the low-rank factors (Eq. **14**) often results in a distributional guarantee for the entries (Eq. **13**).

#### Distributional guarantees for low-rank factors.

We begin with our distributional characterizations of the low-rank factors. Here, ${\mathit{e}}_{i}$ denotes the $i$th standard basis vector in ${\mathbb{R}}^{n}$.

### Theorem 1.

*Suppose that the sample complexity meets* ${n}^{2}p\ge C{\kappa}^{8}{\mu}^{3}{r}^{2}n{\mathrm{log}}^{3}n$ *for some sufficiently large constant* $C>0$ *and that the noise obeys* $\sigma /{\sigma}_{min}\le c\sqrt{p/\left({\kappa}^{8}\mu rn{\mathrm{log}}^{2}n\right)}$ *for some sufficiently small constant* $c>0$*. Then one can write*

$${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{X}}^{\star}={\mathit{Z}}_{\mathit{X}}+{\mathbf{\Psi}}_{\mathit{X}},$$

[16a]

$${\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{Y}}^{\star}={\mathit{Z}}_{\mathit{Y}}+{\mathbf{\Psi}}_{\mathit{Y}},$$

[16b]

*with* $\left({\mathit{X}}^{\star},{\mathit{Y}}^{\star}\right)$ *defined in Eq.* **1**, $\left({\mathit{X}}^{\hspace{0.17em}\mathrm{d}},{\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}\right)$ *defined in Table 1, and* ${\mathit{H}}^{\hspace{0.17em}\mathrm{d}}$ *defined in Eq.* **15**. *Here, the rows of* ${\mathit{Z}}_{\mathit{X}}\in {\mathbb{R}}^{n\times r}$ (*resp.* ${\mathit{Z}}_{\mathit{Y}}\in {\mathbb{R}}^{n\times r}$) *are independent and obey*

$${\mathit{Z}}_{\mathit{X}}^{\top}{\mathit{e}}_{j}\stackrel{\mathrm{i}.\mathrm{i}.\mathrm{d}.}{\sim}\mathcal{N}\left(\mathbf{0},\frac{{\sigma}^{2}}{p}{\left({\mathbf{\Sigma}}^{\star}\right)}^{-1}\right),\hspace{1em}\mathrm{f}\mathrm{o}\mathrm{r}\hspace{1em}1\le j\le n;$$

[17a]

$${\mathit{Z}}_{\mathit{Y}}^{\top}{\mathit{e}}_{j}\stackrel{\mathrm{i}.\mathrm{i}.\mathrm{d}.}{\sim}\mathcal{N}\left(\mathbf{0},\frac{{\sigma}^{2}}{p}{\left({\mathbf{\Sigma}}^{\star}\right)}^{-1}\right),\hspace{1em}\mathrm{f}\mathrm{o}\mathrm{r}\hspace{1em}1\le j\le n.$$

[17b]

*In addition, the residual matrices* ${\mathbf{\Psi}}_{\mathit{X}},{\mathbf{\Psi}}_{\mathit{Y}}\in {\mathbb{R}}^{n\times r}$ *satisfy, with probability at least* $1-O\left({n}^{-3}\right)$, *that*

$$max\left\{{\u2225{\mathbf{\Psi}}_{\mathit{X}}\u2225}_{2,\infty},{\u2225{\mathbf{\Psi}}_{\mathit{Y}}\u2225}_{2,\infty}\right\}=o\left(\frac{\sigma \sqrt{r}}{\sqrt{p{\sigma}_{max}}}\right).$$

[18]

In words, *Theorem 1* decomposes the estimation error ${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{X}}^{\star}$ (resp. ${\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{Y}}^{\star}$) into a Gaussian component ${\mathit{Z}}_{\mathit{X}}$ (resp. ${\mathit{Z}}_{\mathit{Y}}$) and a residual term ${\mathbf{\Psi}}_{\mathit{X}}$ (resp. ${\mathbf{\Psi}}_{\mathit{Y}}$). If the sample size is sufficiently large and the noise size is sufficiently small, then the residual terms are much smaller in size compared to ${\mathit{Z}}_{\mathit{X}}$ and ${\mathit{Z}}_{\mathit{Y}}$. To see this, it is helpful to leverage the Gaussianity (Eq. **17a**) to compute that for each $1\le j\le n$, the $j$th row of ${\mathit{Z}}_{\mathit{X}}$ obeys

$$\mathbb{E}\left[{\u2225{\mathit{Z}}_{\mathit{X}}^{\top}{\mathit{e}}_{j}\u2225}_{2}^{2}\right]=\mathrm{Tr}\left(\frac{{\sigma}^{2}}{p}{\left({\mathbf{\Sigma}}^{\star}\right)}^{-1}\right)\ge \frac{{\sigma}^{2}r}{p{\sigma}_{max}};$$

in other words, the typical size of the $j$th row of ${\mathit{Z}}_{\mathit{X}}$ is no smaller than the order of $\sigma \sqrt{r/\left(p{\sigma}_{max}\right)}$. In comparison, the size of each row of ${\mathbf{\Psi}}_{\mathit{X}}$ (Eq. **18**) is much smaller than $\sigma \sqrt{r/\left(p{\sigma}_{max}\right)}$ (and hence smaller than the size of the corresponding row of ${\mathit{Z}}_{\mathit{X}}$) with high probability.

### Remark 1.

*Another interesting feature—which we make precise in the proof of Theorem 1—is that for any given* $1\le i,j\le n$*, the two random vectors* ${\mathit{Z}}_{\mathit{X}}^{\top}{\mathit{e}}_{i}$ *and* ${\mathit{Z}}_{\mathit{Y}}^{\top}{\mathit{e}}_{j}$ *are nearly statistically independent. This is crucial for deriving inferential guarantees for the entries of the matrix.*

#### Distributional guarantees for matrix entries.

Equipped with the above theory for low-rank factors and *Remark 1*, we are ready to characterize the distribution of ${M}_{ij}^{\hspace{0.17em}\mathrm{d}}-{M}_{ij}^{\star}$.

### Theorem 2.

*For each* $1\le i,j\le n$, *define the variance* ${v}_{ij}^{\star}$ *as*

$${v}_{ij}^{\star}\triangleq \frac{{\sigma}^{2}}{p}\left({\u2225{\mathit{U}}_{i,\cdot}^{\star}\u2225}_{2}^{2}+{\u2225{\mathit{V}}_{j,\cdot}^{\star}\u2225}_{2}^{2}\right),$$

[19]

*where* ${\mathit{U}}_{i,\cdot}^{\star}$ (*resp.* ${\mathit{V}}_{j,\cdot}^{\star}$) *denotes the* $i$*th* (*resp.* $j$*th*) *row of* ${\mathit{U}}^{\star}$ (*resp.* ${\mathit{V}}^{\star}$)*. Suppose that*

$$np\gtrsim {\kappa}^{8}{\mu}^{3}{r}^{3}{\mathrm{log}}^{3}n,\hspace{1em}\sigma \sqrt{\left({\kappa}^{8}\mu rn{\mathrm{log}}^{2}n\right)/p}\lesssim {\sigma}_{min},$$

[20a]

$$\mathit{and}\hspace{0.17em}{\u2225{\mathit{U}}_{i,\cdot}^{\star}\u2225}_{2}+{\u2225{\mathit{V}}_{j,\cdot}^{\star}\u2225}_{2}\gtrsim \sqrt{\frac{r}{n}}\frac{\sigma}{{\sigma}_{min}}\sqrt{\frac{{\kappa}^{6}{\mu}^{2}rn{\mathrm{log}}^{3}n}{p}}.$$

[20b]

*Then the matrix* ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$ *defined in Table 1 satisfies*

$${M}_{ij}^{\hspace{0.17em}\mathrm{d}}-{M}_{ij}^{\star}={g}_{ij}+{\mathrm{\Delta}}_{ij},$$

[21]

*where* ${g}_{ij}\sim \mathcal{N}\left(0,{v}_{ij}^{\star}\right)$ *and the residual obeys* $\left|{\mathrm{\Delta}}_{ij}\right|=o\left(\sqrt{{v}_{ij}^{\star}}\right)$ *with probability exceeding* $1-O\left({n}^{-3}\right)$.
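Under this Gaussian approximation, a two-sided confidence interval for an entry follows immediately. The sketch below (our own helper, not from the paper) uses the oracle variance of Eq. **19**, with the z-value hard-coded for the 95% level; *Corollary 1* replaces the oracle variance with a plug-in estimate.

```python
import numpy as np

Z_975 = 1.959964  # Phi^{-1}(0.975), for a two-sided 95% interval

def entry_ci_95(M_d, i, j, U_star, V_star, sigma, p):
    """95% confidence interval for M*_ij from the Gaussian approximation
    M^d_ij - M*_ij ~ N(0, v*_ij) of Theorem 2, with the oracle variance
    v*_ij of Eq. 19."""
    v_ij = (sigma ** 2 / p) * (np.linalg.norm(U_star[i]) ** 2
                               + np.linalg.norm(V_star[j]) ** 2)
    half = Z_975 * np.sqrt(v_ij)
    return M_d[i, j] - half, M_d[i, j] + half
```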

Several remarks are in order. First, we develop some intuition regarding where the formula ${v}_{ij}^{\star}$ comes from. By virtue of Assuming that the first-order expansion is tight, one hasAccording to Here, (i) relies on Eq.

*Theorem 1*, one has the following Gaussian approximation$${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{X}}^{\star}\approx {\mathit{Z}}_{\mathit{X}}\hspace{1em}\text{and}\hspace{1em}{\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{Y}}^{\star}\approx {\mathit{Z}}_{\mathit{Y}}.$$

$$\begin{array}{ll}\hfill & {M}_{ij}^{\hspace{0.17em}\mathrm{d}}-{M}_{ij}^{\star}={\left[{\mathit{X}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}{\left({\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}\right)}^{\top}-{\mathit{X}}^{\star}{\mathit{Y}}^{\star \top}\right]}_{ij}\hfill \\ \hfill & \approx {\mathit{e}}_{i}^{\top}\left({\mathit{X}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{X}}^{\star}\right){\mathit{Y}}^{\star \top}{\mathit{e}}_{j}+{\mathit{e}}_{i}^{\top}{\mathit{X}}^{\star}{\left({\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}{\mathit{H}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{Y}}^{\star}\right)}^{\top}{\mathit{e}}_{j}\hfill \\ \hfill & \approx {\mathit{e}}_{i}^{\top}{\mathit{Z}}_{\mathit{X}}{\mathit{Y}}^{\star \top}{\mathit{e}}_{j}+{\mathit{e}}_{i}^{\top}{\mathit{X}}^{\star}{\mathit{Z}}_{\mathit{Y}}^{\top}{\mathit{e}}_{j}.\hfill \end{array}$$

[22]

*Remark 1*, ${\mathit{Z}}_{\mathit{X}}^{\top}{\mathit{e}}_{i}$ and ${\mathit{Z}}_{\mathit{Y}}^{\top}{\mathit{e}}_{j}$ are nearly independent. One can thus compute the variance of Eq.**22**as$$\begin{array}{ccc}\hfill & \hfill & \mathrm{Var}\left({M}_{ij}^{\hspace{0.17em}\mathrm{d}}-{M}_{ij}^{\star}\right)\stackrel{\left(\text{i}\right)}{\approx}\mathrm{Var}\left({\mathit{e}}_{i}^{\top}{\mathit{Z}}_{\mathit{X}}{\mathit{Y}}^{\star \top}{\mathit{e}}_{j}\right)+\mathrm{Var}\left({\mathit{e}}_{i}^{\top}{\mathit{X}}^{\star}{\mathit{Z}}_{\mathit{Y}}^{\top}{\mathit{e}}_{j}\right)\hfill \\ \hfill & \hfill & \stackrel{\left(\text{ii}\right)}{=}{p}^{-1}{\sigma}^{2}\left\{{\mathit{e}}_{j}^{\top}{\mathit{Y}}^{\star}{\left({\mathbf{\Sigma}}^{\star}\right)}^{-1}{\mathit{Y}}^{\star \top}{\mathit{e}}_{j}+{\mathit{e}}_{i}^{\top}{\mathit{X}}^{\star}{\left({\mathbf{\Sigma}}^{\star}\right)}^{-1}{\mathit{X}}^{\star \top}{\mathit{e}}_{i}\right\}\hfill \\ \hfill & \hfill & \stackrel{\left(\text{iii}\right)}{=}{p}^{-1}{\sigma}^{2}\left({\u2225{\mathit{U}}_{i,\cdot}^{\star}\u2225}_{2}^{2}+{\u2225{\mathit{V}}_{j,\cdot}^{\star}\u2225}_{2}^{2}\right)={v}_{ij}^{\star}.\hfill \end{array}$$

Here, (i) comes from Eq. **22** and the near independence between ${\mathit{Z}}_{\mathit{X}}^{\top}{\mathit{e}}_{i}$ and ${\mathit{Z}}_{\mathit{Y}}^{\top}{\mathit{e}}_{j}$, (ii) uses the variance formula in *Theorem 1*, and (iii) arises from the definitions of ${\mathit{X}}^{\star}$ and ${\mathit{Y}}^{\star}$ (cf. Eq. **1**). This explains (heuristically) the variance formula ${v}_{ij}^{\star}$. Given that

*Theorem 2* reveals the tightness of the Gaussian approximation under the conditions in Eq. **20**, we can in turn construct nearly accurate confidence intervals for each matrix entry ${M}_{ij}^{\star}$. This is formally summarized in the following corollary. Here, $\left[a\pm b\right]$ denotes the interval $\left[a-b,a+b\right]$.

**Corollary 1** (Confidence Intervals for the Entries $\left\{{M}_{ij}^{\star}\right\}$).

*Let* ${\mathit{X}}^{\hspace{0.17em}\mathrm{d}}$, ${\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}$, *and* ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$ *be as defined in* *Table 1*. *For any given* $1\le i,j\le n$, *suppose that Eq.* **20a** *holds and that*

$${\Vert {\mathit{U}}_{i,\cdot}^{\star}\Vert}_{2}+{\Vert {\mathit{V}}_{j,\cdot}^{\star}\Vert}_{2}\gtrsim \sqrt{\frac{r}{n}}\frac{\sigma}{{\sigma}_{min}}\sqrt{\frac{{\kappa}^{10}{\mu}^{2}rn{\mathrm{log}}^{3}n}{p}}.$$

[23]

*Denote by* $\mathrm{\Phi}\left(t\right)$ *the CDF of a standard Gaussian random variable and by* ${\mathrm{\Phi}}^{-1}\left(\cdot \right)$ *its inverse function. Let*

$${v}_{ij}\triangleq \frac{{\sigma}^{2}}{p}\left({\mathit{X}}_{i,\cdot}^{\hspace{0.17em}\mathrm{d}}{\left({\mathit{X}}^{\mathrm{d}\top}{\mathit{X}}^{\hspace{0.17em}\mathrm{d}}\right)}^{-1}{\left({\mathit{X}}_{i,\cdot}^{\hspace{0.17em}\mathrm{d}}\right)}^{\top}+{\mathit{Y}}_{j,\cdot}^{\hspace{0.17em}\mathrm{d}}{\left({\mathit{Y}}^{\mathrm{d}\top}{\mathit{Y}}^{\hspace{0.17em}\mathrm{d}}\right)}^{-1}{\left({\mathit{Y}}_{j,\cdot}^{\hspace{0.17em}\mathrm{d}}\right)}^{\top}\right)$$

[24]

*be the empirical estimate of* ${v}_{ij}^{\star}$*. Then one has*

$$\underset{0<\alpha <1}{sup}\left|\mathbb{P}\left\{{M}_{ij}^{\star}\in \left[{M}_{ij}^{\hspace{0.17em}\mathrm{d}}\pm {\mathrm{\Phi}}^{-1}\left(1-\alpha /2\right)\sqrt{{v}_{ij}}\right]\right\}-\left(1-\alpha \right)\right|=o\left(1\right).$$

In words, $\left[{M}_{ij}^{\hspace{0.17em}\mathrm{d}}\pm {\mathrm{\Phi}}^{-1}\left(1-\alpha /2\right)\sqrt{{v}_{ij}}\right]$ is a nearly accurate $\left(1-\alpha \right)$ confidence interval for ${M}_{ij}^{\star}$.
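To make the construction concrete, here is a minimal sketch, assuming (as in Table 1 and the derivation of Eq. **22**) that the debiased point estimate satisfies ${M}_{ij}^{\,\mathrm{d}}=({\mathit{X}}^{\,\mathrm{d}}{\mathit{Y}}^{\mathrm{d}\top})_{ij}$ since the rotation ${\mathit{H}}^{\,\mathrm{d}}$ cancels; the function name `entry_ci` and the use of `statistics.NormalDist` for ${\mathrm{\Phi}}^{-1}$ are our own choices, not the paper's:

```python
import numpy as np
from statistics import NormalDist

def entry_ci(Xd, Yd, i, j, sigma, p, alpha=0.05):
    """Nearly accurate (1 - alpha) confidence interval for M*_ij
    (Corollary 1), using the plug-in variance v_ij of Eq. 24."""
    m_d = Xd[i] @ Yd[j]  # debiased point estimate (X^d Y^d^T)_ij
    # v_ij = (sigma^2 / p) * (X^d_i (X^d^T X^d)^{-1} X^d_i^T
    #                         + Y^d_j (Y^d^T Y^d)^{-1} Y^d_j^T)
    v_ij = (sigma**2 / p) * (
        Xd[i] @ np.linalg.solve(Xd.T @ Xd, Xd[i])
        + Yd[j] @ np.linalg.solve(Yd.T @ Yd, Yd[j])
    )
    half = NormalDist().inv_cdf(1 - alpha / 2) * np.sqrt(v_ij)
    return m_d - half, m_d + half
```

The interval is symmetric about ${M}_{ij}^{\,\mathrm{d}}$ by construction, with half-width ${\mathrm{\Phi}}^{-1}\left(1-\alpha /2\right)\sqrt{{v}_{ij}}$.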

*Corollary 1* tells us that for any fixed significance level $0<\alpha <1$, the interval
$$\left[{M}_{ij}^{\hspace{0.17em}\mathrm{d}}\pm {\mathrm{\Phi}}^{-1}\left(1-\alpha /2\right)\sqrt{{v}_{ij}}\right]$$

[25]

covers the truth ${M}_{ij}^{\star}$ with probability approximately $1-\alpha$.

In addition, we remark that when ${\Vert {\mathit{U}}_{i,\cdot}^{\star}\Vert}_{2}={\Vert {\mathit{V}}_{j,\cdot}^{\star}\Vert}_{2}=0$ (and hence ${v}_{ij}^{\star}=0$), the above Gaussian approximation is completely off. In this case, one can still leverage *Theorem 1* to show that
$${M}_{ij}^{\hspace{0.17em}\mathrm{d}}-{M}_{ij}^{\star}={M}_{ij}^{\hspace{0.17em}\mathrm{d}}\approx {\mathit{u}}^{\top}\mathit{v},$$

[26]

where $\mathit{u},\mathit{v}\in {\mathbb{R}}^{r}$ are independent and identically distributed according to $\mathcal{N}\left(\mathbf{0},{\sigma}^{2}{\left({\mathbf{\Sigma}}^{\star}\right)}^{-1}/p\right)$. However, it is nontrivial to determine whether ${\Vert {\mathit{U}}_{i,\cdot}^{\star}\Vert}_{2}+{\Vert {\mathit{V}}_{j,\cdot}^{\star}\Vert}_{2}$ is vanishingly small or not based on the observed data, which makes it challenging to conduct efficient inference for entries with small (but a priori unknown) ${\Vert {\mathit{U}}_{i,\cdot}^{\star}\Vert}_{2}+{\Vert {\mathit{V}}_{j,\cdot}^{\star}\Vert}_{2}$.

Last but not least, the careful reader might wonder how to interpret our conditions on the sample complexity and the signal-to-noise ratio. Take the case with $r,\mu ,\kappa =O\left(1\right)$ for example: our conditions read
$${n}^{2}p\gtrsim n{\mathrm{log}}^{3}n;\hspace{1em}\sigma /{\sigma}_{min}\lesssim \sqrt{p/\left(n{\mathrm{log}}^{2}n\right)}.$$

[27]

The first condition matches the sample complexity limit (up to some log factor), while the second one coincides with the regime (up to log factor) in which popular algorithms (like spectral methods or nonconvex algorithms) work better than a random guess (7, 10, 11). The take-away message is this: once we are able to compute a reasonable estimate in an overall ${\ell}_{2}$ sense, we can reinforce it to conduct entrywise inference in a statistically efficient fashion.

### Lower Bounds and Optimality for Inference.

It is natural to ask how well our inferential procedures perform compared to other algorithms. Encouragingly, the debiased estimator is optimal in some sense; for instance, it attains the minimum covariance among all unbiased estimators. To formalize this claim, we 1) quantify the performance of 2 ideal estimators with the assistance of an oracle and 2) demonstrate that the performance of our debiased estimators is arbitrarily close to that of the ideal estimators. We only remark on such results in passing here; see *SI Appendix* for precise statements. Below, we denote by ${\mathit{X}}_{i,\cdot}^{\star}$ (resp. ${\mathit{Y}}_{i,\cdot}^{\star}$) the $i$th row of ${\mathit{X}}^{\star}$ (resp. ${\mathit{Y}}^{\star}$).

#### Lower bound for estimating ${\mathit{X}}_{i,\cdot}^{\star}\hspace{0.17em}\left(1\le i\le n\right)$.

Suppose there is an oracle informing us of ${\mathit{Y}}^{\star}$, and we observe the same set of data as in Eq. **3**. Under such an idealistic setting and under our sample complexity condition, one has, with high probability, that any unbiased estimator ${\hat{\mathit{X}}}_{i,\cdot}$ of ${\mathit{X}}_{i,\cdot}^{\star}$ satisfies
$$\mathrm{Cov}\left({\hat{\mathit{X}}}_{i,\cdot}-{\mathit{X}}_{i,\cdot}^{\star}\mid \mathrm{\Omega}\right)\succeq \left(1-o\left(1\right)\right)\hspace{0.17em}{p}^{-1}{\sigma}^{2}{\left({\mathbf{\Sigma}}^{\star}\right)}^{-1}.$$
This reveals that the covariance of the estimator ${\mathit{X}}_{i,\cdot}^{\hspace{0.17em}\mathrm{d}}$ (cf. *Theorem 1*) attains the Cramér–Rao lower bound with high probability. The same conclusion applies to ${\mathit{Y}}_{j,\cdot}^{\hspace{0.17em}\mathrm{d}}$ too.

#### Lower bound for estimating ${M}_{ij}^{\star}$ ($1\le i,j\le n$).

Suppose there is another oracle informing us of ${\left\{{\mathit{X}}_{k,\cdot}^{\star}\right\}}_{k:k\ne i}$ and ${\left\{{\mathit{Y}}_{k,\cdot}^{\star}\right\}}_{k:k\ne j}$, that is, everything about ${\mathit{X}}^{\star}$ except ${\mathit{X}}_{i,\cdot}^{\star}$ and everything about ${\mathit{Y}}^{\star}$ except ${\mathit{Y}}_{j,\cdot}^{\star}$. In addition, we observe the same set of data as in Eq.

**3**, except that we do not get to see ${M}_{ij}$.^{¶} Under this idealistic model, one can show that with high probability, any unbiased estimator of ${M}_{ij}^{\star}$ must have variance no smaller than $\left(1-o\left(1\right)\right){v}_{ij}^{\star}$, where ${v}_{ij}^{\star}$ is defined in *Theorem 2*. This indicates that the variance of our debiased estimator ${M}_{ij}^{\hspace{0.17em}\mathrm{d}}$ (cf. *Theorem 2*)—which certainly does not have access to the side information provided by the oracle—is arbitrarily close to the Cramér–Rao lower bound aided by an oracle.

### Back to Estimation: The Debiased Estimator Is Optimal.

While the emphasis herein is on inference, we nevertheless single out an important consequence that informs the estimation step. To be specific, the distributional guarantees derived in

*Theorems 1* and *2* allow us to track the estimation accuracy of ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$, as stated below.

### Theorem 3 (Estimation Accuracy of ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$).

*Let* ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$ *be the debiased estimator as defined in* *Table 1*. *Instate the conditions in Eq.* **20a**. *Then with probability at least* $1-O\left({n}^{-3}\right)$, *one has*

$${\Vert {\mathit{M}}^{\hspace{0.17em}\mathrm{d}}-{\mathit{M}}^{\star}\Vert}_{\mathrm{F}}^{2}=\left(2+o\left(1\right)\right)\hspace{0.17em}nr{\sigma}^{2}/p.$$

[28]
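As a quick sanity check on the constant 2, one can simulate the first-order expansion ${\mathit{M}}^{\,\mathrm{d}}-{\mathit{M}}^{\star}\approx {\mathit{Z}}_{\mathit{X}}{\mathit{Y}}^{\star\top}+{\mathit{X}}^{\star}{\mathit{Z}}_{\mathit{Y}}^{\top}$ underlying *Theorem 1*, with rows of ${\mathit{Z}}_{\mathit{X}},{\mathit{Z}}_{\mathit{Y}}$ drawn i.i.d. from $\mathcal{N}\left(\mathbf{0},{\sigma}^{2}{\left({\mathbf{\Sigma}}^{\star}\right)}^{-1}/p\right)$, and compare the squared Frobenius error against $2nr{\sigma}^{2}/p$. The problem sizes and singular values below are illustrative choices, not those of the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p, sigma = 300, 3, 0.5, 1.0

# Planted factors: M* = X* Y*^T with X* = U (Sigma*)^{1/2}, Y* = V (Sigma*)^{1/2}.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
svals = np.array([2.0, 1.5, 1.0])          # diagonal of Sigma* (illustrative)
X_star, Y_star = U * np.sqrt(svals), V * np.sqrt(svals)

# Rows of Z_X, Z_Y ~ N(0, sigma^2 (Sigma*)^{-1} / p): since Sigma* is
# diagonal here, column k is simply scaled by sqrt(sigma^2 / (p * svals[k])).
col_scale = np.sqrt(sigma**2 / (p * svals))
Z_X = rng.standard_normal((n, r)) * col_scale
Z_Y = rng.standard_normal((n, r)) * col_scale

# First-order error and Theorem 3's sharp benchmark 2 n r sigma^2 / p.
E = Z_X @ Y_star.T + X_star @ Z_Y.T
ratio = np.linalg.norm(E, "fro")**2 / (2 * n * r * sigma**2 / p)
```

In expectation, each of the two terms in `E` contributes $nr{\sigma}^{2}/p$ to the squared Frobenius norm, so `ratio` concentrates around 1.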

In stark contrast to prior statistical estimation guarantees (e.g., refs. 4–6 and 13),

*Theorem 3* pins down the estimation error of the proposed debiased estimator in a sharp manner (namely, even the preconstant is fully determined). Encouragingly, there is a sense in which the proposed debiased estimator achieves the best possible statistical estimation accuracy. In fact, a lower bound has already been derived in ref. 4, section III.B, asserting that one cannot beat the mean-square estimation error $\left(2-o\left(1\right)\right)\hspace{0.17em}nr{\sigma}^{2}/p$ even with the help of an oracle. See *SI Appendix* for a precise statement.

The implication of

*Theorems 1* to *3* is remarkable: the debiasing step not merely facilitates uncertainty assessment, but also proves crucial in minimizing estimation errors. It achieves optimal statistical efficiency in terms of both the rate and the preconstant; in other words, a polynomial-time algorithm matches the statistical limit even in the preconstant. This intriguing finding is further corroborated by numerical experiments (see Fig. 2).

### Numerical Experiments.

We conduct numerical experiments on synthetic data to verify the distributional characterizations provided in

*Theorem 2*. The verification of *Theorem 1* is left to *SI Appendix*. Note that our main results hold for the debiased estimators built upon ${\mathit{Z}}^{\mathrm{cvx}}$ and ${\mathit{X}}^{\hspace{0.17em}\mathrm{ncvx}}{\mathit{Y}}^{\mathrm{ncvx}\top}$. As we formalize in *SI Appendix*, these 2 debiased estimators are extremely close to each other. Therefore, to save space, we use the debiased estimator built upon the convex estimate ${\mathit{Z}}^{\mathrm{cvx}}$ throughout the experiments.

Fix the dimension $n=\mathrm{1,000}$ and the regularization parameter $\lambda =2.5\sigma \sqrt{np}$ throughout the experiments. We generate a rank-$r$ matrix ${\mathit{M}}^{\star}={\mathit{X}}^{\star}{\mathit{Y}}^{\star \top}$, where ${\mathit{X}}^{\star},{\mathit{Y}}^{\star}\in {\mathbb{R}}^{n\times r}$ are random orthonormal matrices, and apply the proximal gradient method to solve the convex program Eq.

**5**. Denote ${S}_{ij}\triangleq {v}_{ij}^{-1/2}\left({M}_{ij}^{\hspace{0.17em}\mathrm{d}}-{M}_{ij}^{\star}\right)$, where ${v}_{ij}$ is the empirical variance defined in Eq. **24**. In view of the $95\%$ confidence interval predicted by *Corollary 1*, for each $\left(i,j\right)$, we define ${\hat{\mathrm{Cov}}}_{\mathrm{E},\left(i,j\right)}$ to be the empirical coverage rate of ${M}_{ij}^{\star}$ over 200 Monte Carlo simulations. Correspondingly, denote by $\mathrm{Mean}\left({\hat{\mathrm{Cov}}}_{\mathrm{E}}\right)$ (resp. $\mathrm{Std}\left({\hat{\mathrm{Cov}}}_{\mathrm{E}}\right)$) the average (resp. the SD) of ${\hat{\mathrm{Cov}}}_{\mathrm{E},\left(i,j\right)}$ over indexes $1\le i,j\le n$. Table 2 gathers the empirical coverage rates for ${M}_{ij}^{\star}$, and Fig. 1 displays the quantile–quantile (Q-Q) plots of ${S}_{11}$ and ${S}_{12}$ vs. the standard Gaussian random variable over 200 Monte Carlo trials for $r=5$, $p=0.4$, and $\sigma =1{0}^{-3}$. It is evident that the distribution of ${S}_{ij}$ matches that of $\mathcal{N}\left(0,1\right)$ reasonably well.

Table 2.

$\left(r,p,\sigma \right)$ | $\mathrm{Mean}\left({\hat{\mathrm{Cov}}}_{\mathrm{E}}\right)$ | $\mathrm{Std}\left({\hat{\mathrm{Cov}}}_{\mathrm{E}}\right)$ |
---|---|---|
$\left(2,0.2,{10}^{-6}\right)$ | 0.9380 | 0.0200 |
$\left(2,0.2,{10}^{-3}\right)$ | 0.9392 | 0.0196 |
$\left(2,0.4,{10}^{-6}\right)$ | 0.9455 | 0.0164 |
$\left(2,0.4,{10}^{-3}\right)$ | 0.9456 | 0.0164 |
$\left(5,0.2,{10}^{-6}\right)$ | 0.9226 | 0.0247 |
$\left(5,0.2,{10}^{-3}\right)$ | 0.9271 | 0.0228 |
$\left(5,0.4,{10}^{-6}\right)$ | 0.9410 | 0.0173 |
$\left(5,0.4,{10}^{-3}\right)$ | 0.9417 | 0.0172 |

Fig. 1.
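The bookkeeping behind $\mathrm{Mean}\left({\hat{\mathrm{Cov}}}_{\mathrm{E}}\right)$ and $\mathrm{Std}\left({\hat{\mathrm{Cov}}}_{\mathrm{E}}\right)$ can be sketched as follows, assuming for illustration that the standardized scores ${S}_{ij}$ are exactly $\mathcal{N}\left(0,1\right)$; in the real experiment they instead come from 200 runs of the debiased estimator, and the entry count below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_entries = 200, 1000  # 200 Monte Carlo runs per entry
z = 1.959964                     # Phi^{-1}(1 - alpha/2) for alpha = 0.05

# Idealized standardized scores S_ij ~ N(0, 1) (Theorem 2's approximation);
# entry (i, j) is covered in a trial iff |S_ij| <= z.
S = rng.standard_normal((n_trials, n_entries))
cov_per_entry = (np.abs(S) <= z).mean(axis=0)  # analogue of Cov_E,(i,j)

mean_cov = cov_per_entry.mean()  # analogue of Mean(Cov_E)
std_cov = cov_per_entry.std()    # analogue of Std(Cov_E)
```

Note that even under a perfect Gaussian approximation, the per-entry SD is about $\sqrt{0.95\cdot 0.05/200}\approx 0.015$, which is roughly the magnitude of the Std column in Table 2.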

In addition to the tractable distributional guarantees, the debiased estimator ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$ also exhibits superior estimation accuracy compared to the original estimator ${\mathit{Z}}^{\mathrm{cvx}}$ (cf.

*Theorem 3*). Fig. 2 reports the estimation error of ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$ vs. ${\mathit{Z}}^{\mathrm{cvx}}$, measured in both the Frobenius norm and the ${\ell}_{\infty}$ norm, across different noise levels. The results are averaged over 20 Monte Carlo simulations for $r=5$ and $p=0.2$. It can be seen that the errors of the debiased estimator are uniformly smaller than those of the original estimator and are much closer to the oracle lower bound. As a result, we recommend using ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$ even for the purpose of estimation.

Fig. 2.

We conclude this section with experiments on real data. Similar to ref. 4, we use the daily temperature data (31) for 1,400 stations across the world in 2018, which results in a $\mathrm{1,400}\times 365$ data matrix. Inspection of the singular values reveals that the data matrix is nearly low rank. We vary the observation probability $p$ from 0.5 to 0.9 and randomly subsample the data accordingly. Based on the observed temperatures, we then apply the proposed methodology to obtain $95\%$ confidence intervals for all of the entries. Table 3 reports the empirical coverage probabilities and the average length of the confidence intervals as well as the estimation error of both ${\mathit{Z}}^{\mathrm{cvx}}$ and ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$ over 20 independent experiments. It can be seen that the average coverage probabilities are reasonably close to $95\%$ and the confidence intervals are also quite short. In addition, the estimation error of ${\mathit{M}}^{\hspace{0.17em}\mathrm{d}}$ is smaller than that of ${\mathit{Z}}^{\mathrm{cvx}}$, which corroborates our theoretical prediction. The discrepancy between the nominal coverage probability and the actual one might arise from the facts that 1) the underlying true temperature matrix is only approximately low rank and 2) the noise in the temperature might not be independent.

Table 3.

**Empirical coverage rates and average lengths of the confidence intervals of the entries as well as the estimation error vs. observation probability**$p$

$p$ | Coverage, mean | Coverage, SD | CI length, mean | CI length, SD | ${\Vert {\mathbf{Z}}^{\mathrm{cvx}}-{\mathbf{M}}^{\star}\Vert}_{\mathrm{F}}/{\Vert {\mathbf{M}}^{\star}\Vert}_{\mathrm{F}}$ | ${\Vert {\mathbf{M}}^{\hspace{0.17em}\mathrm{d}}-{\mathbf{M}}^{\star}\Vert}_{\mathrm{F}}/{\Vert {\mathbf{M}}^{\star}\Vert}_{\mathrm{F}}$ |
---|---|---|---|---|---|---|
0.5 | 0.8265 | 0.0016 | 3.6698 | 0.0209 | 0.029 | 0.028 |
0.6 | 0.8268 | 0.0011 | 2.8774 | 0.0098 | 0.025 | 0.023 |
0.7 | 0.8431 | 0.0006 | 2.3426 | 0.0054 | 0.022 | 0.019 |
0.8 | 0.8725 | 0.0003 | 2.0234 | 0.0052 | 0.020 | 0.015 |
0.9 | 0.9093 | 0.0003 | 1.8296 | 0.0072 | 0.018 | 0.011 |

The results are averaged over 20 Monte Carlo trials.

## Discussion

The present paper makes progress toward inference and uncertainty quantification for noisy matrix completion, by developing simple debiased estimators that admit tractable and accurate distributional characterizations. While we have achieved some early success in accomplishing this, our results are likely suboptimal in the dependency on the rank $r$ and the condition number $\kappa $. Also, our theory operates under the moderate-to-high signal-to-noise ratio (SNR) regime, where ${\sigma}_{min}^{2}/{\sigma}^{2}$ (which is proportional to the SNR) is required to exceed the order of $n/p$; see the conditions in

*Theorem 1*. How to conduct inference in the low SNR regime is an important future direction.

More broadly, this paper uncovers that computational feasibility and full statistical efficiency can sometimes be simultaneously achieved despite a high degree of nonconvexity. The analysis and insights herein might shed light on inference for a broader class of nonconvex statistical problems.

## Acknowledgments

Y.C. is supported in part by the Air Force Office of Scientific Research Young Investigator Program award FA9550-19-1-0030, by the Office of Naval Research (ONR) grant N00014-19-1-2120, by the Army Research Office grant W911NF-18-1-0303, by the NSF grants CCF-1907661 and IIS-1900140, and by the Princeton School of Engineering and Applied Science innovation award. J.F. is supported in part by NSF grants DMS-1662139 and DMS-1712591, ONR grant N00014-19-1-2120, and NIH grant 2R01-GM072611-13. C.M. is supported in part by Hudson River Trading AI Labs (HAIL) Fellowship. This work was done in part while Y.C. was visiting the Kavli Institute for Theoretical Physics (supported in part by the NSF grant PHY-1748958). We thank Weijie Su for helpful discussions.

## Supporting Information

Appendix (PDF)


## References

1. N. Srebro, "Learning with matrix factorizations," PhD thesis, Massachusetts Institute of Technology, Cambridge, MA (2004).
2. E. Candès, B. Recht, Exact matrix completion via convex optimization. *Found. Comput. Math.* **9**, 717–772 (2009).
3. R. H. Keshavan, A. Montanari, S. Oh, Matrix completion from a few entries. *IEEE Trans. Inf. Theory* **56**, 2980–2998 (2010).
4. E. Candès, Y. Plan, Matrix completion with noise. *Proc. IEEE* **98**, 925–936 (2010).
5. S. Negahban, M. Wainwright, Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. *J. Mach. Learn. Res.* **13**, 1665–1697 (2012).
6. V. Koltchinskii, K. Lounici, A. B. Tsybakov, Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. *Ann. Stat.* **39**, 2302–2329 (2011).
7. R. H. Keshavan, A. Montanari, S. Oh, Matrix completion from noisy entries. *J. Mach. Learn. Res.* **11**, 2057–2078 (2010).
8. D. Gross, Recovering low-rank matrices from few coefficients in any basis. *IEEE Trans. Inf. Theory* **57**, 1548–1566 (2011).
9. R. Foygel, N. Srebro, "Concentration-based guarantees for low-rank matrix reconstruction" in *Conference on Learning Theory*, S. M. Kakade, U. von Luxburg, Eds. (Proceedings of Machine Learning Research, Budapest, Hungary, 2011), pp. 315–340.
10. Y. Chen, M. J. Wainwright, Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv:1509.03025 (10 September 2015).
11. C. Ma, K. Wang, Y. Chi, Y. Chen, Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. *Found. Comput. Math.* (2019).
12. A. Carpentier, O. Klopp, M. Löffler, R. Nickl, Adaptive confidence sets for matrix completion. *Bernoulli* **24**, 2429–2460 (2018).
13. Y. Chen, Y. Chi, J. Fan, C. Ma, Y. Yan, Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. arXiv:1902.07698 (20 February 2019).
14. C. H. Zhang, S. S. Zhang, Confidence intervals for low dimensional parameters in high dimensional linear models. *J. R. Stat. Soc. Ser. B* **76**, 217–242 (2014).
15. A. Javanmard, A. Montanari, Confidence intervals and hypothesis testing for high-dimensional regression. *J. Mach. Learn. Res.* **15**, 2869–2909 (2014).
16. R. Lockhart, J. Taylor, R. J. Tibshirani, R. Tibshirani, A significance test for the lasso. *Ann. Stat.* **42**, 413–468 (2014).
17. S. van de Geer, P. Bühlmann, Y. Ritov, R. Dezeure, On asymptotically optimal confidence regions and tests for high-dimensional models. *Ann. Stat.* **42**, 1166–1202 (2014).
18. T. T. Cai, Z. Guo, Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. *Ann. Stat.* **45**, 615–646 (2017).
19. Y. Ning, H. Liu, A general theory of hypothesis tests and confidence regions for sparse high dimensional models. *Ann. Stat.* **45**, 158–195 (2017).
20. J. Jankova, S. van de Geer, Confidence intervals for high-dimensional inverse covariance estimation. *Electron. J. Stat.* **9**, 1205–1229 (2015).
21. Z. Ren, T. Sun, C. Zhang, H. Zhou, Asymptotic normality and optimalities in estimation of large Gaussian graphical models. *Ann. Stat.* **43**, 991–1026 (2015).
22. B. Recht, M. Fazel, P. A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. *SIAM Rev.* **52**, 471–501 (2010).
23. N. Srebro, A. Shraibman, "Rank, trace-norm and max-norm" in *International Conference on Computational Learning Theory*, P. Auer, R. Meir, Eds. (Springer, Berlin, Germany, 2005), pp. 545–560.
24. R. Mazumder, T. Hastie, R. Tibshirani, Spectral regularization algorithms for learning large incomplete matrices. *J. Mach. Learn. Res.* **11**, 2287–2322 (2010).
25. R. Sun, Z. Q. Luo, Guaranteed matrix completion via non-convex factorization. *IEEE Trans. Inf. Theory* **62**, 6535–6579 (2016).
26. Y. Chi, Y. M. Lu, Y. Chen, Nonconvex optimization meets low-rank matrix factorization: An overview. *IEEE Trans. Signal Process.* **67**, 5239–5269 (2019).
27. A. M. C. So, Y. Ye, Theory of semidefinite programming for sensor network localization. *Math. Program.* **109**, 367–384 (2007).
28. E. Abbe, J. Fan, K. Wang, Y. Zhong, Entrywise eigenvector analysis of random matrices with low expected rank. arXiv:1709.09565 (2 May 2017).
29. A. Singer, Angular synchronization by eigenvectors and semidefinite programming. *Appl. Comput. Harmon. Anal.* **30**, 20–36 (2011).
30. J. Fan, Y. Liao, M. Mincheva, Large covariance estimation by thresholding principal orthogonal complements. *J. R. Stat. Soc. Ser. B* **75**, 603–680 (2013).
31. National Climatic Data Center. https://www.ncdc.noaa.gov/. Accessed 31 August 2019.

## Information & Authors

### Information


#### Copyright

Copyright © 2019 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

#### Submission history

**Published online**: October 30, 2019

**Published in issue**: November 12, 2019



#### Notes

This article is a PNAS Direct Submission.

†

We restrict attention to square matrices for simplicity of presentation. Most findings extend immediately to the more general rectangular case ${\mathit{M}}^{\star}\in {\mathbb{R}}^{{n}_{1}\times {n}_{2}}$ with different ${n}_{1}$ and ${n}_{2}$.

‡

The true rank $r$ can often be reliably estimated in a data-dependent manner. For instance, according to ref. 13, theorem 1, one can employ the rank estimator $\widehat{r}\triangleq \min\left\{i:{\sigma}_{i+1}\left(\mathit{Z}\right)/{\sigma}_{i}\left(\mathit{Z}\right)\le {n}^{-1/2}\right\}$, which recovers the true rank with high probability under our assumptions.
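A minimal sketch of this singular-value-ratio rule (the function name is ours, and we assume the ratio threshold ${n}^{-1/2}$ stated above):

```python
import numpy as np

def estimate_rank(Z):
    """Rank estimate via the ratio rule of ref. 13: the smallest i
    such that sigma_{i+1}(Z) / sigma_i(Z) <= n^{-1/2}."""
    n = min(Z.shape)
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, descending
    for i in range(len(s) - 1):
        if s[i + 1] / s[i] <= n ** -0.5:
            return i + 1  # ranks are 1-indexed
    return len(s)  # no sharp drop detected
```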

§

For any $r\times r$ rotation matrix $\mathit{H}$, we cannot possibly distinguish $\left({\mathit{X}}^{\star},{\mathit{Y}}^{\star}\right)$ from $\left({\mathit{X}}^{\star}\mathit{H},{\mathit{Y}}^{\star}\mathit{H}\right)$, if only pairwise measurements are available.

¶

The exclusion of ${M}_{ij}$ is merely for ease of presentation. With a slightly more complicated argument, one can handle the model where all ${M}_{ij}$ with $\left(i,j\right)\in \mathrm{\Omega}$ are observed.

### Authors

#### Competing Interests

The authors declare no competing interest.

