Inference and uncertainty quantification for noisy matrix completion

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved October 8, 2019 (received for review June 11, 2019)
October 30, 2019
116 (46) 22931-22937

Significance

Matrix completion finds numerous applications in data science, ranging from information retrieval to medical imaging. While substantial progress has been made in designing estimation algorithms, it remains unknown how to perform optimal statistical inference on the unknown matrix given the obtained estimates—a task at the core of modern decision making. We propose procedures to debias the popular convex and nonconvex estimators and derive distributional characterizations for the resulting debiased estimators. This distributional theory enables valid inference on the unknown matrix. Our procedures 1) yield optimal construction of confidence intervals for missing entries and 2) achieve optimal estimation accuracy in a sharp manner.

Abstract

Noisy matrix completion aims at estimating a low-rank matrix given only partial and corrupted entries. Despite remarkable progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained estimates and how to perform efficient statistical inference on the unknown matrix (e.g., constructing a valid and short confidence interval for an unseen entry). This paper takes a substantial step toward addressing such tasks. We develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting debiased estimators admit nearly precise nonasymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals/regions for, say, the missing entries and the low-rank factors. Our inferential procedures do not require sample splitting, thus avoiding unnecessary loss of data efficiency. As a byproduct, we obtain a sharp characterization of the estimation accuracy of our debiased estimators in both rate and constant. Our debiased estimators are tractable algorithms that provably achieve full statistical efficiency.
Low-rank matrix completion is concerned with recovering a low-rank matrix when only a small fraction of its entries are revealed (1–3). The importance of this problem cannot be overstated, due to its broad applications in, e.g., recommendation systems, sensor network localization, magnetic resonance imaging, computer vision, large covariance estimation, and latent factor learning. Tackling this problem in large-scale applications is computationally challenging, owing to the intrinsic nonconvexity incurred by the low-rank structure. To further complicate matters, another inevitable challenge stems from the imperfection of data acquisition mechanisms, wherein the acquired samples are usually contaminated by a certain amount of noise.
Fortunately, if the entries of the unknown matrix are sufficiently delocalized and randomly revealed, this problem may not be as hard as it seems. Substantial progress has been made over the past several years in designing computationally tractable algorithms—including both convex and nonconvex approaches—that allow one to fill in unseen entries faithfully from partial noisy samples (4–13). Nevertheless, modern decision making often requires one step further. It not only demands a faithful estimate but also seeks to quantify the uncertainty or "confidence" of the provided estimate, ideally in a reasonably accurate fashion. For instance, given an estimate returned by the convex approach, how does one use it to identify a short interval that is likely to contain a missing entry?
Conducting effective uncertainty quantification for noisy matrix completion is, however, far from straightforward. For the most part, state-of-the-art matrix completion algorithms require solving highly complex optimization problems, which often do not admit closed-form solutions. As a consequence, it is extremely challenging to pin down the distributions of the estimates returned by these algorithms. The lack of distributional characterizations presents a major roadblock to performing valid, yet efficient, statistical inference on the unknown matrix of interest.
It is worth noting that a number of recent papers have been dedicated to inference and uncertainty quantification for various high-dimensional problems, including the Lasso (14–18), generalized linear models (17, 19), and graphical models (20, 21), among others. Very little work, however, has looked into noisy matrix completion along this direction. While nonasymptotic statistical guarantees for noisy matrix completion have been derived in prior theory, the existing estimation error bounds are supplied only at an order-wise level. Such order-wise error bounds either lose a significant factor relative to the optimal guarantees or come with an unspecified (but often enormous) preconstant. Viewed in this light, a confidence region constructed directly based on such results is bound to be overly conservative, resulting in substantial overcoverage.

A Glimpse of Our Main Contributions

This paper takes a substantial step toward statistically optimal inference and uncertainty quantification for noisy matrix completion. Specifically, we develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting debiased estimators admit nearly precise nonasymptotic distributional guarantees. Such distributional characterizations in turn allow us to reason about the uncertainty of the obtained estimates vis-à-vis the unknown matrix. For instance, we can construct 1) confidence intervals for each entry—either observed or missing—of the unknown matrix and 2) confidence regions for the low-rank factors of interest (modulo some global ambiguity), both of which are provably optimal. As a byproduct, we characterize the Euclidean estimation errors of the proposed debiased estimators, which achieve full statistical efficiency (including the preconstant). This theory demonstrates that a computationally feasible algorithm can attain the best possible statistical estimation accuracy for noisy matrix completion.

Models and Notation

To cast noisy matrix completion in concrete statistical settings, we adopt a model commonly studied in the literature.

Ground Truth.

We are interested in estimating an unknown rank-$r$ matrix $M^{\star}\in\mathbb{R}^{n\times n}$, whose rank-$r$ singular value decomposition (SVD) is given by $M^{\star}=U^{\star}\Sigma^{\star}V^{\star\top}$. For convenience, let $X^{\star}\triangleq U^{\star}(\Sigma^{\star})^{1/2}\in\mathbb{R}^{n\times r}$ and $Y^{\star}\triangleq V^{\star}(\Sigma^{\star})^{1/2}\in\mathbb{R}^{n\times r}$ be the balanced low-rank factors of $M^{\star}$, which obey
$$X^{\star\top}X^{\star}=Y^{\star\top}Y^{\star}=\Sigma^{\star}\qquad\text{and}\qquad M^{\star}=X^{\star}Y^{\star\top}.$$
[1]
Denote by $\sigma_{i}(M^{\star})$ the $i$th largest singular value of $M^{\star}$. Set
$$\sigma_{\max}\triangleq\sigma_{1}(M^{\star}),\qquad\sigma_{\min}\triangleq\sigma_{r}(M^{\star}),\qquad\text{and}\qquad\kappa\triangleq\sigma_{\max}/\sigma_{\min}.$$
[2]

Observation Models.

What we observe is a randomly subsampled and corrupted subset of the entries of $M^{\star}$; namely,
$$M_{ij}=M^{\star}_{ij}+E_{ij},\qquad E_{ij}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,\sigma^{2}),\qquad\text{for all }(i,j)\in\Omega,$$
[3]
where $\Omega\subseteq\{1,\ldots,n\}\times\{1,\ldots,n\}$ is a small set of indexes, and $E_{ij}$ denotes independently generated noise at location $(i,j)$. From now on, we assume the random sampling model where each index $(i,j)$ is included in $\Omega$ independently with probability $p$ (i.e., data are missing uniformly at random). We use $\mathcal{P}_{\Omega}(\cdot)$ to represent the orthogonal projection onto the subspace of matrices that vanish outside the index set $\Omega$.
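To make the sampling model concrete, the following is a minimal NumPy sketch (not taken from the paper's code; all names and default values are illustrative) that draws a random rank-$r$ ground truth, reveals each entry independently with probability $p$, and corrupts the revealed entries with i.i.d. Gaussian noise.

```python
import numpy as np

def generate_instance(n=1000, r=5, p=0.2, sigma=1e-3, seed=0):
    """Draw a rank-r ground truth M_star, a Bernoulli(p) sampling mask for Omega,
    and noisy observations M_obs supported on Omega (zeros elsewhere)."""
    rng = np.random.default_rng(seed)
    # A generic (not necessarily incoherent or balanced) rank-r ground truth.
    X_star = rng.standard_normal((n, r))
    Y_star = rng.standard_normal((n, r))
    M_star = X_star @ Y_star.T
    # Each index (i, j) is included in Omega independently with probability p.
    mask = rng.random((n, n)) < p
    # Revealed entries are corrupted by independent N(0, sigma^2) noise.
    M_obs = np.where(mask, M_star + sigma * rng.standard_normal((n, n)), 0.0)
    return M_star, M_obs, mask
```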

Incoherence Conditions.

Clearly, not all matrices can be reliably estimated from a highly incomplete set of measurements. To address this issue, we impose a standard incoherence condition (2) on the singular subspaces of $M^{\star}$ (i.e., $U^{\star}$ and $V^{\star}$),
$$\max\big\{\|U^{\star}\|_{2,\infty},\ \|V^{\star}\|_{2,\infty}\big\}\leq\sqrt{\mu r/n},$$
[4]
where $\mu$ is termed the incoherence parameter and $\|A\|_{2,\infty}$ denotes the largest $\ell_{2}$ norm of all rows of $A$. A small $\mu$ implies that the energies of $U^{\star}$ and $V^{\star}$ are reasonably spread out across all of their rows.
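For instance, the smallest $\mu$ satisfying Eq. 4 can be read off from the row norms of the singular subspaces; a small illustrative helper (not part of the paper) is given below.

```python
import numpy as np

def incoherence(M_star, r):
    """Smallest mu such that Eq. 4 holds for the rank-r SVD of M_star (square case)."""
    n = M_star.shape[0]
    U, _, Vt = np.linalg.svd(M_star, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T
    # max{ ||U||_{2,inf}^2, ||V||_{2,inf}^2 } <= mu * r / n  <=>  mu >= n * (...) / r
    max_row_energy = max(np.sum(U**2, axis=1).max(), np.sum(V**2, axis=1).max())
    return n * max_row_energy / r
```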

Asymptotic Notation.

$f(n)\lesssim h(n)$ (or $f(n)=O(h(n))$) means $|f(n)|\leq c_{1}|h(n)|$ for some constant $c_{1}>0$; $f(n)\gtrsim h(n)$ means $|f(n)|\geq c_{2}|h(n)|$ for some constant $c_{2}>0$; $f(n)\asymp h(n)$ means $c_{2}|h(n)|\leq|f(n)|\leq c_{1}|h(n)|$ for some constants $c_{1},c_{2}>0$; and $f(n)=o(h(n))$ means $\lim_{n\to\infty}f(n)/h(n)=0$.

Inferential Procedures and Main Results

The proposed inferential procedure builds on 2 of the most popular matrix completion paradigms: convex relaxation and nonconvex optimization. Recognizing the complicated bias of these 2 highly nonlinear estimators and motivated by refs. 14, 15, and 17, we first illustrate how to perform bias correction, followed by a theory that establishes the near-Gaussianity and optimality of the proposed debiased estimators.
Algorithm 1.
Gradient descent for solving Eq. 7
Suitable initialization: $X^{0}$, $Y^{0}$ (SI Appendix)
Gradient updates: for $t=0,1,\ldots,t_{0}-1$ do
$$X^{t+1}=X^{t}-\frac{\eta}{p}\Big[\mathcal{P}_{\Omega}\big(X^{t}Y^{t\top}-M\big)Y^{t}+\lambda X^{t}\Big],$$
[6a]
$$Y^{t+1}=Y^{t}-\frac{\eta}{p}\Big[\big[\mathcal{P}_{\Omega}\big(X^{t}Y^{t\top}-M\big)\big]^{\top}X^{t}+\lambda Y^{t}\Big],$$
[6b]
where $\eta>0$ determines the step size (i.e., the learning rate).

Background: Convex and Nonconvex Estimation Algorithms.

We first review in passing 2 computationally feasible estimation algorithms that are arguably the most widely used in practice. They serve as the starting point for us to design inferential procedures for noisy low-rank matrix completion.

Convex Relaxation.

Recall that the rank function is highly nonconvex, which often prevents us from computing a rank-constrained estimator in polynomial time. For the sake of computational feasibility, prior works suggest relaxing the rank function into its convex surrogate (22); for example, one can consider a penalized least-squares convex program
$$\underset{Z\in\mathbb{R}^{n\times n}}{\text{minimize}}\quad\frac{1}{2}\big\|\mathcal{P}_{\Omega}(Z-M)\big\|_{\mathrm{F}}^{2}+\lambda\|Z\|_{*}.$$
[5]
Here, $\|\cdot\|_{*}$ is the nuclear norm (the sum of singular values, which is a convex surrogate of the rank function), and $\lambda>0$ is some regularization parameter. Under mild conditions, the solution to the convex program Eq. 5 attains appealing estimation accuracy (in an order-wise sense), provided that a proper regularization parameter $\lambda$ is adopted (4, 13).
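The Numerical Experiments section later solves Eq. 5 with a proximal gradient method; a minimal sketch of that approach is shown below (a gradient step on the smooth data-fit term followed by singular value soft-thresholding, the proximal map of the nuclear norm). The step size and iteration count are illustrative.

```python
import numpy as np

def convex_mc(M_obs, mask, lam, step=1.0, n_iters=500):
    """Proximal gradient method for Eq. 5: gradient step on the squared loss,
    then the proximal map of lam * nuclear norm (singular value soft-thresholding)."""
    Z = np.zeros_like(M_obs)
    for _ in range(n_iters):
        grad = np.where(mask, Z - M_obs, 0.0)       # gradient of 0.5 ||P_Omega(Z - M)||_F^2
        W = Z - step * grad                          # gradient step
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        s = np.maximum(s - step * lam, 0.0)          # soft-threshold the singular values
        Z = (U * s) @ Vt
    return Z
```

Since the gradient of the data-fit term is 1-Lipschitz, a unit step size is a safe default.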

Nonconvex Optimization.

It is recognized that the convex approach, which typically relies on solving a semidefinite program, is still expensive and not scalable to large dimensions. This motivates an alternative route, which represents the matrix variable via 2 low-rank factors $X,Y\in\mathbb{R}^{n\times r}$ and attempts to solve the following nonconvex program directly:
$$\underset{X,Y\in\mathbb{R}^{n\times r}}{\text{minimize}}\quad\frac{1}{2}\big\|\mathcal{P}_{\Omega}\big(XY^{\top}-M\big)\big\|_{\mathrm{F}}^{2}+\frac{\lambda}{2}\|X\|_{\mathrm{F}}^{2}+\frac{\lambda}{2}\|Y\|_{\mathrm{F}}^{2}.$$
[7]
Here, we choose a regularizer of the form $\frac{\lambda}{2}\big(\|X\|_{\mathrm{F}}^{2}+\|Y\|_{\mathrm{F}}^{2}\big)$ primarily to mimic the nuclear norm penalty $\lambda\|Z\|_{*}$ (23, 24). A variety of optimization algorithms have been proposed to tackle the nonconvex program Eq. 7 or its variants (7, 10, 11, 25); readers are referred to ref. 26 for a recent overview. As a prominent example, a 2-stage algorithm—gradient descent following suitable initialization—provably enjoys fast convergence for a wide range of scenarios (11, 13). The present paper focuses on this simple yet powerful algorithm, as documented in Algorithm 1 and detailed in SI Appendix.
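A minimal NumPy rendering of the updates in Eqs. 6a and 6b follows. The initialization here is a plain spectral guess standing in for the scheme detailed in SI Appendix, and all names and default parameters are illustrative.

```python
import numpy as np

def nonconvex_gd(M_obs, mask, r, lam, p, n_iters=500):
    """Gradient updates of Algorithm 1 (Eqs. 6a and 6b) on the observed data."""
    # Spectral initialization (a simple stand-in for the scheme in SI Appendix):
    # rank-r SVD of the rescaled observation p^{-1} P_Omega(M).
    U, s, Vt = np.linalg.svd(M_obs / p, full_matrices=False)
    X = U[:, :r] * np.sqrt(s[:r])
    Y = Vt[:r, :].T * np.sqrt(s[:r])
    eta = 0.5 / s[0]                                         # step size ~ 1 / sigma_max
    for _ in range(n_iters):
        residual = np.where(mask, X @ Y.T - M_obs, 0.0)      # P_Omega(X Y^T - M)
        X_new = X - (eta / p) * (residual @ Y + lam * X)     # Eq. 6a
        Y_new = Y - (eta / p) * (residual.T @ X + lam * Y)   # Eq. 6b
        X, Y = X_new, Y_new
    return X, Y
```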

Intimate Connections between Convex and Nonconvex Estimates.

Denote by $Z_{\mathrm{cvx}}$ any minimizer of the convex program Eq. 5 and by $(X_{\mathrm{ncvx}},Y_{\mathrm{ncvx}})$ the estimate returned by Algorithm 1 aimed at solving Eq. 7. As was recently shown in ref. 13, when the regularization parameter $\lambda$ is properly chosen, these 2 estimates provably obey (see SI Appendix for a precise statement)
$$X_{\mathrm{ncvx}}Y_{\mathrm{ncvx}}^{\top}\approx Z_{\mathrm{cvx}}.$$
[8]
In truth, the 2 matrices in Eq. 8 are exceedingly close to, if not identical with, each other. This salient feature paves the way for a unified treatment of convex and nonconvex approaches: Most inferential procedures and guarantees developed for the nonconvex estimate can be readily transferred to perform inference for the convex one, and vice versa.

Constructing Debiased Estimators.

We are now well equipped to describe how to construct estimators based on the convex estimate Zcvx and the nonconvex estimate (Xncvx,Yncvx), to enable efficient inference. Motivated by the proximity of the convex and nonconvex estimates and for the sake of conciseness, we abuse notation by using Z,X,Y for both convex and nonconvex estimates; see Table 1 and SI Appendix for precise definitions. This allows us to unify the presentation for both convex and nonconvex estimators.
Table 1.
Notation used to unify the convex estimate $Z_{\mathrm{cvx}}$ and the nonconvex estimate $(X_{\mathrm{ncvx}},Y_{\mathrm{ncvx}})$
$Z\in\mathbb{R}^{n\times n}$: either $Z_{\mathrm{cvx}}$ or $X_{\mathrm{ncvx}}Y_{\mathrm{ncvx}}^{\top}$.
$X,Y\in\mathbb{R}^{n\times r}$: for the nonconvex case, we take $X=X_{\mathrm{ncvx}}$ and $Y=Y_{\mathrm{ncvx}}$; for the convex case, let $X=X_{\mathrm{cvx}}$ and $Y=Y_{\mathrm{cvx}}$, the balanced low-rank factors of $Z_{\mathrm{cvx},r}$ obeying $Z_{\mathrm{cvx},r}=X_{\mathrm{cvx}}Y_{\mathrm{cvx}}^{\top}$ and $X_{\mathrm{cvx}}^{\top}X_{\mathrm{cvx}}=Y_{\mathrm{cvx}}^{\top}Y_{\mathrm{cvx}}$.
$M^{d}\in\mathbb{R}^{n\times n}$: the proposed debiased estimator as in Eq. 10.
$X^{d},Y^{d}\in\mathbb{R}^{n\times r}$: the proposed debiased estimators of the low-rank factors as in Eq. 11.
Here, $Z_{\mathrm{cvx},r}\triangleq\mathcal{P}_{\text{rank-}r}(Z_{\mathrm{cvx}})$ is the best rank-$r$ approximation of $Z_{\mathrm{cvx}}$. See SI Appendix for a complete summary.
Given that Eqs. 5 and 7 are both regularized least-squares problems, they behave effectively like shrinkage estimators, indicating that the provided estimates necessarily suffer from nonnegligible bias. To enable the desired statistical inference, it is natural to first correct the estimation bias.

A debiased estimator for the matrix.

A natural debiasing strategy that immediately comes to mind is the simple linear transformation (recall the notation in Table 1)
$$Z_{0}\triangleq Z-\frac{1}{p}\mathcal{P}_{\Omega}(Z-M)=\underbrace{\frac{1}{p}\mathcal{P}_{\Omega}(M^{\star})}_{\text{mean: }M^{\star}}+\underbrace{\frac{1}{p}\mathcal{P}_{\Omega}(E)}_{\text{mean: }0}+\underbrace{Z-\frac{1}{p}\mathcal{P}_{\Omega}(Z)}_{\text{mean: }0\ (\text{heuristically})},$$
[9]
where we identify $\mathcal{P}_{\Omega}(M)$ with $\mathcal{P}_{\Omega}(M^{\star})+\mathcal{P}_{\Omega}(E)$. Heuristically, if $\Omega$ and $Z$ were statistically independent, then $Z_{0}$ would serve as an unbiased estimator of $M^{\star}$, i.e., $\mathbb{E}[Z_{0}]=M^{\star}$; this arises since the noise $E$ has zero mean and $\mathbb{E}[\mathcal{P}_{\Omega}]=p\,\mathcal{I}$ under the uniform random sampling model, with $\mathcal{I}$ the identity operator. Despite being (nearly) unbiased at this heuristic level, however, the matrix $Z_{0}$ is typically full rank, with nonnegligible energy spread across its entire spectrum. This results in dramatically increased variability in the estimate, which is undesirable for inferential purposes.
To remedy this issue, we propose to further project $Z_{0}$ onto the set of rank-$r$ matrices, leading to the estimator
$$M^{d}\triangleq\mathcal{P}_{\text{rank-}r}\Big(Z-\frac{1}{p}\mathcal{P}_{\Omega}(Z-M)\Big),$$
[10]
where $\mathcal{P}_{\text{rank-}r}(B)\triangleq\arg\min_{A:\,\mathrm{rank}(A)\leq r}\|A-B\|_{\mathrm{F}}$, and $Z$ can again be found in Table 1. This projection step effectively suppresses the variability outside the $r$-dimensional principal subspace. As we shall demonstrate, the proposed estimator Eq. 10 provably debiases the provided estimate $Z$, while optimally controlling the extent of variability.
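In code, Eq. 10 amounts to one masked correction of the estimate followed by a truncated SVD; the sketch below (illustrative names, building on the earlier setup) accepts any estimate $Z$ in the sense of Table 1.

```python
import numpy as np

def debias_matrix(Z, M_obs, mask, p, r):
    """Debiased matrix estimator M^d of Eq. 10: correct Z on the observed entries,
    then project onto the set of rank-r matrices via a truncated SVD."""
    Z0 = Z - np.where(mask, Z - M_obs, 0.0) / p      # Z - p^{-1} P_Omega(Z - M)
    U, s, Vt = np.linalg.svd(Z0, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]            # best rank-r approximation of Z0
```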

An equivalent perspective on the low-rank factors.

As it turns out, the debiased estimator Eq. 10 admits another almost equivalent representation that offers further insights. Specifically, we consider the following estimator for the low-rank factors,
$$X^{d}\triangleq X\big(I_{r}+p^{-1}\lambda(X^{\top}X)^{-1}\big)^{1/2},$$
[11a]
$$Y^{d}\triangleq Y\big(I_{r}+p^{-1}\lambda(Y^{\top}Y)^{-1}\big)^{1/2},$$
[11b]
where we recall the definitions of $X$ and $Y$ in Table 1. To develop some intuition, let us look at a simple scenario where $U\Sigma V^{\top}$ is the rank-$r$ SVD of $XY^{\top}$ and $X=U\Sigma^{1/2}$, $Y=V\Sigma^{1/2}$. It is then self-evident that $X^{d}=U\big(\Sigma+(\lambda/p)I_{r}\big)^{1/2}$ and $Y^{d}=V\big(\Sigma+(\lambda/p)I_{r}\big)^{1/2}$. In words, $X^{d}$ and $Y^{d}$ are obtained by deshrinking the spectrum of $X$ and $Y$ properly.
As we formally establish in SI Appendix, the estimator Eq. 11 for the low-rank factors is extremely close to the debiased estimator Eq. 10 for the whole matrix, in the sense that
$$M^{d}\approx X^{d}Y^{d\top}.$$
[12]
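The deshrinking in Eq. 11 takes one line per factor once a symmetric matrix square root is available; a minimal sketch (illustrative, using SciPy's sqrtm) follows.

```python
import numpy as np
from scipy.linalg import sqrtm

def debias_factors(X, Y, lam, p):
    """Debiased low-rank factors (X^d, Y^d) of Eqs. 11a and 11b."""
    r = X.shape[1]
    Xd = X @ sqrtm(np.eye(r) + (lam / p) * np.linalg.inv(X.T @ X))
    Yd = Y @ sqrtm(np.eye(r) + (lam / p) * np.linalg.inv(Y.T @ Y))
    # sqrtm can return a complex array with negligible imaginary parts.
    return np.real(Xd), np.real(Yd)
```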

Main Results: Distributional Guarantees.

The proposed estimators admit tractable distributional characterizations in the large-n regime, which facilitates the construction of confidence regions for many quantities of interest. In particular, this paper centers around 2 types of inferential problems:
1)
Each entry of the matrix $M^{\star}$: The entry can be either missing (i.e., predicting an unseen entry) or observed (i.e., denoising an observed entry). For example, in the problem of sensor localization (27), one wants to infer the distance between any 2 sensors, given partially revealed distances. Mathematically, this seeks to determine the distribution of
$$M^{d}_{ij}-M^{\star}_{ij},\qquad\text{for all }1\leq i,j\leq n.$$
[13]
2)
The low-rank factors $X^{\star},Y^{\star}\in\mathbb{R}^{n\times r}$: The low-rank factors often reveal critical information about the applications of interest [e.g., community memberships of each individual in the community detection problem (28), angles between each object and a global reference point in the angular synchronization problem (29), or factor loadings and latent factors in factor analysis (30)]. Recognizing the global rotational ambiguity issue,§ we aim to pin down the distributions of $X^{d}$ and $Y^{d}$ up to global rotational ambiguity. More precisely, we intend to characterize the distributions of
$$X^{d}H^{d}-X^{\star}\qquad\text{and}\qquad Y^{d}H^{d}-Y^{\star}$$
[14]
for the global rotation matrix $H^{d}\in\mathbb{R}^{r\times r}$ that best "aligns" $(X^{d},Y^{d})$ and $(X^{\star},Y^{\star})$, i.e.,
$$H^{d}\triangleq\underset{R\in\mathcal{O}^{r\times r}}{\arg\min}\ \big\|X^{d}R-X^{\star}\big\|_{\mathrm{F}}^{2}+\big\|Y^{d}R-Y^{\star}\big\|_{\mathrm{F}}^{2}.$$
[15]
Here, $\mathcal{O}^{r\times r}$ is the set of orthonormal matrices in $\mathbb{R}^{r\times r}$.
Clearly, the above 2 inferential problems are tightly related: An accurate distributional characterization for the low-rank factors (Eq. 14) often results in a distributional guarantee for the entries (Eq. 13).
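Note that $H^{d}$ depends on the unknown factors and is therefore only computable in simulations, where it reduces to an orthogonal Procrustes problem; a minimal sketch (illustrative, not taken from the paper's code) is given below.

```python
import numpy as np

def best_rotation(Xd, Yd, X_star, Y_star):
    """Solve Eq. 15: the orthonormal R minimizing
    ||Xd R - X_star||_F^2 + ||Yd R - Y_star||_F^2 (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(Xd.T @ X_star + Yd.T @ Y_star)
    return U @ Vt
```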

Distributional guarantees for low-rank factors.

We begin with our distributional characterizations of the low-rank factors. Here, $e_{i}$ denotes the $i$th standard basis vector in $\mathbb{R}^{n}$.

Theorem 1.

Suppose that the sample complexity meets $n^{2}p\geq C\kappa^{8}\mu^{3}r^{2}n\log^{3}n$ for some sufficiently large constant $C>0$ and the noise obeys $\sigma/\sigma_{\min}\leq c\sqrt{p/(\kappa^{8}\mu rn\log^{2}n)}$ for some sufficiently small constant $c>0$. Then one can write
$$X^{d}H^{d}-X^{\star}=Z_{X}+\Psi_{X},$$
[16a]
$$Y^{d}H^{d}-Y^{\star}=Z_{Y}+\Psi_{Y},$$
[16b]
with $(X^{\star},Y^{\star})$ defined in Eq. 1, $(X^{d},Y^{d})$ defined in Table 1, and $H^{d}$ defined in Eq. 15. Here, the rows of $Z_{X}\in\mathbb{R}^{n\times r}$ (resp. $Z_{Y}\in\mathbb{R}^{n\times r}$) are independent and obey
$$Z_{X}^{\top}e_{j}\overset{\text{i.i.d.}}{\sim}\mathcal{N}\Big(0,\ \frac{\sigma^{2}}{p}(\Sigma^{\star})^{-1}\Big),\qquad\text{for }1\leq j\leq n;$$
[17a]
$$Z_{Y}^{\top}e_{j}\overset{\text{i.i.d.}}{\sim}\mathcal{N}\Big(0,\ \frac{\sigma^{2}}{p}(\Sigma^{\star})^{-1}\Big),\qquad\text{for }1\leq j\leq n.$$
[17b]
In addition, the residual matrices $\Psi_{X},\Psi_{Y}\in\mathbb{R}^{n\times r}$ satisfy, with probability at least $1-O(n^{-3})$, that
$$\max\big\{\|\Psi_{X}\|_{2,\infty},\ \|\Psi_{Y}\|_{2,\infty}\big\}=o\Big(\sigma\sqrt{\frac{r}{p\,\sigma_{\max}}}\Big).$$
[18]
In words, Theorem 1 decomposes the estimation error $X^{d}H^{d}-X^{\star}$ (resp. $Y^{d}H^{d}-Y^{\star}$) into a Gaussian component $Z_{X}$ (resp. $Z_{Y}$) and a residual term $\Psi_{X}$ (resp. $\Psi_{Y}$). If the sample size is sufficiently large and the noise size is sufficiently small, then the residual terms are much smaller in size compared to $Z_{X}$ and $Z_{Y}$. To see this, it is helpful to leverage the Gaussianity (Eq. 17a) to compute that for each $1\leq j\leq n$, the $j$th row of $Z_{X}$ obeys
$$\mathbb{E}\big\|Z_{X}^{\top}e_{j}\big\|_{2}^{2}=\mathrm{Tr}\Big(\frac{\sigma^{2}}{p}(\Sigma^{\star})^{-1}\Big)\geq\frac{\sigma^{2}r}{p\,\sigma_{\max}};$$
in other words, the typical size of the $j$th row of $Z_{X}$ is no smaller than the order of $\sigma\sqrt{r/(p\,\sigma_{\max})}$. In comparison, the size of each row of $\Psi_{X}$ (Eq. 18) is much smaller than $\sigma\sqrt{r/(p\,\sigma_{\max})}$ (and hence smaller than the size of the corresponding row of $Z_{X}$) with high probability.

Remark 1.

Another interesting feature—which we make precise in the proof of Theorem 1—is that for any given $1\leq i,j\leq n$, the two random vectors $Z_{X}^{\top}e_{i}$ and $Z_{Y}^{\top}e_{j}$ are nearly statistically independent. This is crucial for deriving inferential guarantees for the entries of the matrix.

Distributional guarantees for matrix entries.

Equipped with the above theory for the low-rank factors and Remark 1, we are ready to characterize the distribution of $M^{d}_{ij}-M^{\star}_{ij}$.

Theorem 2.

For each $1\leq i,j\leq n$, define the variance $v^{\star}_{ij}$ as
$$v^{\star}_{ij}\triangleq\frac{\sigma^{2}}{p}\Big(\big\|U^{\star}_{i,\cdot}\big\|_{2}^{2}+\big\|V^{\star}_{j,\cdot}\big\|_{2}^{2}\Big),$$
[19]
where $U^{\star}_{i,\cdot}$ (resp. $V^{\star}_{j,\cdot}$) denotes the $i$th (resp. $j$th) row of $U^{\star}$ (resp. $V^{\star}$). Suppose that
$$np\gg\kappa^{8}\mu^{3}r^{3}\log^{3}n,\qquad\sigma\sqrt{(\kappa^{8}\mu rn\log^{2}n)/p}\ll\sigma_{\min},$$
[20a]
$$\text{and}\qquad\big\|U^{\star}_{i,\cdot}\big\|_{2}+\big\|V^{\star}_{j,\cdot}\big\|_{2}\gg\sqrt{\frac{r}{n}}\cdot\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{\kappa^{6}\mu^{2}rn\log^{3}n}{p}}.$$
[20b]
Then the matrix $M^{d}$ defined in Table 1 satisfies
$$M^{d}_{ij}-M^{\star}_{ij}=g_{ij}+\Delta_{ij},$$
[21]
where $g_{ij}\sim\mathcal{N}(0,v^{\star}_{ij})$ and the residual obeys $|\Delta_{ij}|=o\big(\sqrt{v^{\star}_{ij}}\big)$ with probability exceeding $1-O(n^{-3})$.
Several remarks are in order. First, we develop some intuition regarding where the formula $v^{\star}_{ij}$ comes from. By virtue of Theorem 1, one has the following Gaussian approximation
$$X^{d}H^{d}-X^{\star}\approx Z_{X}\qquad\text{and}\qquad Y^{d}H^{d}-Y^{\star}\approx Z_{Y}.$$
Assuming that the first-order expansion is tight, one has
$$M^{d}_{ij}-M^{\star}_{ij}=\big[X^{d}H^{d}\big(Y^{d}H^{d}\big)^{\top}-X^{\star}Y^{\star\top}\big]_{ij}\approx e_{i}^{\top}\big(X^{d}H^{d}-X^{\star}\big)Y^{\star\top}e_{j}+e_{i}^{\top}X^{\star}\big(Y^{d}H^{d}-Y^{\star}\big)^{\top}e_{j}\approx e_{i}^{\top}Z_{X}Y^{\star\top}e_{j}+e_{i}^{\top}X^{\star}Z_{Y}^{\top}e_{j}.$$
[22]
According to Remark 1, $Z_{X}^{\top}e_{i}$ and $Z_{Y}^{\top}e_{j}$ are nearly independent. One can thus compute the variance of Eq. 22 as
$$\mathrm{Var}\big[M^{d}_{ij}-M^{\star}_{ij}\big]\overset{\text{(i)}}{\approx}\mathrm{Var}\big[e_{i}^{\top}Z_{X}Y^{\star\top}e_{j}\big]+\mathrm{Var}\big[e_{i}^{\top}X^{\star}Z_{Y}^{\top}e_{j}\big]\overset{\text{(ii)}}{=}p^{-1}\sigma^{2}\big(e_{j}^{\top}Y^{\star}(\Sigma^{\star})^{-1}Y^{\star\top}e_{j}+e_{i}^{\top}X^{\star}(\Sigma^{\star})^{-1}X^{\star\top}e_{i}\big)\overset{\text{(iii)}}{=}p^{-1}\sigma^{2}\big(\|U^{\star}_{i,\cdot}\|_{2}^{2}+\|V^{\star}_{j,\cdot}\|_{2}^{2}\big)=v^{\star}_{ij}.$$
Here, (i) relies on Eq. 22 and the near independence between $Z_{X}^{\top}e_{i}$ and $Z_{Y}^{\top}e_{j}$, (ii) uses the variance formula in Theorem 1, and (iii) arises from the definitions of $X^{\star}$ and $Y^{\star}$ (cf. Eq. 1). This explains (heuristically) the variance formula $v^{\star}_{ij}$.
Given that Theorem 2 reveals the tightness of the Gaussian approximation under the conditions in Eq. 20, it in turn allows us to construct nearly accurate confidence intervals for each matrix entry $M^{\star}_{ij}$. This is formally summarized in the following corollary. Here, $[a\pm b]$ denotes the interval $[a-b,a+b]$.

Corollary 1 (Confidence Intervals for the Entries $\{M^{\star}_{ij}\}$).

Let $X^{d}$, $Y^{d}$, and $M^{d}$ be as defined in Table 1. For any given $1\leq i,j\leq n$, suppose that Eq. 20a holds and that
$$\big\|U^{\star}_{i,\cdot}\big\|_{2}+\big\|V^{\star}_{j,\cdot}\big\|_{2}\gg\sqrt{\frac{r}{n}}\cdot\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{\kappa^{10}\mu^{2}rn\log^{3}n}{p}}.$$
[23]
Denote by $\Phi(t)$ the CDF of a standard Gaussian random variable and by $\Phi^{-1}(\cdot)$ its inverse function. Let
$$\hat{v}_{ij}\triangleq\frac{\sigma^{2}}{p}\Big(X^{d}_{i,\cdot}\big(X^{d\top}X^{d}\big)^{-1}\big(X^{d}_{i,\cdot}\big)^{\top}+Y^{d}_{j,\cdot}\big(Y^{d\top}Y^{d}\big)^{-1}\big(Y^{d}_{j,\cdot}\big)^{\top}\Big)$$
[24]
be the empirical estimate of $v^{\star}_{ij}$. Then one has
$$\sup_{0<\alpha<1}\Big|\mathbb{P}\Big\{M^{\star}_{ij}\in\Big[M^{d}_{ij}\pm\Phi^{-1}(1-\alpha/2)\sqrt{\hat{v}_{ij}}\Big]\Big\}-(1-\alpha)\Big|=o(1).$$
In words, Corollary 1 tells us that for any fixed significance level $0<\alpha<1$, the interval
$$\Big[M^{d}_{ij}\pm\Phi^{-1}(1-\alpha/2)\sqrt{\hat{v}_{ij}}\Big]$$
[25]
is a nearly accurate $(1-\alpha)$ confidence interval for $M^{\star}_{ij}$.
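A minimal sketch of this interval construction (Eqs. 24 and 25) is given below; it takes the debiased quantities from the earlier sketches together with a known (or separately estimated) noise level $\sigma$, and all function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def entry_confidence_interval(Md, Xd, Yd, i, j, sigma, p, alpha=0.05):
    """(1 - alpha) confidence interval for the (i, j) entry of the unknown matrix,
    built from the debiased estimate and the plug-in variance of Eq. 24."""
    xi = Xd[i]                                    # i-th row of X^d
    yj = Yd[j]                                    # j-th row of Y^d
    v_hat = (sigma**2 / p) * (
        xi @ np.linalg.solve(Xd.T @ Xd, xi) +     # X^d_{i,.} (X^dT X^d)^{-1} (X^d_{i,.})^T
        yj @ np.linalg.solve(Yd.T @ Yd, yj)
    )
    half_width = norm.ppf(1 - alpha / 2) * np.sqrt(v_hat)
    return Md[i, j] - half_width, Md[i, j] + half_width
```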
In addition, we remark that when $\|U^{\star}_{i,\cdot}\|_{2}=\|V^{\star}_{j,\cdot}\|_{2}=0$ (and hence $v^{\star}_{ij}=0$), the above Gaussian approximation is completely off. In this case, one can still leverage Theorem 1 to show that
$$M^{d}_{ij}-M^{\star}_{ij}=M^{d}_{ij}\approx u^{\top}v,$$
[26]
where $u,v\in\mathbb{R}^{r}$ are independent and identically distributed according to $\mathcal{N}\big(0,\sigma^{2}(\Sigma^{\star})^{-1}/p\big)$. However, it is nontrivial to determine whether $\|U^{\star}_{i,\cdot}\|_{2}+\|V^{\star}_{j,\cdot}\|_{2}$ is vanishingly small or not based on the observed data, which makes it challenging to conduct efficient inference for entries with small (but a priori unknown) $\|U^{\star}_{i,\cdot}\|_{2}+\|V^{\star}_{j,\cdot}\|_{2}$.
Last but not least, the careful reader might wonder how to interpret our conditions on the sample complexity and the signal-to-noise ratio. Take the case with $r,\mu,\kappa=O(1)$ for example: Our conditions read
$$n^{2}p\gtrsim n\log^{3}n;\qquad\sigma/\sigma_{\min}\lesssim\sqrt{p/(n\log^{2}n)}.$$
[27]
The first condition matches the sample complexity limit (up to some log factor), while the second one coincides with the regime (up to log factor) in which popular algorithms (like spectral methods or nonconvex algorithms) work better than a random guess (7, 10, 11). The take-away message is this: Once we are able to compute a reasonable estimate in an overall $\ell_{2}$ sense, we can reinforce it to conduct entrywise inference in a statistically efficient fashion.

Lower Bounds and Optimality for Inference.

It is natural to ask how well our inferential procedures perform compared to other algorithms. Encouragingly, the debiased estimator is optimal in some sense; for instance, it attains the minimum covariance among all unbiased estimators. To formalize this claim, we 1) quantify the performance of 2 ideal estimators with the assistance of an oracle and 2) demonstrate that the performance of our debiased estimators is arbitrarily close to that of the ideal estimators. We only sketch such results here; see SI Appendix for precise statements. Below, we denote by $X^{\star}_{i,\cdot}$ (resp. $Y^{\star}_{i,\cdot}$) the $i$th row of $X^{\star}$ (resp. $Y^{\star}$).

Lower bound for estimating $X^{\star}_{i,\cdot}$ $(1\leq i\leq n)$.

Suppose there is an oracle informing us of $Y^{\star}$ and we observe the same set of data as in Eq. 3. Under such an idealistic setting and under our sample complexity condition, one has, with high probability, that any unbiased estimator $\hat{X}_{i,\cdot}$ of $X^{\star}_{i,\cdot}$ satisfies
$$\mathrm{Cov}\big[\hat{X}_{i,\cdot}-X^{\star}_{i,\cdot}\,\big|\,\Omega\big]\succeq\big(1-o(1)\big)\,p^{-1}\sigma^{2}(\Sigma^{\star})^{-1}.$$
This reveals that the covariance of the estimator $X^{d}_{i,\cdot}$ (cf. Theorem 1) attains the Cramér–Rao lower bound with high probability. The same conclusion applies to $Y^{d}_{j,\cdot}$ too.

Lower bound for estimating $M^{\star}_{ij}$ $(1\leq i,j\leq n)$.

Suppose there is another oracle informing us of $\{X^{\star}_{k,\cdot}\}_{k:k\neq i}$ and $\{Y^{\star}_{k,\cdot}\}_{k:k\neq j}$, that is, everything about $X^{\star}$ except $X^{\star}_{i,\cdot}$ and everything about $Y^{\star}$ except $Y^{\star}_{j,\cdot}$. In addition, we observe the same set of data as in Eq. 3, except that we do not get to see $M_{ij}$. Under this idealistic model, one can show that with high probability, any unbiased estimator of $M^{\star}_{ij}$ must have variance no smaller than $(1-o(1))\,v^{\star}_{ij}$, where $v^{\star}_{ij}$ is defined in Theorem 2. This indicates that the variance of our debiased estimator $M^{d}_{ij}$ (cf. Theorem 2)—which certainly does not have access to the side information provided by the oracle—is arbitrarily close to the Cramér–Rao lower bound aided by an oracle.

Back to Estimation: The Debiased Estimator Is Optimal.

While the emphasis herein is on inference, we nevertheless single out an important consequence that informs the estimation step. To be specific, the distributional guarantees derived in Theorems 1 and 2 allow us to track the estimation accuracy of Md, as stated below.

Theorem 3 (Estimation Accuracy of Md).

Let $M^{d}$ be the debiased estimator as defined in Table 1. Instate the conditions in Eq. 20a. Then with probability at least $1-O(n^{-3})$, one has
$$\big\|M^{d}-M^{\star}\big\|_{\mathrm{F}}^{2}=\big(2+o(1)\big)\,nr\sigma^{2}/p.$$
[28]
In stark contrast to prior statistical estimation guarantees (e.g., refs. 4–6 and 13), Theorem 3 pins down the estimation error of the proposed debiased estimator in a sharp manner (namely, even the preconstant is fully determined). Encouragingly, there is a sense in which the proposed debiased estimator achieves the best possible statistical estimation accuracy. In fact, a lower bound has already been derived in ref. 4, section III.B, asserting that one cannot beat the mean-square estimation error $(2-o(1))\,nr\sigma^{2}/p$ even with the help of an oracle. See SI Appendix for a precise statement.
The implication of Theorems 1 to 3 is remarkable: The debiasing step not only facilitates uncertainty assessment, but also proves crucial in minimizing estimation errors. In other words, a polynomial-time algorithm achieves optimal statistical efficiency, matching the statistical limit in terms of both the rate and the preconstant. This intriguing finding is further corroborated by numerical experiments (see Fig. 2).

Numerical Experiments.

We conduct numerical experiments on synthetic data to verify the distributional characterizations provided in Theorem 2. The verification of Theorem 1 is left to SI Appendix. Note that our main results hold for the debiased estimators built upon Zcvx and XncvxYncvx. As we formalize in SI Appendix, these 2 debiased estimators are extremely close to each other. Therefore, to save space, we use the debiased estimator built upon the convex estimate Zcvx throughout the experiments.
Fix the dimension $n=1{,}000$ and the regularization parameter $\lambda=2.5\sigma\sqrt{np}$ throughout the experiments. We generate a rank-$r$ matrix $M^{\star}=X^{\star}Y^{\star\top}$, where $X^{\star},Y^{\star}\in\mathbb{R}^{n\times r}$ are random orthonormal matrices, and apply the proximal gradient method to solve the convex program Eq. 5.
Denote $S_{ij}\triangleq\hat{v}_{ij}^{-1/2}\big(M^{d}_{ij}-M^{\star}_{ij}\big)$, where $\hat{v}_{ij}$ is the empirical variance defined in Eq. 24. In view of the 95% confidence interval predicted by Corollary 1, for each $(i,j)$, we define $\widehat{\mathrm{Cov}}_{E,(i,j)}$ to be the empirical coverage rate of $M^{\star}_{ij}$ over 200 Monte Carlo simulations. Correspondingly, denote by $\mathrm{Mean}(\widehat{\mathrm{Cov}}_{E})$ (resp. $\mathrm{Std}(\widehat{\mathrm{Cov}}_{E})$) the average (resp. the SD) of $\widehat{\mathrm{Cov}}_{E,(i,j)}$ over indexes $1\leq i,j\leq n$. Table 2 gathers the empirical coverage rates for $M^{\star}_{ij}$, and Fig. 1 displays the quantile–quantile (Q-Q) plots of $S_{11}$ and $S_{12}$ vs. the standard Gaussian random variable over 200 Monte Carlo trials for $r=5$, $p=0.4$, and $\sigma=10^{-3}$. It is evident that the distribution of $S_{ij}$ matches that of $\mathcal{N}(0,1)$ reasonably well.
Table 2.
Empirical coverage rates of $M^{\star}_{ij}$ for different $(r,p,\sigma)$ over 200 Monte Carlo trials
$(r,p,\sigma)$           Mean($\widehat{\mathrm{Cov}}_{E}$)    Std($\widehat{\mathrm{Cov}}_{E}$)
$(2,0.2,10^{-6})$        0.9380                                0.0200
$(2,0.2,10^{-3})$        0.9392                                0.0196
$(2,0.4,10^{-6})$        0.9455                                0.0164
$(2,0.4,10^{-3})$        0.9456                                0.0164
$(5,0.2,10^{-6})$        0.9226                                0.0247
$(5,0.2,10^{-3})$        0.9271                                0.0228
$(5,0.4,10^{-6})$        0.9410                                0.0173
$(5,0.4,10^{-3})$        0.9417                                0.0172
Fig. 1.
Q-Q plots of $S_{11}$ (Left) and $S_{12}$ (Right) vs. the standard normal distribution. The results are reported over 200 independent trials for $r=5$, $p=0.4$, and $\sigma=10^{-3}$.
In addition to the tractable distributional guarantees, the debiased estimator $M^{d}$ also exhibits superior estimation accuracy compared to the original estimator $Z_{\mathrm{cvx}}$ (cf. Theorem 3). Fig. 2 reports the estimation errors of $M^{d}$ vs. $Z_{\mathrm{cvx}}$, measured in both the Frobenius norm and the entrywise $\ell_{\infty}$ norm, across different noise levels. The results are averaged over 20 Monte Carlo simulations for $r=5$ and $p=0.2$. It can be seen that the errors of the debiased estimator are uniformly smaller than those of the original estimator and are much closer to the oracle lower bound. As a result, we recommend using $M^{d}$ even for the purpose of estimation.
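For readers who want to reproduce this comparison qualitatively on synthetic data, the earlier sketches can be wired together as follows; the helpers (generate_instance, convex_mc, debias_matrix) are the hypothetical ones defined above, and the last line prints the Frobenius errors next to the oracle level $\sqrt{2nr\sigma^{2}/p}$ from Theorem 3.

```python
import numpy as np

# Hypothetical helpers from the earlier sketches: generate_instance, convex_mc, debias_matrix.
n, r, p, sigma = 1000, 5, 0.2, 1e-3
M_star, M_obs, mask = generate_instance(n, r, p, sigma)
lam = 2.5 * sigma * np.sqrt(n * p)              # regularization used in the experiments
Z_cvx = convex_mc(M_obs, mask, lam)
M_d = debias_matrix(Z_cvx, M_obs, mask, p, r)

err_cvx = np.linalg.norm(Z_cvx - M_star, "fro")
err_deb = np.linalg.norm(M_d - M_star, "fro")
oracle = np.sqrt(2 * n * r * sigma**2 / p)      # Theorem 3 benchmark
print(f"convex: {err_cvx:.4f}   debiased: {err_deb:.4f}   oracle: {oracle:.4f}")
```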
Fig. 2.
(Left) Estimation error of $Z_{\mathrm{cvx}}$ vs. $M^{d}$ measured in the Frobenius norm. (Right) Estimation error of $Z_{\mathrm{cvx}}$ vs. $M^{d}$ measured in the entrywise $\ell_{\infty}$ norm. The results are averaged over 20 independent trials for $r=5$, $p=0.2$, and $n=1{,}000$.
We conclude this section with experiments on real data. Similar to ref. 4, we use the daily temperature data (31) for 1,400 stations across the world in 2018, which results in a 1,400×365 data matrix. Inspection of the singular values reveals that the data matrix is nearly low rank. We vary the observation probability p from 0.5 to 0.9 and randomly subsample the data accordingly. Based on the observed temperatures, we then apply the proposed methodology to obtain 95% confidence intervals for all of the entries. Table 3 reports the empirical coverage probabilities and the average length of the confidence intervals as well as the estimation error of both Zcvx and Md over 20 independent experiments. It can be seen that the average coverage probabilities are reasonably close to 95% and the confidence intervals are also quite short. In addition, the estimation error of Md is smaller than that of Zcvx, which corroborates our theoretical prediction. The discrepancy between the nominal coverage probability and the actual one might arise from the facts that 1) the underlying true temperature matrix is only approximately low rank and 2) the noise in the temperature might not be independent.
Table 3.
Empirical coverage rates and average lengths of the confidence intervals of the entries, as well as the estimation error, vs. the observation probability $p$
$p$      Coverage: Mean / SD      CI length: Mean / SD      $\|\hat{Z}-M\|_{\mathrm{F}}/\|M\|_{\mathrm{F}}$: convex $Z_{\mathrm{cvx}}$ / debiased $M^{d}$
0.5      0.8265 / 0.0016          3.6698 / 0.0209           0.029 / 0.028
0.6      0.8268 / 0.0011          2.8774 / 0.0098           0.025 / 0.023
0.7      0.8431 / 0.0006          2.3426 / 0.0054           0.022 / 0.019
0.8      0.8725 / 0.0003          2.0234 / 0.0052           0.020 / 0.015
0.9      0.9093 / 0.0003          1.8296 / 0.0072           0.018 / 0.011
The results are averaged over 20 Monte Carlo trials.

Discussion

The present paper makes progress toward inference and uncertainty quantification for noisy matrix completion, by developing simple debiased estimators that admit tractable and accurate distributional characterizations. While we have achieved some early success in accomplishing this, our results are likely suboptimal in the dependency on the rank $r$ and the condition number $\kappa$. Also, our theory operates under the moderate-to-high signal-to-noise ratio (SNR) regime, where $\sigma_{\min}^{2}/\sigma^{2}$ (which is proportional to the SNR) is required to exceed the order of $n/p$; see the conditions in Theorem 1. How to conduct inference in the low-SNR regime is an important future direction.
More broadly, this paper uncovers that computational feasibility and full statistical efficiency can sometimes be simultaneously achieved despite a high degree of nonconvexity. The analysis and insights herein might shed light on inference for a broader class of nonconvex statistical problems.

Acknowledgments

Y.C. is supported in part by the Air Force Office of Scientific Research Young Investigator Program award FA9550-19-1-0030, by the Office of Naval Research (ONR) grant N00014-19-1-2120, by the Army Research Office grant W911NF-18-1-0303, by the NSF grants CCF-1907661 and IIS-1900140, and by the Princeton School of Engineering and Applied Science innovation award. J.F. is supported in part by NSF grants DMS-1662139 and DMS-1712591, ONR grant N00014-19-1-2120, and NIH grant 2R01-GM072611-13. C.M. is supported in part by Hudson River Trading AI Labs (HAIL) Fellowship. This work was done in part while Y.C. was visiting the Kavli Institute for Theoretical Physics (supported in part by the NSF grant PHY-1748958). We thank Weijie Su for helpful discussions.

Supporting Information

Appendix (PDF)

References

1
N. Srebro, “Learning with matrix factorizations,” PhD thesis, Massachusetts Institute of Technology, Cambridge, MA (2004).
2
E. Candès, B. Recht, Exact matrix completion via convex optimization. Found. Comput. Math. 9, 717–772 (2009).
3
R. H. Keshavan, A. Montanari, S. Oh, Matrix completion from a few entries. IEEE Trans. Inf. Theory 56, 2980–2998 (2010).
4
E. Candès, Y. Plan, Matrix completion with noise. Proc. IEEE 98, 925–936 (2010).
5
S. Negahban, M. Wainwright, Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res. 13, 1665–1697 (2012).
6
V. Koltchinskii, K. Lounici, A. B. Tsybakov, Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39, 2302–2329 (2011).
7
R. H. Keshavan, A. Montanari, S. Oh, Matrix completion from noisy entries. J. Mach. Learn. Res. 11, 2057–2078 (2010).
8
D. Gross, Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory 57, 1548–1566 (2011).
9
R. Foygel, N. Srebro, “Concentration-based guarantees for low-rank matrix reconstruction” in Conference on Learning Theory, S. M. Kakade, U. von Luxburg, Eds. (Proceedings of Machine Learning Research, Budapest, Hungary, 2011), pp. 315–340.
10
Y. Chen, M. J. Wainwright, Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv:1509.03025 (10 September 2015).
11
C. Ma, K. Wang, Y. Chi, Y. Chen, Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Found. Comput. Math., (2019).
12
A. Carpentier, O. Klopp, M. Löffler, R. Nickl, Adaptive confidence sets for matrix completion. Bernoulli 24, 2429–2460 (2018).
13
Y. Chen, Y. Chi, J. Fan, C. Ma, Y. Yan, Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. arXiv:1902.07698 (20 February 2019).
14
C. H. Zhang, S. S. Zhang, Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B 76, 217–242 (2014).
15
A. Javanmard, A. Montanari, Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15, 2869–2909 (2014).
16
R. Lockhart, J. Taylor, R. J. Tibshirani, R. Tibshirani, A significance test for the lasso. Ann. Stat. 42, 413–468 (2014).
17
S. van de Geer, P. Bühlmann, Y. Ritov, R. Dezeure, On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat. 42, 1166–1202 (2014).
18
T. T. Cai, Z. Guo, Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. Ann. Stat. 45, 615–646 (2017).
19
Y. Ning, H. Liu, A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Stat. 45, 158–195 (2017).
20
J. Jankova, S. Van De Geer, Confidence intervals for high-dimensional inverse covariance estimation. Electron. J. Stat. 9, 1205–1229 (2015).
21
Z. Ren, T. Sun, C. Zhang, H. Zhou, Asymptotic normality and optimalities in estimation of large Gaussian graphical models. Ann. Stat. 43, 991–1026 (2015).
22
B. Recht, M. Fazel, P. A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52, 471–501 (2010).
23
N. Srebro, A. Shraibman, “Rank, trace-norm and max-norm” in International Conference on Computational Learning Theory, P. Auer, R. Meir, Eds. (Springer, Berlin, Germany, 2005), pp. 545–560.
24
R. Mazumder, T. Hastie, R. Tibshirani, Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010).
25
R. Sun, Z. Q. Luo, Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inf. Theory 62, 6535–6579 (2016).
26
Y. Chi, Y. M. Lu, Y. Chen, Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Trans Signal Process 67, 5239–5269 (2019).
27
A. M. C. So, Y. Ye, Theory of semidefinite programming for sensor network localization. Math. Program. 109, 367–384 (2007).
28
E. Abbe, J. Fan, K. Wang, Y. Zhong, Entrywise eigenvector analysis of random matrices with low expected rank. arXiv:1709.09565 (2 May 2017).
29
A. Singer, Angular synchronization by eigenvectors and semidefinite programming. Appl. Comput. Harmon. Anal. 30, 20–36 (2011).
30
J. Fan, Y. Liao, M. Mincheva, Large covariance estimation by thresholding principal orthogonal complements. J. R. Stat. Soc. Ser. B 75, 603–680 (2013).
31
National Climatic Data Center. https://www.ncdc.noaa.gov/ (31 August 2019).

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences
Vol. 116 | No. 46
November 12, 2019
PubMed: 31666329

Submission history

Published online: October 30, 2019
Published in issue: November 12, 2019

Keywords

  1. confidence intervals
  2. convex relaxation
  3. nonconvex optimization


Notes

This article is a PNAS Direct Submission.
We restrict attention to square matrices for simplicity of presentation. Most findings extend immediately to the more general rectangular case $M^{\star}\in\mathbb{R}^{n_{1}\times n_{2}}$ with different $n_{1}$ and $n_{2}$.
The true rank $r$ can often be reliably estimated in a data-dependent manner. For instance, according to ref. 13, theorem 1, one can employ a rank estimator $\hat{r}\triangleq\min\{i:\sigma_{i+1}(Z)/\sigma_{i}(Z)\leq n^{-1/2}\}$, which recovers the true rank with high probability under our assumptions.
§
For any $r\times r$ rotation matrix $H$, we cannot possibly distinguish $X^{\star},Y^{\star}$ from $X^{\star}H,Y^{\star}H$, if only pairwise measurements are available.
The exclusion of $M_{ij}$ is merely for ease of presentation. One can consider the model where all $M_{ij}$ with $(i,j)\in\Omega$ are observed, with a slightly more complicated argument.

Authors

Affiliations

Yuxin Chen
Department of Electrical Engineering, Princeton University, Princeton, NJ 08544;
Jianqing Fan
Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544
Cong Ma
Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544
Yuling Yan
Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544

Notes

1
To whom correspondence may be addressed. Email: [email protected].
Author contributions: Y.C., J.F., C.M., and Y.Y. designed research; Y.C., J.F., C.M., and Y.Y. performed research; C.M. and Y.Y. analyzed data; and Y.C., J.F., C.M., and Y.Y. wrote the paper.

Competing Interests

The authors declare no competing interest.
