# On robust regression with high-dimensional predictors

Contributed by Peter J. Bickel, April 25, 2013 (sent for review March 1, 2012)


## Abstract

We study regression *M*-estimates in the setting where *p*, the number of covariates, and *n*, the number of observations, are both large, but $p/n \to \kappa \in (0,1)$. We find an exact stochastic representation for the distribution of $\hat{\beta} = \operatorname{argmin}_{\beta} \sum_{i=1}^{n} \rho(Y_i - X_i'\beta)$ at fixed *p* and *n* under various assumptions on the objective function *ρ* and our statistical model. A scalar random variable, $\|\hat{\beta} - \beta_0\|$, whose deterministic limit $r_\rho(\kappa)$ can be studied when $p/n \to \kappa$, plays a central role in this representation. We discover a nonlinear system of two deterministic equations that characterizes $r_\rho(\kappa)$. Interestingly, the system shows that $r_\rho(\kappa)$ depends on *ρ* through proximal mappings of *ρ* as well as various aspects of the statistical model underlying our study. Several surprising results emerge. In particular, we show that, when $p/n$ is large enough, least squares becomes preferable to least absolute deviations for double-exponential errors.

In the “classical” period up to the 1980s, research on regression models focused on situations for which the number of covariates *p* was much smaller than *n*, the sample size. Least-squares regression (LSE) was the main fitting tool used, but its sensitivity to outliers came to the fore with the work of Tukey, Huber, Hampel, and others starting in the 1950s.

Given the model and M-estimation methods described in the Abstract, it follows from the discussion in ref. 1 (p. 170, for instance) that, if the design matrix *X* (an *n* × *p* matrix whose *i*th row is $X_i'$) is nonsingular, under various regularity conditions on *X*, *ρ*, and the [independent identically distributed (i.i.d)] errors $\epsilon_i$, $\hat{\beta}$ is asymptotically normal with mean $\beta_0$ and covariance matrix $\frac{\mathbf{E}\left[\psi^2(\epsilon)\right]}{\left(\mathbf{E}\left[\psi'(\epsilon)\right]\right)^2} (X'X)^{-1}$. Here, $\psi = \rho'$, and *ε* has the same distribution as the $\epsilon_i$'s.

It follows that, for *p* fixed, the relative efficiency of M-estimates, such as least absolute deviations (LAD), relative to LSE does not depend on the design matrix. Thus, LAD has the same advantage over LSE for heavy-tailed distributions as the median has over the mean.

In recent years, there has been great focus on the case where *p* and *n* are commensurate and large. Greatest attention has been paid to the “sparse” case, where the number of nonzero coefficients is much smaller than *n* or *p*. This has been achieved by adding an $\ell_1$-type penalty to the quadratic objective function of LSE, in the case of the Least Absolute Shrinkage and Selection Operator (LASSO). Unfortunately, these types of methods result in biased estimates of the coefficients, and statistical inference, as opposed to prediction, becomes problematic.

Huber (2) was the first to investigate the regime of large *p* ($p \to \infty$ with *n*). His results were followed up by Portnoy (3) under weaker conditions [see also Bloomfield (4)]. Huber showed that the behavior found for fixed *p* persisted in regimes such as $p^2/n \to 0$ and $p^3/n \to 0$. That is, estimates of coefficients and contrasts were asymptotically Gaussian, and relative efficiencies of methods did not depend on the design matrix. His arguments were, in part, heuristic but well confirmed by simulation. He also pointed out a surprising feature of the regime $p/n \to \kappa > 0$ for LSE: fitted values were not asymptotically Gaussian. He was unable to deal with this regime otherwise (see the discussion on p. 802 of ref. 2).

In this paper, we analyze what happens in robust regression when $p/n \to \kappa \in (0,1)$. We proceed in the manner of Huber (2), who developed heuristics that were highly plausible and buttressed by simulations. (While the paper was under review, we have managed to obtain rigorous proofs for many of our assertions. They will be presented elsewhere because they are very long and technical.) We give several results for covariates that are Gaussian or derived from Gaussian but present grounds for believing that the behavior holds much more generally, the key being concentration of certain quadratic forms involving the vectors of covariates. We also investigate the sensitivity of our results to the geometry of the design matrix. [Further results with different designs can be found in our work (5).]

We find that (*i*) estimates of coordinates and contrasts that have coefficients independent of the observed covariates continue to be unbiased and asymptotically normal; and (*ii*) as in the fixed *p* case, this happens at scale $n^{-1/2}$, at least when the minimal and maximal eigenvalues of the covariance of the predictors stay bounded away from 0 and ∞, respectively.*

These findings are obtained by (*i*) using leave-one-out perturbation arguments both for the data units and predictors; (*ii*) exhibiting a pair of master equations from which the asymptotic mean square prediction error and the correct expressions for asymptotic variances can be recovered; and (*iii*) showing that these two quantities depend in a nonlinear way on *p*/*n*, the error distribution, the design matrix, and the form of the objective function, *ρ*.

It is worth noting that our findings go against the likelihood principle that the “ideal” objective function is the negative log-density of the error distribution. For example, we show that, when *p*/*n* is large enough, it becomes preferable to use least squares rather than LAD for double-exponential errors. We illustrate this point in Fig. 1.
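This reversal is easy to see in a small simulation. The sketch below is ours, not the paper's: it uses a simple iteratively reweighted least-squares (IRLS) smoothing as a stand-in for exact LAD, Laplace (double-exponential) errors, and $p/n = 1/2$; the sizes, the seed, and the helper `lad_irls` are illustrative choices.

```python
import numpy as np

def lad_irls(X, y, n_iter=60, delta=1e-6):
    """Approximate the LAD fit by iteratively reweighted least squares.

    This is a smoothed stand-in for exact median regression: weights
    1/max(|r|, delta) mimic the |.| loss up to the smoothing delta.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        w = 1.0 / np.maximum(np.abs(r), delta)
        Xw = X * w[:, None]                     # rows of X scaled by weights
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta

rng = np.random.default_rng(0)
n, p, reps = 400, 200, 5                        # kappa = p/n = 0.5
ls_err, lad_err = [], []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    eps = rng.laplace(0.0, 1.0, size=n)         # double-exponential errors
    y = eps                                     # beta_0 = 0, as in the paper's reduction
    b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
    b_lad = lad_irls(X, y)
    ls_err.append(np.sum(b_ls ** 2))
    lad_err.append(np.sum(b_lad ** 2))
print("mean ||beta_LS||^2 :", np.mean(ls_err))
print("mean ||beta_LAD||^2:", np.mean(lad_err))
```

With these settings, the least-squares squared error hovers near $\sigma_\epsilon^2\,\kappa/(1-\kappa) = 2$; the point of the comparison is that, at this value of $\kappa$, LAD no longer enjoys its classical advantage for double-exponential errors.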

*Main Results* contains a detailed presentation of our results. We give some examples and supporting simulations in *Examples*. We present our derivation in the last section.

## Main Results

We consider the following robust regression problem: let

$$\hat{\beta} = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho\!\left(Y_i - X_i'\beta\right). \tag{1}$$

Here, $Y_i = \epsilon_i + X_i'\beta_0$, where $\beta_0 \in \mathbb{R}^p$ and $\epsilon_i$ is a random (scalar) error independent of the vector $X_i \in \mathbb{R}^p$. *ρ* is a convex function. We assume that the pairs $(X_i, \epsilon_i)$, $1 \leq i \leq n$, are independent of one another. Furthermore, we assume that the $X_i$'s are independent of the $\epsilon_i$'s. Our aim is to characterize the distribution of $\hat{\beta}$. As we will discuss later, our approach is not limited to this “standard” robust regression setting: we can, for instance, shed light on similar questions in weighted regression.

The following lemma is easily shown by using the rotational invariance of the Gaussian distribution (*SI Text*).

### Lemma 1.

*Suppose that* $X_i = \lambda_i \mathcal{X}_i$, *where the* $\mathcal{X}_i$'*s are n i.i.d* $\mathcal{N}(0, \Sigma)$, *with Σ of rank p, and the* $\lambda_i$'*s are (nonzero) scalars, independent of the* $\mathcal{X}_i$'*s. Let* $\hat{\beta}$ *be the solution*^{†} *of Eq.* **1**. *When* $\beta_0 = 0$, *we have the stochastic representation*

$$\hat{\beta} = \left\|\Sigma^{1/2}\hat{\beta}\right\| \Sigma^{-1/2} u,$$

*where u is uniform on the sphere of radius* 1 *in* $\mathbb{R}^p$ *and is independent of* $\left\|\Sigma^{1/2}\hat{\beta}\right\|$. *Furthermore*, $\hat{\beta}(\beta_0) = \beta_0 + \hat{\beta}(0)$.

*In light of this result, it is clear that we just need to understand the distribution of* $\left\|\Sigma^{1/2}\hat{\beta}\right\|$ *to understand that of* $\hat{\beta}$.

### Result 1.

*Suppose that ρ is a nonlinear convex function. Let* $\kappa = p/n \in (0,1)$. *Assume that* $X_i = \lambda_i \mathcal{X}_i$, *where the* $\mathcal{X}_i$'*s are i.i.d* $\mathcal{N}(0, \mathrm{Id}_p)$ *and the* $\lambda_i$'*s are nonzero scalars, independent of the* $\mathcal{X}_i$'*s. Assume also that* $\beta_0 = 0$ *(i.e.,* $Y_i = \epsilon_i$*) and that the* $\epsilon_i$'*s are independent of the* $\mathcal{X}_i$'*s.*

*Then, under regularity conditions on the* $\lambda_i$'*s, the* $\epsilon_i$'*s, and ρ*, $\|\hat{\beta}\|$ *has a deterministic limit in probability as p and n tend to infinity while* $p/n \to \kappa$. *We call this limit* $r_\rho(\kappa)$.

*Let us call* $\hat{z}_\epsilon = \epsilon + r_\rho(\kappa)\,\lambda Z$, *where* $Z \sim \mathcal{N}(0,1)$ *is independent of ε and λ, which are i.i.d copies of* $\epsilon_i$ *and* $\lambda_i$. *We can determine* $r_\rho(\kappa)$ *through solving the following system* (**S1**):

$$\begin{cases} \mathbf{E}\left[\left(\operatorname{prox}\!\left(c\,\lambda^2\rho\right)\right)'\!\left(\hat{z}_\epsilon\right)\right] = 1 - \kappa, \\[4pt] \mathbf{E}\left[\lambda^{-2}\left(\hat{z}_\epsilon - \operatorname{prox}\!\left(c\,\lambda^2\rho\right)\!\left(\hat{z}_\epsilon\right)\right)^2\right] = \kappa\, r_\rho^2(\kappa), \end{cases}$$

*where c is a positive deterministic constant to be determined from the above system. (The expectations above are taken with respect to the joint distribution of ε, λ, and Z.)*

*The prox abbreviation refers to the proximal mapping, which is standard in convex optimization (see ref.* 6*). One of its definitions is* $\operatorname{prox}(\rho)(x) = \operatorname{argmin}_{y \in \mathbb{R}} \left[\rho(y) + \frac{(x-y)^2}{2}\right]$.

### Corollary 1 (Important Special Case).

*When* $\lambda_i = 1$ *for all i*, $\Sigma = \mathrm{Id}_p$, *and the* $\epsilon_i$'*s are i.i.d, the same conclusions hold but the system characterizing* $r_\rho(\kappa)$ *becomes the following: if* $\hat{z}_\epsilon = \epsilon + r_\rho(\kappa) Z$, *where ε has the same distribution as* $\epsilon_i$ *and is independent of* $Z \sim \mathcal{N}(0,1)$,

$$\textbf{(S2)}\qquad \begin{cases} \mathbf{E}\left[\left(\operatorname{prox}(c\rho)\right)'\!\left(\hat{z}_\epsilon\right)\right] = 1 - \kappa, \\[4pt] \mathbf{E}\left[\left(\hat{z}_\epsilon - \operatorname{prox}(c\rho)\!\left(\hat{z}_\epsilon\right)\right)^2\right] = \kappa\, r_\rho^2(\kappa). \end{cases}$$
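For least squares, the system can be checked end to end: with $\rho(x) = x^2/2$, $\operatorname{prox}(c\rho)(x) = x/(1+c)$, the first equation forces $c = \kappa/(1-\kappa)$, and the second then gives $r^2 = \sigma_\epsilon^2\,\kappa/(1-\kappa)$. The sketch below is ours (Monte Carlo expectations and SciPy's `fsolve` are implementation choices, not part of the paper); it solves the two equations numerically and recovers the closed form.

```python
import numpy as np
from scipy.optimize import fsolve

# Least squares: rho(x) = x^2/2, so prox(c rho)(x) = x/(1+c) and
# (prox(c rho))'(x) = 1/(1+c).  The two-equation system then reads
#   1/(1+c) = 1 - kappa
#   (c/(1+c))^2 E[z^2] = kappa r^2,   with z = eps + r Z.
kappa, sigma = 0.5, 1.0
rng = np.random.default_rng(1)
eps = rng.normal(0.0, sigma, size=200_000)   # error draws (Gaussian here)
Z = rng.standard_normal(200_000)             # common random numbers for smoothness

def system(vars):
    c, r = vars
    z = eps + r * Z
    eq1 = 1.0 / (1.0 + c) - (1.0 - kappa)
    eq2 = (c / (1.0 + c)) ** 2 * np.mean(z ** 2) - kappa * r ** 2
    return [eq1, eq2]

c_hat, r_hat = fsolve(system, x0=[1.5, 0.8])
r_closed = np.sqrt(kappa / (1.0 - kappa)) * sigma   # closed form for LSE
print(c_hat, r_hat, r_closed)
```

With $\kappa = 0.5$ this returns $c \approx 1$ and $r$ close to the closed-form value, up to Monte Carlo error.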

The asymptotic normality [in the finite-dimensional (fidi) convergence sense] and unbiasedness of $\hat{\beta}$ is a consequence of Result 1 and Lemma 1. Note the complicated interaction of *ρ*, the distribution of *ε*, the distribution of the $\lambda_i$'s, and *κ* in determining $r_\rho(\kappa)$. In the *p* fixed case ($\kappa = 0$), the contribution of the design is just $(X'X)^{-1}$, which determines the correlation matrix of $\hat{\beta}$. In general, the correlation structure is the same but the variances also depend on the design.

As our arguments will show, we expect that the results concerning $r_\rho(\kappa)$ detailed in Result 1 will hold when the assumptions of normality on the $\mathcal{X}_i$'s are replaced by assumptions on concentration of quadratic forms in the $\mathcal{X}_i$'s. Results on fidi convergence of $\hat{\beta}$ also appear likely to hold under these weakened restrictions.

The difference between the systems of equations characterizing $r_\rho(\kappa)$ in Result 1 and Corollary 1 highlights the importance of the geometry of the predictors, $X_i$, in the results. As a matter of fact, if we consider the case where $\Sigma = \mathrm{Id}_p$ and the $\lambda_i$'s are i.i.d with $\mathbf{E}\left[\lambda_i^2\right] = 1$, in both situations the $X_i$'s have covariance $\mathrm{Id}_p$ and are nearly orthogonal to one another; however, in the setting of Result 1, $\|X_i\|^2/p$ is close to $\lambda_i^2$ with high probability (hence variable with *i*), whereas in the setting of Corollary 1, the $X_i$'s all have almost the same norm and hence are near a sphere. The importance of the geometry of the vectors of predictors in this situation is hence a generalization of similar phenomena that were highlighted in ref. 7, for instance. Further examples of a different nature, detailed in ref. 5 (p. 27), illustrate the fundamental importance of our implicit geometric assumptions on the design matrix.

Our analysis also extends to the case where *ρ* is replaced by $\rho_i$, where for instance $\rho_i(x) = w_i\,\rho(x)$ (the weighted regression case), as long as the $w_i$'s are independent of the $X_i$'s. (One simply needs to replace *ρ* by $\rho_i$ in the system **S1** above and take expectation with respect to these quantities, too.) We refer the interested reader to ref. 5 (p. 26).

## Examples

We illustrate the quality of our results on a few numerical examples, showing the importance of both the objective function and the distribution of the errors in the behavior of $r_\rho(\kappa)$. For simplicity, we focus only on the case where $\lambda_i = 1$ for all *i*, i.e., the case of Gaussian predictors (an example with random $\lambda_i$'s is in *SI Text*). We also assume that the $\epsilon_i$'s are i.i.d.

### Least Squares.

In this case, $\rho(x) = x^2/2$ and $\psi(x) = x$. Hence, $\operatorname{prox}(c\rho)(x) = x/(1+c)$. Elementary computations then show that $c = \kappa/(1-\kappa)$. We also find that $r_\rho^2(\kappa) = \sigma_\epsilon^2\,\kappa/(1-\kappa)$, where $\sigma_\epsilon^2$ is the variance of *ε*. Naturally, in the case of least squares, one can use results concerning the Wishart distribution (8) as well as the explicit form of $\hat{\beta} = (X'X)^{-1}X'Y$ to verify mathematically that this expression is correct. We also note that, in this case, the distribution of *ε* does not matter, only its variance.
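A quick simulation is consistent with this expression; the sizes, error scale, and seed below are illustrative choices of ours, not the paper's.

```python
import numpy as np

# Check r^2 = sigma^2 * kappa / (1 - kappa) for least squares by simulation
# (beta_0 = 0, Gaussian design, kappa = 0.5).
rng = np.random.default_rng(2)
n, p, sigma, reps = 500, 250, 1.0, 10
sq_norms = []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.normal(0.0, sigma, size=n)    # Y_i = eps_i since beta_0 = 0
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sq_norms.append(np.sum(beta_hat ** 2))
kappa = p / n
predicted = sigma ** 2 * kappa / (1.0 - kappa)   # = 1.0 here
print("simulated E||beta_hat||^2:", np.mean(sq_norms))
print("predicted r^2            :", predicted)
```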

### Median Regression (LAD).

This case, where *ρ* takes values $\rho(x) = |x|$, is substantially more interesting and reveals the importance of the interaction between objective function and error distribution. Clearly, we first have to compute the prox of the function *ρ*. It is well known and not difficult to show that this prox is the soft-thresholding function. More formally, using the notation $x_+ = \max(x, 0)$, we have, for any $c > 0$, $\operatorname{prox}(c\rho)(x) = \eta_c(x) = \operatorname{sign}(x)\left(|x| - c\right)_+$. In this subsection, we use the notation $r(\kappa)$ instead of $r_\rho(\kappa)$.
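That the prox of $c\,|\cdot|$ is soft thresholding can be checked directly by minimizing the defining objective $c|z| + (x-z)^2/2$; the helper names below are ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_numeric(x, c):
    """Numerically minimize c*|z| + (x - z)^2 / 2 over z."""
    res = minimize_scalar(lambda z: c * abs(z) + 0.5 * (x - z) ** 2,
                          bounds=(-10.0, 10.0), method="bounded")
    return res.x

def soft_threshold(x, c):
    """Closed form: sign(x) * max(|x| - c, 0)."""
    return np.sign(x) * max(abs(x) - c, 0.0)

for x in (-3.2, -0.4, 0.0, 0.7, 2.5):
    assert abs(prox_numeric(x, 1.0) - soft_threshold(x, 1.0)) < 1e-3
print("prox of c|.| matches soft-thresholding")
```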

#### Case of Gaussian errors.

Let $s^2 = 1 + r^2(\kappa)$. When the $\epsilon_i$'s are i.i.d $\mathcal{N}(0,1)$, $\hat{z}_\epsilon \sim \mathcal{N}(0, s^2)$. The first equation of our system **S2** therefore becomes $2\bar{\Phi}(c/s) = 1 - \kappa$, where $\bar{\Phi} = 1 - \Phi$ is the standard normal survival function. Hence, $c = s\,\Phi^{-1}\!\left(\frac{1+\kappa}{2}\right)$, where $\Phi^{-1}$ is the quantile function for the standard normal distribution.
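For a centered Gaussian $\hat{z}_\epsilon$ with standard deviation *s*, the condition $P(|\hat{z}_\epsilon| > c) = 1 - \kappa$ is solved by $c = s\,\Phi^{-1}((1+\kappa)/2)$, which can be verified in one line with `scipy.stats.norm` (the values of *κ* and *s* below are arbitrary illustrations):

```python
from scipy.stats import norm

# c = s * Phi^{-1}((1 + kappa)/2) should give P(|z| > c) = 1 - kappa
# for z ~ N(0, s^2).
kappa, s = 0.3, 1.7                      # illustrative values
c = s * norm.ppf((1.0 + kappa) / 2.0)
tail = 2.0 * (1.0 - norm.cdf(c / s))     # P(|z| > c)
print(c, tail)
```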

We now turn our attention to the second equation in the system **S2**. We have $\hat{z}_\epsilon - \eta_c(\hat{z}_\epsilon) = \operatorname{sign}(\hat{z}_\epsilon)\min\!\left(|\hat{z}_\epsilon|, c\right)$. Using the fact that $\hat{z}_\epsilon \sim \mathcal{N}(0, s^2)$, computations show that the second equation in the system **S2** becomes

$$\mathbf{E}\left[\min\!\left(|\hat{z}_\epsilon|, c\right)^2\right] = \kappa\, r^2(\kappa).$$

Writing this expectation explicitly in terms of the standard normal cdf Φ and the standard normal density *φ*, further manipulations show that we can solve for *s* as a function of *κ* and therefore obtain a closed-form expression for $r(\kappa)$ when the $\epsilon_i$'s are i.i.d $\mathcal{N}(0,1)$. Fig. 2 compares this expression for $r(\kappa)$ to the values obtained by simulations. The comparison is done by computing relative errors. (Fig. S1 compares the actual values, which are also of interest.)

#### Case of errors with symmetric distribution.

We call $f_\epsilon$ the density of *ε* and drop the dependence of *r* on *ρ* and *κ* from our notations for simplicity. The first equation of system **S2** still reads $P(|\hat{z}_\epsilon| > c) = 1 - \kappa$. Let us call $F_{\hat{z}}$ the cumulative distribution function (cdf) of $\hat{z}_\epsilon$, and let us denote by $F_{\hat{z}}^{-1}$ the functional inverse of $F_{\hat{z}}$.

Integration by parts, symmetry of the distribution of $\hat{z}_\epsilon$, as well as the above characterization of *c* finally lead to an implicit characterization of $r(\kappa)$ (denoted simply by *r* for short), which we refer to as Eq. **2**. We note in passing that this equation can be rewritten in a form that is convenient to work with numerically when *κ* is small.

#### Case of double-exponential errors.

We now present a comparison of simulation results to numerical solutions of system **S2** when the errors are double exponential. It should be noted that, in this case, the cdf takes values $F_\epsilon(x) = \frac{1}{2}e^{x}$ for $x < 0$ and $F_\epsilon(x) = 1 - \frac{1}{2}e^{-x}$ for $x \geq 0$. It is also clear in this case that the density is $f_\epsilon(x) = \frac{1}{2}e^{-|x|}$. We used all this information to solve Eq. **2** for *r* by doing a dichotomous (bisection) search. Fig. 3 illustrates our results by showing the relative errors between the values computed from simulations and numerical solutions of system **S2** with appropriate parameters. (Fig. S2 compares the actual values, which are also of interest.)

### Other Objective Functions.

We have carried out similar computations and validations of results for other objective functions, including the Huber objective functions, the objective functions appearing in quantile regression, as well as two further power-type objective functions (the latter two more for their analytical tractability than for their statistical interest). We refer the reader to ref. 5 for details.

### Further Remarks.

The characterization of $r_\rho(\kappa)$ allows us to compare the performance of various regression methods for various error distributions. One mathematical and statistical consequence is that we can optimize over *ρ* to minimize $r_\rho(\kappa)$ when the distribution of the errors is given and log-concave and we are in the setup of Gaussian predictors. We have done this in the companion paper (9).

Quite independently, we can investigate the performance of, say, median regression vs. least squares for a range of values of *κ*. In the case of double-exponential errors, it is well known (see, e.g., ref. 1) that median regression is twice as efficient as least squares when *κ* is close to 0. As our simulations and computations illustrate, this is not the case when *κ* is not close to zero. Indeed, when *κ* is larger than roughly 0.3, least squares outperforms median regression for double-exponential errors. This should serve as caution against using “natural” maximum-likelihood methods in high dimension, since they turn out to be suboptimal even in apparently favorable situations.

## Derivation

We now turn our attention to the derivation of the system of equations **S1** presented in *Result 1*.

Our approach hinges on a “double leave-one-out” approach, the use of concentration properties of certain quadratic forms, and the Sherman–Morrison–Woodbury formula of linear algebra.

We focus on the case $\beta_0 = 0$ and $\Sigma = \mathrm{Id}_p$. *Lemma 1* guarantees that we can do so without loss of generality. Note that, in this case, $Y_i = \epsilon_i$. We call $\hat{\beta}(\rho; X, \epsilon)$ simply $\hat{\beta}$ from now on. We also assume that *ρ* has two derivatives.

We call $R_i = Y_i - X_i'\hat{\beta}$ the residuals and use the notation $\psi = \rho'$. Recall that $Y_i = \epsilon_i$ here. We note that under our assumptions $\hat{\beta}$ satisfies the gradient equation

$$\sum_{i=1}^{n} \psi(R_i)\, X_i = 0. \tag{3}$$

In the derivations that follow, we will use repeatedly the fact that if the $X_i$'s are i.i.d $\mathcal{N}(0, \mathrm{Id}_p)$ and $(A_n)$ is a sequence of deterministic symmetric matrices, under mild conditions on the growth of $\operatorname{tr}(A_n)$ with *n*, we have as *n* and *p* grow

$$\frac{1}{n}\, X_i' A_n X_i \simeq \frac{1}{n} \operatorname{tr}(A_n). \tag{4}$$

Many methods can be used to show this concentration result. A particularly simple one is to compute the second and fourth cumulants of $X_i' A_n X_i$. It shows that the result holds as soon as $\operatorname{tr}\!\left(A_n^2\right) = o(n^2)$, a mild condition. This concentration result is easily extended to the case where $A_n$ is random but independent of $X_i$. The previous result also extends easily to $X_i = \lambda_i \mathcal{X}_i$, under mild conditions on the $\lambda_i$'s, to yield the following:

$$\frac{1}{n}\, X_i' A_n X_i \simeq \frac{\lambda_i^2}{n} \operatorname{tr}(A_n).$$
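The quadratic-form concentration used repeatedly above is cheap to observe numerically. In the sketch below (ours), *A* is taken diagonal and the form is normalized by its dimension purely for speed and simplicity:

```python
import numpy as np

# Concentration of X' A X around tr(A) for X ~ N(0, Id_p): the relative
# fluctuation is of order sqrt(tr(A^2)) / tr(A), small for balanced spectra.
rng = np.random.default_rng(3)
p = 5000
a = rng.uniform(0.5, 1.5, size=p)      # eigenvalues of a diagonal A
X = rng.standard_normal(p)
quad = X @ (a * X) / p                 # (1/p) X' A X
trace = a.mean()                       # (1/p) tr(A)
print(quad, trace)
```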

### Leaving Out One Observation.

Let us call $\hat{\beta}_{(i)}$ the usual leave-one-out estimator [i.e., the estimator we get by not using the pair $(X_i, Y_i)$ in our regression problem]. It solves

$$\sum_{j \neq i} \psi\!\left(Y_j - X_j'\hat{\beta}_{(i)}\right) X_j = 0. \tag{5}$$

Note that, when the pairs $(X_j, Y_j)$ are independent, $\hat{\beta}_{(i)}$ is independent of $(X_i, Y_i)$. For all *j*, $1 \leq j \leq n$, we call

$$R_{j,(i)} = Y_j - X_j'\hat{\beta}_{(i)} = \epsilon_j - X_j'\hat{\beta}_{(i)}. \tag{6}$$

When $j \neq i$, these are the residuals from this leave-one-out situation. For $j = i$, $R_{i,(i)}$ is the prediction error for observation *i*.

Intuitively, it is clear that under regularity conditions on *ρ* and the $X_i$'s, when the $\epsilon_i$'s are i.i.d, for $j \neq i$, $R_{j,(i)} \simeq R_j$ (this means statistically that leave-one-out makes sense). However, it is easy to convince oneself (by looking, e.g., at the least-squares situation) that $R_{i,(i)}$ is very different from $R_i$ in high dimension. The expansion we will get below will indeed confirm this fact in a more general setting than least squares.

Taking the difference between Eqs. **3** and **5**, we get, after using Taylor expansions for $\psi(R_j)$ around $R_{j,(i)}$ (and truncating the expansion at first order),

$$\sum_{j \neq i} \psi'\!\left(R_{j,(i)}\right) X_j X_j' \left(\hat{\beta}_{(i)} - \hat{\beta}\right) + \psi(R_i)\, X_i \simeq 0.$$

We call $S_i = \frac{1}{n} \sum_{j \neq i} \psi'\!\left(R_{j,(i)}\right) X_j X_j'$. This suggests that

$$\hat{\beta} - \hat{\beta}_{(i)} \simeq \frac{1}{n}\, S_i^{-1} X_i\, \psi(R_i). \tag{7}$$

Note that $S_i$ is independent of $X_i$. Hence, multiplying the previous expression by $X_i'$, we get, using the approximation given in Eq. **4** (which amounts to assuming that $S_i$ is “nice enough”),

$$R_{i,(i)} - R_i = X_i'\left(\hat{\beta} - \hat{\beta}_{(i)}\right) \simeq \frac{\lambda_i^2}{n} \operatorname{tr}\!\left(S_i^{-1}\right) \psi(R_i) = \lambda_i^2\, c_i\, \psi(R_i), \quad \text{with } c_i = \frac{1}{n}\operatorname{tr}\!\left(S_i^{-1}\right).$$

Experience in random matrix theory as well as the form of the matrix $S_i$ suggest that $c_i$ should have a deterministic limit (again under conditions^{‡} on *ρ*, the $\lambda_i$'s and the $\epsilon_i$'s). Then, by symmetry between the observations, all $c_i$ are approximately the same, i.e., when *p* and *n* are large, $c_i \simeq c$. Hence,

$$R_{i,(i)} \simeq R_i + c\, \lambda_i^2\, \psi(R_i). \tag{8}$$

Note that because $\hat{\beta}_{(i)}$ and $(X_i, \epsilon_i)$ are independent when the pairs $(X_j, \epsilon_j)$ are independent, much can be said about the distribution of $R_{i,(i)}$. However, at this point in the derivation, it is not clear what the value of *c* should be.

### Leaving Out One Predictor.

Let us consider what happens when we leave the *p*th predictor out. Because we are assuming that $\beta_0 = 0$ and $\Sigma = \mathrm{Id}_p$, all of the predictors play a symmetric role, so we pick the *p*th to simplify notations. There is nothing particular about it, and the same analysis can be done with any other predictor.

Let us call $\hat{\gamma} \in \mathbb{R}^{p-1}$ the corresponding optimal regression vector for the loss function *ρ*. We use the notations and partitions $X = \left[V \;\; X^{(p)}\right]$, where *V* consists of the first $p-1$ columns of *X* (with rows $V_i'$) and $X^{(p)}$ is the last column of *X*, and $\hat{\beta} = \left(\hat{\beta}_{1:(p-1)}, \hat{\beta}_p\right)$. We have $R_i = \epsilon_i - X_i'\hat{\beta}$. Naturally, $\hat{\gamma}$ satisfies

$$\sum_{i=1}^{n} \psi\!\left(\epsilon_i - V_i'\hat{\gamma}\right) V_i = 0.$$

We call

$$\tilde{r}_i = \epsilon_i - V_i'\hat{\gamma},$$

i.e., the residuals based on $p-1$ predictors. Note that $\hat{\gamma}$ is independent of $X^{(p)}$ under our assumptions [because $X^{(p)}$ is independent of $(V, \epsilon)$ and the $\epsilon_i$'s are i.i.d].

It is intuitively clear that^{§} $\hat{\beta}_p \simeq 0$ and $R_i \simeq \tilde{r}_i$, for all *i*, because adding a predictor will not help us much in estimating $\beta_0 = 0$. Hence the residuals should not be much affected by the addition of one predictor. Taking the difference of the equations defining $\hat{\beta}$ (Eq. **3**) and $\hat{\gamma}$, we get the following:

$$\sum_{i=1}^{n} \psi(R_i)\, X_i - \sum_{i=1}^{n} \psi(\tilde{r}_i) \begin{pmatrix} V_i \\ 0 \end{pmatrix} = 0.$$

This *p*-dimensional equation separates into a scalar and a vector equation, namely,

$$\sum_{i=1}^{n} \psi(R_i)\, X_i^{(p)} = 0 \quad \text{and} \quad \sum_{i=1}^{n} \left[\psi(R_i) - \psi(\tilde{r}_i)\right] V_i = 0.$$

Using a first-order Taylor expansion of $\psi(R_i)$ around $\tilde{r}_i$ and noting that $R_i - \tilde{r}_i = -X_i^{(p)}\hat{\beta}_p - V_i'\left(\hat{\beta}_{1:(p-1)} - \hat{\gamma}\right)$, we can transform the first equation above into

$$\sum_{i=1}^{n} \psi(\tilde{r}_i)\, X_i^{(p)} \simeq \sum_{i=1}^{n} \psi'(\tilde{r}_i)\, X_i^{(p)} \left[X_i^{(p)}\hat{\beta}_p + V_i'\left(\hat{\beta}_{1:(p-1)} - \hat{\gamma}\right)\right].$$

This gives the near identity we will exploit for $\hat{\beta}_p$. Working similarly on the equations involving $V_i$, we get

$$\sum_{i=1}^{n} \psi'(\tilde{r}_i)\, V_i \left[X_i^{(p)}\hat{\beta}_p + V_i'\left(\hat{\beta}_{1:(p-1)} - \hat{\gamma}\right)\right] \simeq 0.$$

Because the matrix $\sum_i \psi'(\tilde{r}_i)\, V_i V_i'$ is invertible in general, the previous equation reads

$$\hat{\beta}_{1:(p-1)} - \hat{\gamma} \simeq -\left(V'DV\right)^{-1} V' D X^{(p)}\, \hat{\beta}_p,$$

where *D* is the diagonal matrix with $D_{ii} = \psi'(\tilde{r}_i)$. Calling

$$N_n = \left(X^{(p)}\right)' D X^{(p)} - \left(X^{(p)}\right)' D V \left(V'DV\right)^{-1} V' D X^{(p)},$$

we see that $N_n\, \hat{\beta}_p \simeq \sum_i \psi(\tilde{r}_i)\, X_i^{(p)}$. Using this approximation in the previous equation for $\hat{\beta}_p$, we have finally

$$\hat{\beta}_p \simeq \frac{\sum_{i=1}^{n} \psi(\tilde{r}_i)\, X_i^{(p)}}{N_n}.$$

### Approximation of This Denominator.

Let us write the denominator in matrix form: with *D* the diagonal matrix such that $D_{ii} = \psi'(\tilde{r}_i)$, let $H = D^{1/2} V \left(V'DV\right)^{-1} V' D^{1/2}$. Note that *H* is a projection matrix of rank *p* − 1 in general.

Let us call $N_n/n$ the denominator of $\hat{\beta}_p$ divided by *n*. We have

$$\frac{N_n}{n} = \frac{1}{n} \left(X^{(p)}\right)' D^{1/2} \left(\mathrm{Id}_n - H\right) D^{1/2} X^{(p)}.$$

Let us call $\tilde{S} = \frac{1}{n} V' D V$. Using the Sherman–Morrison–Woodbury formula (see ref. 10, p. 19, and *SI Text*), we see that

$$H_{ii} \simeq \frac{\psi'(\tilde{r}_i)\, V_i' \tilde{S}^{-1} V_i / n}{1 + \psi'(\tilde{r}_i)\, V_i' \tilde{S}^{-1} V_i / n}.$$

We notice that $\tilde{S}$ can be approximated by a matrix which is independent of $V_i$ (by using our leave-one-out observations), for which $\frac{1}{n} V_i' \tilde{S}^{-1} V_i \simeq \frac{\lambda_i^2}{n} \operatorname{tr}\!\left(\tilde{S}^{-1}\right) \simeq \lambda_i^2\, c$ by Eq. **4** (these two approximations naturally require some regularity conditions on *ρ*, etc., so that $\tilde{S}$ is “nice enough”). Hence,

$$H_{ii} \simeq \frac{c\, \lambda_i^2\, \psi'(\tilde{r}_i)}{1 + c\, \lambda_i^2\, \psi'(\tilde{r}_i)}.$$

Because *H* is a rank $p-1$ projection matrix, we have $\operatorname{tr}(H) = \sum_{i=1}^{n} H_{ii} = p - 1$, and therefore

$$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + c\, \lambda_i^2\, \psi'(\tilde{r}_i)} \simeq 1 - \frac{p}{n}.$$

Using concentration properties of $X^{(p)}$ conditional on $(V, \epsilon)$, we have

$$\frac{N_n}{n} \simeq \frac{1}{n} \sum_{i=1}^{n} \lambda_i^2\, \psi'(\tilde{r}_i) \left(1 - H_{ii}\right).$$

Replacing $H_{ii}$ by its approximate value, we get

$$\frac{N_n}{n} \simeq \frac{1}{n} \sum_{i=1}^{n} \frac{\lambda_i^2\, \psi'(\tilde{r}_i)}{1 + c\, \lambda_i^2\, \psi'(\tilde{r}_i)} = \frac{1}{c} \left(1 - \frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + c\, \lambda_i^2\, \psi'(\tilde{r}_i)}\right) \simeq \frac{1}{c} \cdot \frac{p}{n}.$$

So finally, using again $N_n/n \simeq \kappa/c$, together with $\mathbf{E}\!\left[\hat{\beta}_p^2 \,\middle|\, V, \epsilon\right] = \sum_i \lambda_i^2\, \psi^2(\tilde{r}_i)/N_n^2$ and $r_\rho^2(\kappa) \simeq p\, \mathbf{E}\!\left[\hat{\beta}_p^2\right]$ (by symmetry among coordinates), we see that

$$\kappa\, r_\rho^2(\kappa) \simeq \frac{c^2}{n} \sum_{i=1}^{n} \lambda_i^2\, \psi^2(\tilde{r}_i), \tag{11}$$

assuming that we can take expectations in all these approximations.
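The Sherman–Morrison–Woodbury identity invoked here can be sanity-checked numerically in its rank-one (Sherman–Morrison) form; the matrix size and the shift that keeps *A* well conditioned are arbitrary choices of ours:

```python
import numpy as np

# Sherman-Morrison (rank-one Woodbury): for invertible A with 1 + v' A^{-1} u != 0,
#   (A + u v')^{-1} = A^{-1} - (A^{-1} u v' A^{-1}) / (1 + v' A^{-1} u).
rng = np.random.default_rng(4)
d = 6
A = rng.standard_normal((d, d)) + d * np.eye(d)   # diagonal shift keeps A invertible
u = rng.standard_normal(d)
v = rng.standard_normal(d)
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)
err = np.max(np.abs(lhs - rhs))
print("max abs discrepancy:", err)
```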

### From Approximations to Functional System.

Our approximations concerning the residuals $R_i$ and the prediction errors $R_{i,(i)}$ shed considerable light on them. Our focus is now on $\hat{z}_i = R_{i,(i)}$.

From Eq. **8**, we got the approximation

$$R_i + c\, \lambda_i^2\, \psi(R_i) \simeq R_{i,(i)}.$$

Recall that, for a convex, proper, and closed function *ρ*, whose subdifferential we call *ψ*, and $c > 0$, $\operatorname{prox}(c\rho)(x) + c\, \psi\!\left(\operatorname{prox}(c\rho)(x)\right) = x$. It is an important fact that the prox is indeed a function and not a multivalued mapping, even when *ρ* is not differentiable everywhere. We therefore get the approximation

$$R_i \simeq \operatorname{prox}\!\left(c\, \lambda_i^2\, \rho\right)\!\left(R_{i,(i)}\right).$$

Recalling Eq. **6** and using the independence of $\hat{\beta}_{(i)}$ and $(X_i, \epsilon_i)$, we have $R_{i,(i)} = \epsilon_i - X_i'\hat{\beta}_{(i)} \overset{\mathcal{L}}{=} \epsilon_i + \lambda_i \left\|\hat{\beta}_{(i)}\right\| Z_i$, where $Z_i$ is $\mathcal{N}(0,1)$ and independent of $\epsilon_i$, $\lambda_i$, and $\left\|\hat{\beta}_{(i)}\right\|$ ($\overset{\mathcal{L}}{=}$ denotes “equal in law”).

We now argue that $\|\hat{\beta}\|$ is asymptotically deterministic. Using the relationship between $\hat{\beta}$ and $\hat{\beta}_{(i)}$ in Eq. **7** and taking squared norms, we see that

$$\left\|\hat{\beta} - \hat{\beta}_{(i)}\right\|^2 \simeq \frac{\psi^2(R_i)}{n^2}\, X_i' S_i^{-2} X_i.$$

Assuming that the smallest eigenvalue of $S_i$ remains bounded away from 0, which is automatically satisfied with high probability for strongly convex functions *ρ*, we see that $\left\|\hat{\beta} - \hat{\beta}_{(i)}\right\|$ is $O_P(n^{-1/2})$, provided $\|X_i\|^2/n$ remains bounded and *ψ* and $\psi'$ do not grow too fast at infinity. Applying the Efron–Stein inequality, we see that $\operatorname{var}\!\left(\|\hat{\beta}\|\right) \to 0$ if we take squared expectations in our approximations. It follows that $\|\hat{\beta}\|$ is asymptotically deterministic.

These arguments suggest that, as *p* and *n* become large, $\hat{z}_i \overset{\mathcal{L}}{\simeq} \epsilon_i + r_\rho(\kappa)\, \lambda_i Z_i$, where $Z_i \sim \mathcal{N}(0,1)$, independent of $\epsilon_i$ and $\lambda_i$, and $r_\rho(\kappa)$ is deterministic. We also note that the $\hat{z}_i$'s are i.i.d, because the $(X_i, \epsilon_i)$'s are.

Because $c\, \lambda_i^2\, \psi(R_i) \simeq \hat{z}_i - R_i$ and $R_i \simeq \operatorname{prox}\!\left(c\lambda_i^2\rho\right)\!\left(\hat{z}_i\right)$, we see that Eq. **11** now becomes asymptotically

$$\mathbf{E}\left[\lambda^{-2}\left(\hat{z}_\epsilon - \operatorname{prox}\!\left(c\,\lambda^2\rho\right)\!\left(\hat{z}_\epsilon\right)\right)^2\right] = \kappa\, r_\rho^2(\kappa),$$

where the expectations are over the joint distribution of the *ε*'s, *λ*'s, and *Z*'s. (We note that our arguments do not depend on independence of the *ε*'s or *λ*'s, although both families of random variables need to be independent of the $\mathcal{X}_i$'s.) This is the second equation of system **S1**.

We now recall that, using the fact that the matrix above was a projection matrix, we had argued that asymptotically

$$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + c\, \lambda_i^2\, \psi'(R_i)} \simeq 1 - \kappa.$$

We observe that $\left(\operatorname{prox}(c\lambda^2\rho)\right)'(x) = \frac{1}{1 + c\lambda^2\psi'\left(\operatorname{prox}(c\lambda^2\rho)(x)\right)}$ and therefore $\frac{1}{1 + c\,\lambda_i^2\,\psi'(R_i)} \simeq \left(\operatorname{prox}\!\left(c\lambda_i^2\rho\right)\right)'\!\left(\hat{z}_i\right)$. This allows us to conclude that, under regularity conditions,

$$\mathbf{E}\left[\left(\operatorname{prox}\!\left(c\,\lambda^2\rho\right)\right)'\!\left(\hat{z}_\epsilon\right)\right] = 1 - \kappa.$$

This is the first equation of our system **S1**.

### A Note on Nondifferentiable *ρ*’s.

One of the appeals of our systems **S1** and **S2** is that they yield expressions even in the case of nondifferentiable *ρ*, because the prox is well defined. However, we derived the systems assuming smoothness of *ρ*. To get around this hurdle, one can approximate *ρ* by a family of smooth convex functions $\rho_\eta$ such that $\rho_\eta \to \rho$ as $\eta \to 0$ in an appropriate sense. Intuitively, it is quite clear that $\hat{\beta}(\rho_\eta)$ should tend to $\hat{\beta}(\rho)$ as *η* tends to 0 under appropriate regularity conditions on the $\lambda_i$'s and the $\epsilon_i$'s. We then just need to take limits in our systems to justify them for nondifferentiable *ρ*'s.

## Acknowledgments

D.B. gratefully acknowledges support from National Science Foundation (NSF) Grant DMS-0636667 (Vertical Integration of Research and Education in the Mathematical Sciences); P.J.B. gratefully acknowledges support from NSF Grant DMS-0907362; N.E.K. gratefully acknowledges support from an Alfred P. Sloan Research Fellowship and NSF Grant DMS-0847647 (Faculty Early Career Development); and B.Y. gratefully acknowledges support from NSF Grants SES-0835531 (Cyber-Enabled Discovery and Innovation), DMS-1107000, and CCF-0939370.

## Footnotes

- ↵^{1}To whom correspondence may be addressed. E-mail: nkaroui@stat.berkeley.edu or bickel@stat.berkeley.edu.

Author contributions: N.E.K., P.J.B., and B.Y. designed research; N.E.K., D.B., P.J.B., C.L., and B.Y. performed research; and N.E.K. and P.J.B. wrote the paper.

The authors declare no conflict of interest.

↵*And the vector defining the contrasts has norm bounded away from 0 and ∞.

↵^{†}We write $\hat{\beta}$ instead of $\hat{\beta}(\rho; X, \epsilon)$ for simplicity.

↵^{‡}To help with intuition, note that, in the least-squares case, $S_i = \frac{1}{n}\sum_{j \neq i} X_j X_j'$, a sample covariance matrix multiplied by $\frac{n-1}{n}$.

↵^{§}Under regularity conditions on *ρ*, the $\lambda_i$'s and the $\epsilon_i$'s.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1307842110/-/DCSupplemental.

## References

- ↵Huber PJ, Ronchetti EM (2009) *Robust Statistics*. Wiley Series in Probability and Statistics (Wiley, Hoboken, NJ), 2nd Ed.
- ↵Huber PJ (1973) Robust regression: Asymptotics, conjectures and Monte Carlo. *Ann Stat* 1:799–821.
- ↵Portnoy S (1984) Asymptotic behavior of *M*-estimators of *p* regression parameters when *p*²/*n* is large. I. Consistency. *Ann Stat* 12(4):1298–1309.
- ↵Bloomfield P (1974) On the distribution of the residuals from a fitted linear model (Department of Statistics, Princeton Univ, Princeton, NJ), Technical Report 56, Series 2.
- ↵El Karoui N, Bean D, Bickel P, Lim C, Yu B (2012) On robust regression with high-dimensional predictors (Department of Statistics, Univ of California, Berkeley, CA), Technical Report 811.
- ↵Moreau J-J (1965) Proximité et dualité dans un espace hilbertien. *Bull Soc Math France* 93:273–299. French.
- ↵
- ↵Eaton ML (1983) *Multivariate Statistics: A Vector Space Approach* (Wiley, New York); reprinted (2007) Institute of Mathematical Statistics Lecture Notes—Monograph Series (Institute of Mathematical Statistics, Beachwood, OH), Vol 53.
- ↵Bean D, Bickel PJ, El Karoui N, Yu B (2013) Optimal M-estimation in high-dimensional regression. *Proc Natl Acad Sci USA* 110:14563–14568.
- ↵Horn RA, Johnson CR (1994) *Topics in Matrix Analysis* (Cambridge Univ Press, Cambridge, UK), corrected reprint of the 1991 original.
