Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ^{1} minimization

Contributed by David L. Donoho
Abstract
Given a dictionary D = {d_{k}} of vectors d_{k}, we seek to represent a signal S as a linear combination S = ∑_{k} γ(k)d_{k}, with scalar coefficients γ(k). In particular, we aim for the sparsest representation possible. In general, this requires a combinatorial optimization process. Previous work considered the special case where D is an overcomplete system consisting of exactly two orthobases and has shown that, under a condition of mutual incoherence of the two bases, and assuming that S has a sufficiently sparse representation, this representation is unique and can be found by solving a convex optimization problem: specifically, minimizing the ℓ^{1} norm of the coefficients γ̱. In this article, we obtain parallel results in a more general setting, where the dictionary D can arise from two or several bases, frames, or even less structured systems. We sketch three applications: separating linear features from planar ones in 3D data, noncooperative multiuser encoding, and identification of overcomplete independent component models.
Workers throughout engineering and the applied sciences frequently want to represent data (signals, images) in the most parsimonious terms. In signal analysis specifically, they often consider models proposing that the signal of interest is sparse in some transform domain, such as the wavelet or Fourier domain (1). However, there is a growing realization that many signals are mixtures of diverse phenomena, and no single transform can be expected to describe them well; instead, we should consider models making sparse combinations of generating elements from several different transforms (1–4). Unfortunately, as soon as we start considering general collections of generating elements, the attempt to find sparse solutions enters mostly uncharted territory, and one expects at best to use plausible heuristic methods (5–8) and certainly to give up hope of rigorous optimality. In this article, we will develop some rigorous results showing that it can be possible to find optimally sparse representations by efficient techniques in certain cases.
Suppose we are given a dictionary D of generating elements {d_k}, each one a vector in C^N, which we assume normalized: ∥d_k∥_2 = 1. The dictionary D can be viewed as a matrix of size N × L, with generating elements for columns. We do not suppose any fixed relationship between N and L. In particular, the dictionary can be overcomplete, can contain linearly dependent subsets, and need not be a basis. As examples of such dictionaries, we can mention: wavelet packets and cosine packets dictionaries of Coifman et al. (3), which contain L = N log(N) elements, representing transient harmonic phenomena with a variety of durations and locations; wavelet frames, such as the directional wavelet frames of Ron and Shen (9), which contain L = CN elements for various constants C > 1; and the combined ridgelet/wavelet systems of Starck, Candès, and Donoho (8, 10). Faced with such variety, we cannot call individual elements in the dictionary basis elements; we will use the term atom instead; see refs. 2 and 11 for elaboration on the atoms/dictionary terminology and other examples.
Given a signal S ∈ C^{N}, we seek the sparsest coefficient vector γ̱ in C^{L} such that Dγ̱ = S. Formally, we aim to solve the optimization problem

(P_0): min ∥γ̱∥_0 subject to Dγ̱ = S. [1]

Here the ℓ^0 norm ∥γ̱∥_0 is simply the number of nonzeros in γ̱. While this problem is easily solved if there happens to be a unique solution to Dγ̱ = S among general (not necessarily sparse) vectors γ̱, such uniqueness does not hold in any of our applications; in fact, these mostly involve L ≫ N, where such uniqueness is, of course, impossible. In general, solution of (P_0) requires enumerating subsets of the dictionary looking for the smallest subset able to represent the signal; of course, the complexity of such a subset search grows exponentially with L.
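To make the exponential cost of this subset search concrete, here is a minimal brute-force sketch (our own illustration, not part of the original development; the toy dictionary is an arbitrary choice): it enumerates supports of increasing size, so its cost grows exponentially in L.

```python
import itertools
import numpy as np

def solve_p0(D, S, tol=1e-10):
    """Brute-force (P_0): enumerate supports of growing size and return
    the first coefficient vector gamma with D @ gamma = S exactly."""
    N, L = D.shape
    for k in range(1, L + 1):
        for support in itertools.combinations(range(L), k):
            sub = D[:, support]
            coef, *_ = np.linalg.lstsq(sub, S, rcond=None)
            if np.linalg.norm(sub @ coef - S) < tol:
                gamma = np.zeros(L)
                gamma[list(support)] = coef
                return gamma
    return None

# toy overcomplete dictionary: 3 spikes plus one extra atom
D = np.hstack([np.eye(3), np.ones((3, 1)) / np.sqrt(3)])
S = 2.0 * D[:, 3]                 # signal built from a single atom
gamma = solve_p0(D, S)
print(np.count_nonzero(gamma))    # -> 1
```

Already for this 3 × 4 dictionary the search visits up to 2^4 − 1 supports; the count doubles with each added atom.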
It is now known that in several interesting dictionaries, highly sparse solutions can be obtained by convex optimization; this has been found empirically (2, 12) and theoretically (13–15). Consider replacing the ℓ^0 norm in (P_0) by the ℓ^1 norm, getting the minimization problem

(P_1): min ∥γ̱∥_1 subject to Dγ̱ = S. [2]

This can be viewed as a kind of convexification of (P_0). For example, over the set of signals that can be generated with coefficient magnitudes bounded by 1, the ℓ^1 norm is the largest convex function less than the ℓ^0 norm; (P_1) is in a sense the closest convex optimization problem to (P_0).
Convex optimization problems are extremely well studied (16), and this study has led to a substantial body of algorithmic knowledge and high-quality software. In fact, (P_1) can be cast as a linear programming problem and solved by modern interior point methods (2), even for very large N and L.
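As a sketch of this reduction to linear programming (SciPy's `linprog` is assumed as the solver, and the toy dictionary is our own choice; neither is named in the text): write γ̱ = u − v with u, v ≥ 0, so that ∥γ̱∥_1 = ∑(u + v) and the constraint Dγ̱ = S becomes linear.

```python
import numpy as np
from scipy.optimize import linprog

def solve_p1(D, S):
    """(P_1) as a linear program: min sum(u + v) s.t. D(u - v) = S, u, v >= 0."""
    N, L = D.shape
    res = linprog(c=np.ones(2 * L),
                  A_eq=np.hstack([D, -D]), b_eq=S,
                  bounds=[(0, None)] * (2 * L))
    return res.x[:L] - res.x[L:]

# spikes plus one extra atom; the signal is a multiple of that atom
D = np.hstack([np.eye(3), np.ones((3, 1)) / np.sqrt(3)])
S = 2.0 * D[:, 3]
gamma = solve_p1(D, S)   # l1 cost 2 via the atom vs. 2*sqrt(3) via spikes,
print(gamma)             # so the LP picks roughly [0, 0, 0, 2]
```

The split into nonnegative parts is the standard trick for linearizing an ℓ^1 objective; at an optimum, u and v are never both nonzero in the same coordinate.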
Given the tractability of (P_1) and the general intractability of (P_0), it is perhaps surprising that for certain dictionaries solutions to (P_1), when sufficiently sparse, are the same as the solutions to (P_0). Indeed, results in refs. 13–15 show that for certain dictionaries, if there exists a highly sparse solution to (P_0), then it is identical to the solution of (P_1); and, if we obtain the solution of (P_1) and observe that it happens to be sparse beyond a certain specific threshold, then we know (without checking) that we have also solved (P_0).
Previous work studying this phenomenon (13–15) considered dictionaries D built by concatenating two orthobases Φ and Ψ, so that L = 2N. For example, one could let Φ be the Fourier basis and Ψ be the Dirac basis, so that we are trying to represent a signal as a combination of spikes and sinusoids. In this dictionary, the latest results in ref. 15 show that if a signal is built from fewer than 0.914√N spikes and sinusoids, then this is the sparsest possible representation by spikes and sinusoids, i.e., the solution of (P_0), and it is also necessarily the solution found by (P_1).
We are aware of numerous applications where the dictionaries would not be covered by the results just mentioned. These include cases where the dictionary is overcomplete by more than a factor 2 (i.e., L > 2N), and cases where we have a collection of nonorthogonal elements, for example, a frame or other system. In this article, we consider general dictionaries D and obtain results paralleling those in refs. 13–15. We then sketch applications involving these more general dictionaries.
We address two questions for a given dictionary D and signal S: (i) uniqueness, under which conditions is a highly sparse representation necessarily the sparsest possible representation; and (ii) equivalence, under which conditions is a highly sparse solution to the (P_{0}) problem also necessarily the solution to the (P_{1}) problem?
Our analysis avoids reliance on the orthogonal subdictionary structure in refs. 13–15; our results are just as strong as those of ref. 13 while working for a much wider range of dictionaries. We base our analysis on the concept of the Spark of a matrix, the size of the smallest linearly dependent subset of its columns. We show that uniqueness follows whenever the signal S is built from fewer than Spark(D)/2 atoms. We develop two bounds on Spark that yield previously known inequalities in the two-orthobasis case but can be used more generally. The simpler bound involves the quantity M(G), the largest off-diagonal entry (in magnitude) in the Gram matrix G = D^H D. The more general bound involves μ_1(G), the size of the smallest group of off-diagonal magnitudes arising in a single row or column of G and having a sum ≥ 1. We have Spark(D) > μ_1 ≥ 1/M. We also show, in the general dictionary case, that one can determine a threshold on sparsity whereby every sufficiently sparse representation of S must be the unique minimal ℓ^1 representation; our thresholds involve M(G) and a quantity μ_{1/2}(G) related to μ_1.
We sketch three applications of our results.
Separating Points, Lines, and Planes: Suppose we have a data vector S representing 3D volumetric data. It is supposed that the volume contains some number of points, lines, and planes in superposition. Under what conditions can we correctly identify the lines and planes and the densities on each? This is a caricature of a timely problem in extragalactic astronomy (17).
Robust Multiuser Private Broadcasting: J different individuals encode information by superposition in a single signal vector S that is intended to be received by a single recipient. The encoded vector must look like random noise to anyone but the intended recipient (including the transmitters), it must be decodable perfectly, and the perfect decoding must be robust against corruption of some limited number of entries in the vector. We present a scheme that achieves these properties.
Blind Identification of Sources: Suppose that a random signal S is a superposition of independent components taken from an overcomplete dictionary D. Without assuming sparsity of this representation, we cannot in general know which atoms from D are active in S or what their statistics are. However, we show that it is possible, without any sparsity assumption, to determine the activity of the sources, exploiting higher-order statistics.
Note: After presenting our work publicly we learned of related work about to be published by R. Gribonval and M. Nielsen (18). Their work initially addresses the same question of obtaining the solution of (P_{0}) by instead solving (P_{1}), with results paralleling ours. Later, their work focuses on analysis of more special dictionaries built by concatenating several bases.
The Two-Orthobasis Case
Consider a special type of dictionary: the concatenation of two orthobases Φ and Ψ, each represented by N × N unitary matrices; thus D = [Φ, Ψ]. Let φ̱_i and ψ̱_j (1 ≤ i, j ≤ N) denote the elements in the two bases. Following ref. 13 we define the mutual incoherence between these two bases by

M = M(Φ, Ψ) = max_{i,j} |⟨φ̱_i, ψ̱_j⟩|.

It is easy to show (13, 15) that 1/√N ≤ M ≤ 1; the lower bound is attained by a basis pair such as spikes and sinusoids (13) or spikes and Hadamard–Walsh functions (15). The upper bound is attained if at least one of the vectors in Φ is also found in Ψ. A condition for uniqueness of sparse solutions to (P_0) can be stated in terms of M; the following is proved in refs. 14 and 15.
Theorem 1: Uniqueness.
A representation S = Dγ̱ is necessarily the sparsest possible if ∥γ̱∥_{0} < 1/M.
In words, having obtained a sparse representation of a signal, for example by (P_1) or by any other means, if the ℓ^0 norm of the representation is sufficiently small (<1/M), we conclude that this is also the (P_0) solution. In the best case where 1/M = √N, the sparsity requirement translates into ∥γ̱∥_0 < √N.
A condition for equivalence of the solution to (P_{1}) with that for (P_{0}) can also be stated in terms of M; see refs. 14 and 15.
Theorem 2: Equivalence.
If there exists any sparse representation of S satisfying ∥γ̱∥_0 < (√2 − 0.5)/M, then this is necessarily the unique solution of (P_1).
Note the slight gap between the criteria in the above two theorems. Recently, Feuer and Nemirovsky (19) managed to prove that the bound in the equivalence theorem is sharp, so this gap is unbridgeable.
To summarize, if we solve the (P_{1}) problem and observe that it is sufficiently sparse, we know we have obtained the solution to (P_{0}) as well. Moreover, if the signal S has sparse enough representation to begin with, (P_{1}) will find it.
These results correspond to the case of dictionaries built from two orthobases. Both theorems immediately extend in a limited way to nonorthogonal bases Φ and Ψ (13): we merely replace M by an analogous quantity M̃, based on the cross-products between the two nonorthogonal bases, in the statement of both results. However, as we may easily have M̃ ≥ 1, this is of limited value.
A Uniqueness Theorem for Arbitrary Dictionaries
We return to the setting of the Introduction; the dictionary D is a matrix of size N × L, with normalized columns {d_k}. An incoming signal vector S is to be represented by using the dictionary via S = Dγ̱. Assume that we have two different such representations, i.e.,

S = Dγ̱_1 = Dγ̱_2, γ̱_1 ≠ γ̱_2. [3]

Thus the difference of the representation vectors, δ̲ = γ̱_1 − γ̱_2, must be in the nullspace of D: D(γ̱_1 − γ̱_2) = Dδ̲ = 0. Hence some group of columns from D must be linearly dependent. To discuss the size of this group we introduce terminology.
Definition 1, Spark:
Given a matrix A we define σ = Spark(A) as the smallest possible number such that there exists a subgroup of σ columns from A that are linearly dependent.
Clearly, if there are no zero columns in A, then σ ≥ 2, with equality only if two columns from A are linearly dependent. Note that, although spark and rank are in some ways similar, they are totally different. The rank of a matrix A is defined as the maximal number of columns from A that are linearly independent, and its evaluation is a sequential process requiring L steps. Spark, on the other hand, is the minimal number of columns from A that are linearly dependent, and its computation requires a combinatorial process of complexity 2^{L} steps. The matrix A can be of full rank and yet have σ = 2. At the other extreme we get that σ ≤ Min{L, Rank(A) + 1}.
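The definition translates directly into (intractable) code; this sketch (our own illustration) performs the exhaustive subset search and exhibits the rank/Spark contrast on a full-rank matrix with σ = 3:

```python
import itertools
import numpy as np

def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A
    (exhaustive search over column subsets, exponential in L)."""
    N, L = A.shape
    for k in range(1, L + 1):
        for cols in itertools.combinations(range(L), k):
            if np.linalg.matrix_rank(A[:, cols], tol=tol) < k:
                return k          # these k columns are dependent
    return L + 1                  # all columns independent

# full rank (rank 2), yet no single column or pair is dependent: spark 3
A = np.array([[1., 0., 1.],
              [0., 1., 1.]])
print(np.linalg.matrix_rank(A), spark(A))   # -> 2 3
```

Duplicating any column of A would drop the spark to 2 while leaving the rank unchanged, illustrating how differently the two quantities behave.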
As an interesting example, let us consider a two-orthobasis case made of spikes and sinusoids. Here D = [Φ, Ψ], Φ = I_N the identity matrix of size N × N, and Ψ = F_N the discrete Fourier transform matrix of size N × N. Then, supposing N is a perfect square, using the Poisson summation formula we can show (13) that there is a group of 2√N linearly dependent vectors in this D, and so Spark(D) ≤ 2√N. As we shall see later, Spark(D) = 2√N.
We now use the Spark to bound the sparsity of nonunique representations of S. Let σ = Spark(D). Every nonuniqueness pair (γ̱_1, γ̱_2) generates δ̲ = γ̱_1 − γ̱_2 in the column nullspace, and so

∥δ̲∥_0 ≥ σ. [4]

On the other hand, the ℓ^0 pseudonorm obeys the triangle inequality ∥γ̱_1 − γ̱_2∥_0 ≤ ∥γ̱_1∥_0 + ∥γ̱_2∥_0. Thus, for the two supposed representations of S, we must have

∥γ̱_1∥_0 + ∥γ̱_2∥_0 ≥ σ. [5]

Formalizing matters:
Theorem 3: Sparsity Bound.
If a signal S has two different representations S = Dγ̱_{1} = Dγ̱_{2}, the two representations must have no less than Spark(D) nonzero entries combined.
From Eq. 5, if there exists a representation satisfying ∥γ̱_{1}∥_{0} < σ/2, then any other representation γ̱_{2} must satisfy ∥γ̱_{2}∥_{0} > σ/2, implying that γ̱_{1} is the sparsest representation. This gives:
Corollary 1: Uniqueness.
A representation S = Dγ̱ is necessarily the sparsest possible if ∥γ̱∥_{0} < Spark(D)/2.
We note that these results are sharp, essentially by definition.
Lower Bounds on Spark(D).
Let G = D^H D be the Gram matrix of D. As every entry in G is an inner product of a pair of columns from the dictionary, the diagonal entries are all 1, owing to our normalization of D's columns; the off-diagonal entries have magnitudes between 0 and 1.
Let 𝒮 = {k_1, … , k_s} be a subset of indices, and suppose the corresponding columns of D are linearly independent. Then the corresponding principal minor G_𝒮 = (G_{k_i,k_j} : i, j = 1, … , s) is positive definite (20). The converse is also true. This proves
Lemma 1.
σ ≤ Spark(D) if and only if every (σ − 1) × (σ − 1) principal minor of G is positive definite.
To apply this criterion, we use the notion of diagonal dominance (20): If a Hermitian matrix A = (A_{ij}) satisfies A_{ii} > ∑_{j≠i} |A_{ij}| for every i, then A is positive definite. Applying this criterion to all principal minors of the Gram matrix G leads immediately to a lower bound for Spark(D).
Theorem 4: LowerBound Using μ_{1}.
Given the dictionary D and its Gram matrix G = D^H D, let μ_1(G) be the smallest integer m such that at least one collection of m off-diagonal magnitudes arising within a single row or column of G sums at least to 1. Then Spark(D) > μ_1(G).
Indeed, by hypothesis, every μ_1 × μ_1 principal minor has its off-diagonal entries in any single row or column summing to less than 1. Hence every such principal minor is diagonally dominant, and hence every group of μ_1 columns from D is linearly independent.
For a second bound on Spark, define M(G) = max_{i≠j} |G_{i,j}|, giving a uniform bound on off-diagonal elements. Note that the earlier quantity M(Ψ, Φ) defined in the two-orthobasis case agrees with the new quantity. Now consider the implications of M = M(G) for μ_1(G). Because we must sum at least 1/M off-diagonal entries, each of magnitude at most M, to reach a cumulative sum equaling or exceeding 1, we always have μ_1(G) ≥ 1/M(G). This proves
Theorem 5: Lower Bound Using M.
If the Gram matrix has all off-diagonal entries ≤ M in magnitude, then

Spark(D) ≥ 1 + 1/M. [6]

Combined with our earlier observations on Spark, we have: a representation with fewer than (1 + 1/M)/2 nonzeros is necessarily maximally sparse.
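Both Gram-matrix quantities are cheap to compute, in contrast to Spark itself. A brief sketch (the spikes-plus-Fourier test case is our own illustrative choice; since G is Hermitian, scanning rows also covers columns):

```python
import numpy as np

def gram_bounds(D):
    """Return M(G) and mu_1(G) for a dictionary with unit-norm columns.
    mu_1: fewest off-diagonal magnitudes in one row of G summing to >= 1."""
    G = D.conj().T @ D
    off = np.abs(G - np.diag(np.diag(G)))
    M = off.max()
    mu1 = G.shape[0]
    for row in off:
        csum = np.cumsum(np.sort(row)[::-1])   # largest entries first
        hit = np.nonzero(csum >= 1.0)[0]
        if hit.size:
            mu1 = min(mu1, hit[0] + 1)
    return M, mu1

# spikes + discrete Fourier basis, N = 4
N = 4
F = np.exp(-2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N) / np.sqrt(N)
D = np.hstack([np.eye(N), F])
M, mu1 = gram_bounds(D)
print(M, mu1)   # M = 1/sqrt(N) = 0.5 and mu_1 = 2, so Spark(D) > 2
```

Here the true spark is 2√N = 4, so the example also displays the slack discussed in the Refinements subsection below.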
Refinements.
In general, Theorems 4 and 5 can contain slack, in two ways. First, there may be additional dictionary structure to be exploited. For example, return to the special two-orthobasis case of spikes and sinusoids: D = [Φ, Ψ], Φ = I_N and Ψ = F_N. Here M = 1/√N, and so the theorem says that Spark(D) > √N. In fact, the exact value is 2√N, so there is room to do better. This improvement follows directly from the uncertainty theorem given in refs. 14 and 15, leading to Spark(D) ≥ 2/M. The reasoning is general and applies to any two-orthobasis dictionary, giving
Theorem 6: Improved Lower Bound, Two-Orthobasis Case.
Given a twoorthobasis dictionary with mutual coherence value at most M, Spark(D) ≥ 2/M.
This bound, together with Corollary 1, gives precisely theorem 1 in refs. 14 and 15. Returning again to the spikes/sinusoids example, from M = 1/√N we have Spark(D) ≥ 2√N. On the other hand, if N is a perfect square, the spike-train example provides a linearly dependent subset of 2√N atoms, and so in fact σ = 2√N.
Second, Theorem 5 can contain slack simply because the metric information in the Gram matrix is not sufficiently expressive on linear dependence issues. We may construct examples where Spark(D) ≫ 1/M(G). Our bounds measure diagonal dominance rather than positive definiteness, and this is well known to introduce a gap in a wide variety of settings (20).
For a specific example, consider a two-basis dictionary made of spikes and exponentially decaying sinusoids (13). Thus, let Φ = I_N and, for a fixed α ∈ (0, 1), construct the nonorthogonal basis Ψ = {ψ̱_k}, where ψ_k(i) ∝ α^i exp{2πι(k/N)i}, with ι = √−1 and each column normalized. Let 𝒮_1 and 𝒮_2 be subsets of indices in each basis. Now suppose this pair of subsets generates a linear dependency in the combined representation, so that, for appropriate vectors of coefficients ω and ρ,

∑_{i∈𝒮_1} ω(i)e̱_i + ∑_{k∈𝒮_2} ρ(k)ψ̱_k = 0,

where e̱_i denotes the ith spike. Dividing the ith coordinate of this identity by α^i turns each ψ̱_k into a multiple of the corresponding Fourier vector while leaving each spike a spike; hence there would have been a linear dependency in the two-orthobasis case [I_N, F_N], with the same subsets and with coefficients ω̃(i) = α^{−i}ω(i) and ρ̃(k) proportional to ρ(k). As this row rescaling is invertible, we see that

Spark([I_N, Ψ]) = Spark([I_N, F_N]).

On the other hand, picking α close to zero, we get that the basis elements in Ψ decay so quickly that the decaying sinusoids begin to resemble spikes; thus M(I_N, Ψ) → 1 as α → 0.
This is a special case of a more general phenomenon: invariance of the Spark under row and column scaling, combined with noninvariance of the Gram matrix under row scaling. For any diagonal nonsingular matrices W_1 (N × N) and W_2 (L × L), we have

Spark(W_1DW_2) = Spark(D),

while the Gram matrix G = G(D) exhibits no such invariance; even if we force normalization so that diag(G) = I_L, we have only invariance against the action of W_2 (column scaling), but not W_1 (row scaling). In particular, while the positive definiteness of any particular minor of G(W_1D) is invariant under such row scaling, the diagonal dominance is not.
Upper Bounds on Spark.
Consider a concrete method for evaluating Spark(D). We define a sequence of optimization problems, for k = 1, … , L:

(R_k): min ∥γ̱∥_0 subject to Dγ̱ = 0, γ(k) = 1. [7]

Letting γ̱_k^0 denote the solution of problem (R_k), we get

Spark(D) = min_k ∥γ̱_k^0∥_0. [8]

The implied ℓ^0 optimization is computationally intractable. Alternatively, any sequence of vectors obeying the same constraints furnishes an upper bound. To obtain such a sequence, we replace minimization of the ℓ^0 norm by the more tractable ℓ^1 norm. Define the sequence of optimization problems for k = 1, 2, … , L:

(Q_k): min ∥γ̱∥_1 subject to Dγ̱ = 0, γ(k) = 1. [9]

(The same sequence of optimization problems was defined in ref. 13, which noted the formal identity with the definition of analytic capacities in harmonic analysis, and called them capacity problems.) The problems can in principle be tackled straightforwardly by using linear programming solvers. Denote the solution of the (Q_k) problem by γ̱_k^1. Then, clearly ∥γ̱_k^0∥_0 ≤ ∥γ̱_k^1∥_0, so

Spark(D) ≤ min_k ∥γ̱_k^1∥_0. [10]
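A sketch of this upper-bound computation (restricted to real-valued dictionaries for simplicity; SciPy's `linprog` is assumed as the LP solver, and the duplicated-atom test case is our own choice):

```python
import numpy as np
from scipy.optimize import linprog

def spark_upper_bound(D, tol=1e-8):
    """Solve each capacity problem (Q_k): min ||g||_1 s.t. Dg = 0, g_k = 1,
    as an LP with g = u - v, u, v >= 0; the sparsest feasible solution
    found upper-bounds Spark(D)."""
    N, L = D.shape
    best = L + 1
    for k in range(L):
        row_k = np.zeros(2 * L)
        row_k[k], row_k[L + k] = 1.0, -1.0           # encodes g_k = 1
        res = linprog(c=np.ones(2 * L),
                      A_eq=np.vstack([np.hstack([D, -D]), row_k]),
                      b_eq=np.concatenate([np.zeros(N), [1.0]]),
                      bounds=[(0, None)] * (2 * L))
        if res.success:                               # (Q_k) may be infeasible
            g = res.x[:L] - res.x[L:]
            best = min(best, int((np.abs(g) > tol).sum()))
    return best

# duplicated atom: Spark(D) = 2, and the bound finds it
D = np.array([[1., 0., 1.],
              [0., 1., 0.]])
print(spark_upper_bound(D))   # -> 2
```

When (Q_k) is infeasible (the kth atom participates in no nullspace vector), that k simply contributes nothing to the bound.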
An Equivalence Theorem for Arbitrary Dictionaries
We now set conditions under which, if a signal S admits a highly sparse representation in the dictionary D, the ℓ^{1} optimization problem (P_{1}) must necessarily obtain that representation.
Let γ̱_0 denote the unique sparsest representation of S, so Dγ̱_0 = S. For any other γ̱_1 providing a representation (i.e., Dγ̱_1 = S) we therefore have ∥γ̱_1∥_0 > ∥γ̱_0∥_0. To show that γ̱_0 solves (P_1), we need to show that ∥γ̱_1∥_1 ≥ ∥γ̱_0∥_1. Consider the reformulation of (P_1):

min_{γ̱_1} ∥γ̱_1∥_1 − ∥γ̱_0∥_1 subject to Dγ̱_1 = S. [11]

If the minimum is no less than zero, the ℓ^1 norm for any alternate representation is at least as large as for the sparsest representation. This in turn means that the (P_1) problem admits γ̱_0 as a solution. If the minimum is uniquely attained, then the minimizer of (P_1) is γ̱_0.
As in refs. 13 and 15, we develop a simple lower bound. Let 𝒮 denote the support of γ̱_0. Writing the difference in norms in Eq. 11 as a difference in sums, and relabeling the summations, gives

∥γ̱_1∥_1 − ∥γ̱_0∥_1 ≥ ∑_{k∉𝒮} |δ(k)| − ∑_{k∈𝒮} |δ(k)|, [12]

where we introduced the vector δ̱ = γ̱_1 − γ̱_0. Note that the right side of this display is positive only if

∑_{k∈𝒮} |δ(k)| < ½∥δ̱∥_1,

i.e., only if less than 50% of the magnitude in δ̱ is concentrated in 𝒮. This motivates a constrained optimization problem:

(C_𝒮): val(C_𝒮) = max_δ̱ ∑_{k∈𝒮} |δ(k)| subject to Dδ̱ = 0, ∥δ̱∥_1 = 1. [13]

This new problem is closely related to Eq. 11 and hence also to (P_1). It has this interpretation: if val(C_𝒮) < ½, then every direction of movement away from γ̱_0 that remains a representation of S (i.e., stays in the nullspace of D) causes a definite increase in the objective function for Eq. 11. Recording this formally,
Lemma 2: Less than 50% Concentration Implies Equivalence.
If supp(γ̱_{0}) ⊂ 𝒮, and val(C_{𝒮}) < ½, then γ̱_{0} is the unique solution of (P_{1}) from data S = Dγ̱_{0}.
On the other hand, if val(C_{𝒮}) ≥ ½, it can happen that γ̱_{0} is not a solution to (P_{1}).
Lemma 3.
If val(C_{𝒮}) ≥ ½, there exists a vector γ̱_{0} supported in 𝒮, and a corresponding signal S = Dγ̱_{0}, such that γ̱_{0} is not a unique minimizer of (P_{1}).
Proof:
Given a solution δ̱* to problem (C_𝒮) with value ≥ ½, pick any γ̱_0 supported in 𝒮 with sign pattern arranged so that γ̱_0(k)δ*(k) < 0 for k ∈ 𝒮. Then consider γ̱_1 of the general form γ̱_1 = γ̱_0 + tδ̱*, for small t > 0. Equality must hold in Eq. 12 for sufficiently small t, showing that ∥γ̱_1∥_1 ≤ ∥γ̱_0∥_1, and so γ̱_0 is not the unique solution of (P_1) when S = Dγ̱_0.
We now bring the Spark into the picture.
Lemma 4.
Let σ = Spark(D). There is a subset 𝒮* with #(𝒮*) ≤ σ/2 + 1 and val(C_{𝒮*}) ≥ ½.
Proof:
Let 𝒮 index a group of σ dictionary atoms exhibiting linear dependence. There is a nonzero vector δ̱ supported in 𝒮 with Dδ̱ = 0. Let 𝒮* be the subset of 𝒮 containing the ⌈#(𝒮)/2⌉ largest-magnitude entries in δ̱. Then #(𝒮*) ≤ #(𝒮)/2 + 1 while

∑_{k∈𝒮*} |δ(k)| ≥ ½∥δ̱∥_1.

It follows necessarily that val(C_{𝒮*}) ≥ ½.
In short, the size of Spark(D) controls uniqueness not only in the ℓ^{0} problem, but also in the ℓ^{1} problem. No sparsity bound ∥γ̱_{0}∥_{0} < s can imply (P_{0})–(P_{1}) equivalence in general unless s ≤ Spark(D)/2.
Equivalence Using M(G).
Theorem 7.
Suppose that the dictionary D has a Gram matrix G with offdiagonal entries bounded by M. If S = Dγ̱_{0} where ∥γ̱_{0}∥_{0} < (1 + 1/M)/2, then γ̱_{0} is the unique solution of (P_{1}).
We prove this by showing that if Dδ̱ = 0, then for any set 𝒮 of indices,

∑_{k∈𝒮} |δ(k)| ≤ #(𝒮)·(M/(1 + M))·∥δ̱∥_1. [14]

Then because #(𝒮) < (1 + 1/M)/2 = (1 + M)/(2M), it is impossible to achieve 50% concentration on 𝒮, and so Lemma 2 gives our result.
To get Eq. 14, we recall the sequence of capacity problems (Q_k) introduced earlier. By their definition, if δ̱ obeys Dδ̱ = 0 and δ(k) ≠ 0, then δ̱/δ(k) is feasible for (Q_k), so

∥δ̱∥_1 ≥ |δ(k)|·val(Q_k).

It follows (see also ref. 13) that for any set 𝒮,

∑_{k∈𝒮} |δ(k)| ≤ ∥δ̱∥_1 ∑_{k∈𝒮} 1/val(Q_k).

Result 14 and hence Theorem 7 follow immediately on substituting the following bound for the capacities:
Lemma 5.
Suppose the Gram matrix has off-diagonal entries bounded by M in magnitude. Then, for every k,

val(Q_k) ≥ 1 + 1/M.

To prove the lemma, we remark that as Dδ̱ = 0 then also Gδ̱ = D^H Dδ̱ = 0. In particular, (Gδ̱)_k = 0, where k is the specific index mentioned in the statement of the lemma. Now as δ(k) = 1 by definition of (Q_k), and as G_{k,k} = 1 by normalization, in order that (Gδ̱)_k = 0 we must have

∑_{j≠k} G_{k,j}δ(j) = −1.

Now

1 = |∑_{j≠k} G_{k,j}δ(j)| ≤ M ∑_{j≠k} |δ(j)|.

Hence ∥δ̱∥_1 = 1 + ∑_{j≠k} |δ(j)| ≥ 1 + 1/M, and the lemma follows.
Equivalence Using μ_{1/2}(G).
Recall the quantity μ_{1} defined in our analysis of the uniqueness problem in the previous section. We now define an analogous concept relevant to equivalence.
Definition 2:
For G a symmetric matrix, μ_{1/2}(G) denotes the smallest number m such that some collection of m offdiagonal magnitudes arising in a single row or column of G sums at least to ½.
We remark that μ_{1/2}(G) ≤ μ_1(G). Using this notion, we can show:
Theorem 8.
Consider a dictionary D with Gram matrix G. If S = Dγ̱_{0} where ∥γ̱_{0}∥_{0} < μ_{1/2}(G), then γ̱_{0} is the unique solution of (P_{1}).
Indeed, let 𝒮 be a set of size #(𝒮) < μ_{1/2}(G). We show that any vector in the nullspace of D exhibits less than 50% concentration on 𝒮:

∑_{k∈𝒮} |δ(k)| < ½∥δ̱∥_1. [15]

The theorem then follows from Lemma 2.
Note again that Dδ̱ = 0 implies Gδ̱ = 0. In particular, (G − I)δ̱ = −δ̱. Let m = #𝒮 and let H = (H_{i,k}) denote the m × L matrix formed from the rows of G − I corresponding to indices in 𝒮. Then (G − I)δ̱ = −δ̱ implies

Hδ̱ = −δ̱|_𝒮. [16]

As is well known, H, viewed as a matrix mapping ℓ^1 into ℓ^1, has its norm controlled by its maximum ℓ^1 column sum:

∥Hδ̱∥_1 ≤ ∥H∥_{(1,1)}∥δ̱∥_1, where ∥H∥_{(1,1)} = max_k ∑_{i∈𝒮} |H_{i,k}|.

By the assumption that μ_{1/2}(G) > #𝒮, any m off-diagonal magnitudes drawn from a single column of G sum to less than ½. It follows that ∥H∥_{(1,1)} < ½. Together with Eq. 16 this yields Eq. 15 and hence also the theorem.
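Computing μ_{1/2} mirrors the μ_1 computation with target ½; a brief sketch on a small hypothetical Gram matrix of our own construction (G Hermitian, so scanning rows covers columns too):

```python
import numpy as np

def mu_half(G):
    """mu_{1/2}(G): fewest off-diagonal magnitudes in a single row of G
    summing to at least 1/2 (Definition 2)."""
    off = np.abs(G - np.diag(np.diag(G)))
    best = G.shape[0]
    for row in off:
        csum = np.cumsum(np.sort(row)[::-1])   # largest entries first
        hit = np.nonzero(csum >= 0.5)[0]
        if hit.size:
            best = min(best, hit[0] + 1)
    return best

# one off-diagonal entry of 0.5 suffices: a single magnitude reaches 1/2
G = np.eye(5)
G[0, 1] = G[1, 0] = 0.5
print(mu_half(G))   # -> 1
```

When the off-diagonal magnitudes are very uneven, μ_{1/2} can be far smaller than μ_1, which is when Theorem 8 carries information beyond Theorem 7.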
Stylized Applications
We now sketch a few application scenarios in which our results may be of interest.
Separating Points, Lines, and Planes.
A timely problem in extragalactic astronomy is to analyze 3D volumetric data and quantitatively identify and separate components concentrated on filaments and sheets from those scattered uniformly through 3D space; see ref. 17 for related references, and ref. 8 for some examples of empirical success in performing such separations. It seems of interest to have a theoretical perspective assuring us that, at least in an idealized setting, such separation is possible in principle.
For sake of brevity, we work with specific algebraic notions of digital line, digital point, and digital plane. Let p be a prime (e.g., 127, 257) and consider a vector to be represented by a p × p × p array S(i_1, i_2, i_3) = S(i̱). As representers of digital points we consider the Kronecker sequences Point_{ḵ}(i̱) = 1_{{k_1=i_1, k_2=i_2, k_3=i_3}} as ḵ = (k_1, k_2, k_3) ranges over all triples with 0 ≤ k_i < p. We consider a digital plane Plane_p(j̱, k_1, k_2) to be the collection of all triples (i_1, i_2, i_3) obeying i_{j_3} = k_1 i_{j_1} + k_2 i_{j_2} mod p, where j̱ = (j_1, j_2, j_3) is a permutation of (1, 2, 3), and a digital line Line_p(i̱_0, ḵ) to be the collection of all triples i̱ = i̱_0 + ℓḵ mod p, ℓ = 0, … , p − 1.
By using properties of arithmetic mod p, it is not hard to show the following:
Any digital plane contains p^{2} points.
Any digital line contains p points.
Any two digital lines are disjoint, or intersect in a single point.
Any two digital planes are parallel, or intersect in a digital line.
Any digital line intersects a digital plane not at all, in a single point, or along the whole extent of the digital line.
Suppose now that we construct a dictionary D consisting of indicators of points, of digital lines, and of digital planes. These should all be normalized to unit ℓ^2 norm. Let G denote the Gram matrix. We observe that if d_1 and d_2 are normalized indicator functions corresponding to subsets I_1, I_2 ⊆ Z_p^3, then

⟨d_1, d_2⟩ = #(I_1 ∩ I_2)/√(#(I_1)·#(I_2)).

Using this fact and the above incidence estimates, we get that the inner product between two planes or two lines cannot exceed 1/p, while an inner product between a plane and a line or a line and a point cannot exceed 1/√p. We obtain immediately:
Theorem 9.
If the 3D array can be written as a superposition of fewer than (1 + √p)/2 digital planes, lines, and points, this is the unique sparsest representation, and it is also the unique minimal ℓ^1 decomposition.
This shows that there is at least some possibility to separate 3D data rigorously into unique geometric components. It is in accord with recent experiments in ref. 8 successfully separating data into a relatively few such components.
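The incidence facts and the coherence bounds above are easy to check numerically; a sketch with the small prime p = 5 (the particular planes and lines tested are arbitrary illustrative choices):

```python
import itertools
import numpy as np

p = 5  # a small prime

def line_atom(i0, k):
    """Unit-norm indicator of the digital line i = i0 + l*k mod p (p points)."""
    v = np.zeros((p, p, p))
    for l in range(p):
        v[tuple((np.array(i0) + l * np.array(k)) % p)] = 1.0
    return v.ravel() / np.sqrt(p)

def plane_atom(k1, k2):
    """Unit-norm indicator of the digital plane i3 = k1*i1 + k2*i2 mod p."""
    v = np.zeros((p, p, p))
    for i1, i2 in itertools.product(range(p), repeat=2):
        v[i1, i2, (k1 * i1 + k2 * i2) % p] = 1.0
    return v.ravel() / p   # p^2 points

# two distinct planes meet in a digital line: inner product p/p^2 = 1/p
print(np.dot(plane_atom(1, 2), plane_atom(3, 4)))
# a line lying inside a plane attains the worst case 1/sqrt(p)
print(np.dot(plane_atom(1, 2), line_atom((0, 0, 0), (1, 0, 1))))
```

The second inner product equals 1/√p because the chosen line happens to lie entirely inside the chosen plane, the extreme case in the incidence list above.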
Robust Multiuser Private Broadcasting.
Consider now a stylized problem in private digital communication. Suppose that J individuals are to communicate privately and without coordination over a broadcast channel with an intended recipient. The jth individual is to create a signal S_j that is superposed with all the other individuals' signals, producing S = ∑_j S_j. The intended recipient gets a copy of S. The signals S_j are encoded in such a way that, to everyone but the intended recipient, encoders included, S looks like white Gaussian noise; participating individuals cannot understand each other's messages, but nevertheless the recipient is able to separate S unambiguously into its underlying components S_j and decipher each such apparent noise signal into meaningful data. Finally, it is desired that all this remain true even if the recipient gets a defective copy of S that may be badly corrupted in a few entries.
Here is a way to approach this problem. The intended recipient arranges things in advance so that each of the J individuals has a private codebook C_j containing K different N-vectors whose entries are generated by sampling independent Gaussian white noises. The codebook is known to the jth encoder and to the recipient. An overall dictionary is created by concatenating these together with an identity matrix, producing

D = [C_1, C_2, … , C_J, I_N].

The jth individual creates a signal by selecting a random subset I_j of indices in C_j and generating

S_j = ∑_{i∈I_j} α_i c̱_{j,i}.

The positions of the indices i ∈ I_j and the coefficients α_i are the meaningful data to be conveyed. Now each c̱_{j,i} is the realization of a Gaussian white noise; hence each S_j resembles such a white noise. Individuals don't know each other's codebooks, so they are unable to decipher each other's meaningful data (in a sense that can be made precise; compare ref. 21).
The recipient can extract all the meaningful data simply by performing minimum ℓ^{1} atomic decomposition in dictionary D. This will provide a list of coefficients associated with spikes and with the components associated with each codebook. Provided the participants encode a sparse enough set of coefficients, the scheme works perfectly to separate errors and the components of the individual message.
To see this, note that by using properties of extreme values of Gaussian random vectors (compare ref. 13) we can show that the normalized dictionary obeys

M(D) ≤ 2√(log(L)/N)·(1 + ɛ_N), L = JK + N,

where ɛ_N is a random variable that tends to zero in probability as N → ∞. Hence we have:
Theorem 10.
There is a capacity threshold C(D) such that, if each participant transmits an S_{j} using at most #(I_{j}) ≤ C(D) coefficients, and if there are < C(D) errors in the recipient's copy of S, then the minimum ℓ^{1} decomposition of the received signal precisely recovers the indices I_{j} and the coefficients α. We have

C(D) ≥ (1 + M(D)^{−1})/(2(J + 1)),

so that C(D) → ∞ in probability as N → ∞.
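The driver of Theorem 10 is that the mutual coherence of the random dictionary shrinks as N grows, so the capacity threshold grows. This behavior is easy to check empirically; the sketch below is our own illustration (the function name `mutual_coherence` and the dimension choices are ours, not the paper's).

```python
# Empirical check that the mutual coherence M(D) of a dictionary built from
# Gaussian codebook atoms plus identity spikes decays as N grows (the
# epsilon_N of the text), so the recovery threshold (1 + 1/M)/2 grows.
import numpy as np

def mutual_coherence(D):
    """Largest absolute inner product between distinct unit-norm atoms."""
    Dn = D / np.linalg.norm(D, axis=0)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)     # ignore each atom's product with itself
    return G.max()

rng = np.random.default_rng(1)
coh = {}
for N in (50, 200, 800):
    # N Gaussian codebook atoms concatenated with the N identity spikes.
    D = np.hstack([rng.standard_normal((N, N)), np.eye(N)])
    coh[N] = mutual_coherence(D)
# Coherence behaves like sqrt(2 log N / N), tending to 0 as N grows.
```

For these sizes the measured coherences decrease steadily with N, consistent with the extreme-value behavior invoked in the text.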
Identification of Overcomplete Blind Sources.
Suppose that a statistically independent set of random variables X = (X_{i}) generates an observed data vector according to

Y = AX, [17]

where Y is the vector of observed data and A represents the coupling of independent sources to observable sensors. Such models are often called independent components models (22). We are interested in the case where A is known but overcomplete, in the sense that L ≫ N (many more sources than sensors, a realistic assumption). We are not supposing that the vector X is highly sparse; hence there is no hope in general of recovering uniquely the components X generating a single realization Y. We ask instead whether it is possible to recover statistical properties of X; in particular, its higher-order cumulants.
The order 1 and 2 cumulants (mean and variance) are well known; the order 3 and 4 cumulants (essentially, skewness and kurtosis) are also well known; for general definitions see ref. 22 and other standard references, such as McCullagh (23). In general, the mth-order joint cumulant tensor of a random variable U ∈ C^{N} is an m-way array Cum_{m}(U)(i_{1}, … , i_{m}) with 1 ≤ i_{1}, … , i_{m} ≤ N. If Y is generated in terms of the independent components model Eq. 17, then we have an additive decomposition:

Cum_{m}(Y) = ∑_{k} Cum_{m}(X_{k}) (a_{k} ⊗ a_{k} ⊗ … ⊗ a_{k}),

where Cum_{m}(X_{k}) is the mth-order cumulant of the scalar random variable X_{k}. Here (a_{k} ⊗ a_{k} ⊗ … ⊗ a_{k}) denotes an m-way array built from exterior powers of the kth column of A:

(a_{k} ⊗ … ⊗ a_{k})(i_{1}, … , i_{m}) = a_{k}(i_{1}) ⋯ a_{k}(i_{m}).

Define then a dictionary D^{(m)} for m-way arrays with kth atom the mth exterior power of a_{k}:

d_{k}^{(m)} = a_{k} ⊗ a_{k} ⊗ … ⊗ a_{k}.

Now let S^{(m)} denote the mth-order cumulant array Cum_{m}(Y). Then

S^{(m)} = D^{(m)} γ̱^{(m)}, [18]

where D^{(m)} is the dictionary and γ̱^{(m)} is the vector of mth-order scalar cumulants of the independent components X. If we can uniquely solve this system, then we have a way of learning cumulants of the (unobservable) X from cumulants of the (observable) Y. Abusing notation so that M(D) ≡ M(D^{H}D) etc., we have

M(D^{(m)}) = M(D)^{m}.

In short, if M(D) < 1, then M(D^{(m)}) can be very small for large enough m. Hence, even if the M value for determining a not-very-sparse X from Y is unfavorable, the corresponding M value for determining Cum_{m}(X) from Cum_{m}(Y) can be very favorable, for m large enough. Suppose, for example, that A is an N by L system with M(A) = 1/√N, but X has (say) N/3 active components. This is not very sparse; on typical realizations ∥X∥_{0} = N/3 ≫ √N = 1/M. In short, unique recovery of X based on sparsity will not hold. However, the representation of the second-order cumulant array (the covariance) Cum_{2}(Y) still has only N/3 active coefficients, while M(D^{(2)}) = M(D)^{2} = 1/N; as N/3 < (1 + 1/M(D^{(2)}))/2 = (N + 1)/2, the cumulant array Cum_{2}(Y) is sparsely represented, and Cum_{2}(X) is recovered uniquely by ℓ^{1} optimization. Generalizing, we have
Theorem 11.
Consider any coupling matrix A with M(A) < 1. For all sufficiently large m ≥ m*(L, M), the minimum ℓ^{1} solution of the cumulant decomposition problem (Eq. 18) is unique (and correct!).
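The key identity M(D^{(m)}) = M(D)^{m} follows because, for unit-norm columns, ⟨a_{j} ⊗ … ⊗ a_{j}, a_{k} ⊗ … ⊗ a_{k}⟩ = ⟨a_{j}, a_{k}⟩^{m}. A short numerical check for m = 2 (our own illustration; the function name `coherence` and the dimensions are ours):

```python
# Verify M(D^(2)) = M(D)^2: tensor-square atoms vec(a_k a_k^T) have pairwise
# inner products equal to the squares of the original Gram entries.
import numpy as np

rng = np.random.default_rng(2)
N, L = 8, 20
A = rng.standard_normal((N, L))
A /= np.linalg.norm(A, axis=0)            # unit-norm coupling-matrix columns

def coherence(D):
    """Max absolute off-diagonal entry of the Gram matrix D^T D."""
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)
    return G.max()

# D^(2): kth atom is the vectorized exterior square a_k (x) a_k.
D2 = np.column_stack([np.outer(a, a).ravel() for a in A.T])

M1 = coherence(A)                          # M(D)
M2 = coherence(D2)                         # M(D^(2)) -- equals M1**2 exactly
```

Since squaring is monotone on [0, 1], the maximal off-diagonal Gram entry of D^{(2)} is exactly the square of that of A, which is what makes the cumulant dictionary so much more incoherent than the original coupling matrix.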
Acknowledgments
This work was supported in part by National Science Foundation Grants DMS 0077261, ANI008584 (ITR), DMS 0140698 (FRG), and DMS 9872890 (KDI), the Defense Advanced Research Projects Agency, the Applied and Computational Mathematics Program, and Air Force Office of Scientific Research Multidisciplinary University Research Initiative Grant 95P496209610028.
 Accepted December 20, 2002.
 Copyright © 2003, The National Academy of Sciences
References
 Mallat, S.
 Coifman, R., Meyer, Y. & Wickerhauser, M. V.
 Wickerhauser, M. V.
 Berg, A. P. & Mikhael, W. B.
 Huo, X.
 Starck, J. L., Donoho, D. L. & Candès, E. J. (2002) Astron. Astrophys., in press.
 Starck, J. L., Candès, E. J. & Donoho, D. L.
 Chen, S. S.
 Elad, M. & Bruckstein, A. M.
 Gill, P. E., Murray, W. & Wright, M. H.
 Donoho, D. L., Levi, O., Starck, J. L. & Martinez, V.
 Gribonval, R. & Nielsen, M. (2002) IEEE Trans. Inf. Theory, in press.
 Feuer, A. & Nemirovsky, A. (2002) IEEE Trans. Inf. Theory, in press.
 Horn, R. A. & Johnson, C. R.
 Hyvärinen, A., Karhunen, J. & Oja, E.
 McCullagh, P.