Sparse timefrequency representations
See allHide authors and affiliations

Communicated by Mitchell J. Feigenbaum, The Rockefeller University, New York, NY, March 1, 2006 (received for review January 20, 2006)
Abstract
Auditory neurons preserve exquisite temporal information about sound features, but we do not know how the brain uses this information to process the rapidly changing sounds of the natural world. Simple arguments for effective use of temporal information led us to consider the reassignment class of timefrequency representations as a model of auditory processing. Reassigned timefrequency representations can track isolated simple signals with accuracy unlimited by the timefrequency uncertainty principle, but lack of a general theory has hampered their application to complex sounds. We describe the reassigned representations for white noise and show that even spectrally dense signals produce sparse reassignments: the representation collapses onto a thin set of lines arranged in a frothlike pattern. Preserving phase information allows reconstruction of the original signal. We define a notion of “consensus,” based on stability of reassignment to timescale changes, which produces sharp spectral estimates for a wide class of complex mixed signals. As the only currently known class of timefrequency representations that is always “in focus” this methodology has general utility in signal analysis. It may also help explain the remarkable acuity of auditory perception. Many details of complex sounds that are virtually undetectable in standard sonograms are readily perceptible and visible in reassignment.
Timefrequency analysis seeks to decompose a onedimensional signal along two dimensions, a time axis and a frequency axis; the best known timefrequency representation is the musical score, which notates frequency vertically and time horizontally. These methods are extremely important in fields ranging from quantum mechanics (1–5) to engineering (6, 7), animal vocalizations (8, 9), radar (10), sound analysis and speech recognition (11–13), geophysics (14, 15), shaped laser pulses (16–18), the physiology of hearing, and musicography. ^{¶}
Hainsworth, S. W., Macleod, M. D. & Wolfe, P. J., IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 21–24, 2001, Mohonk, NY, p. 26.
A central question of auditory theory motivates our study: what algorithms do the brain use to parse the rapidly changing sounds of the natural world? Auditory neurons preserve detailed temporal information about sound features, but we do not know how the brain uses it to process sound. Although it is accepted that the auditory system must perform some type of timefrequency analysis, we do not know which type. The many inequivalent classes of timefrequency distributions (2, 3, 6) require very different kinds of computations: linear transforms include the Gabor transform (19), quadratic transforms [known as Cohen’s class (2, 6)] include the Wigner–Ville (1) and Choi–Williams (20) distributions, and higherorder in the signal, include multitapered spectral estimates (21–24), the Hilbert–Huang distribution (25, 26), and the reassigned spectrograms (27–32) whose properties are the subject of this article.Results and Discussion
The auditory nerve preserves information about phases of oscillations much more accurately than information about amplitudes, a feature that inspired temporal theories of pitch perception (33–37). Let us consider what types of computation would be simple to perform given this information. We shall idealize the cochlea as splitting a sound signal χ(t) into many component signals χ(t,ω) indexed by frequency ω χ is the Gabor transform (19) or shorttime Fourier transform (STFT) of the signal x(t). The parameter σ is the temporal resolution or time scale of the transform, and its inverse is the frequency resolution or bandwidth. The STFT χ is a smooth function of both t and ω and is strongly correlated for Δt < σ or Δω < 1/σ. In polar coordinates it decomposes into magnitude and phase, χ(t, ω) = χ(t, ω)e^{iφ(t,ω)} . A plot of χ ^{2} as a function of (t, ω) is called the spectrogram (3, 38), sonogram (8), or Husimi distribution (2, 4) of the signal x(t). We call φ(t, ω) the phase of the STFT; it is well defined for all (t, ω) except where χ = 0. We shall base our representation on φ.
We can easily derive two quantities from φ: the time derivative of the phase, called the instantaneous frequency (31), and the current time minus the frequency derivative of the phase (the local group delay), the instantaneous time: Neural circuitry can compute or estimate these quantities from the information in the auditory nerve: the time derivative, as the time interval between action potentials in one given fiber of the auditory nerve, and the frequency derivative from a time interval between action potentials in nearby fibers, which are tonotopically organized (34).
Any neural representation that requires explicit use of ω or t is unnatural, because it entails “knowing” the numerical values of both the central frequencies of fibers and the current time. Eq. 2 affords a way out: given an estimate of a frequency and one of a time, one may plot the instantaneous estimates against each other, making only implicit use of (t, ω), namely, as the indices in an implicit plot. So for every pair (t, ω), the pair (t_{ins} , ω _{ins} ) is computed from Eq. 2 , and the two components are plotted against each other in a plane that we call (abusing notation) the (t_{ins} , ω _{ins} ) plane. More abstractly, Eq. 2 defines a transformation T The transformation is signaldependent because φ has to be computed from Eq. 1 , which depends on the signal x, hence the subscript {x} on T.
The transformation given by Eqs. 2 and 3 has optimum timefrequency localization properties for simple signals (27, 28). The values of the estimates for simple test signals are given in Table 1.
So for a simple tone of frequency ω _{0} , the whole (t, ω) plane is transformed into a single line, (t, ω _{0} ); similarly, for a “click,” Dirac delta function localized at time t _{0} , the plane is transformed into the line (t _{0} , ω); and for a frequency sweep where the frequency increases linearly with time as αt, the plane collapses onto the line αt _{ins} = ω _{ins} . (The full expression in the frequency sweep case is given in Appendix.) So for these simple signals the transformation (t, ω) → (t_{ins} , ω _{ins} ) has a simple interpretation as a projection to a line that represents the signal. The transformation’s timefrequency localization properties are optimal, because these simple signals, independently of their slope, are represented by lines of zero thickness. Under the STFT the simple signals above transform into strokes with a Gaussian profile, with vertical thickness 1/σ (tones) and horizontal thickness σ (clicks).
These considerations lead to a careful restatement of the uncertainty principle. In optics it is well known that there is a difference between precision and resolution. Resolution refers to the ability to establish that there are two distinct objects at a certain distance, whereas precision refers to the accuracy with which a single object can be tracked. The wavelength of light limits resolution, but not precision. Similarly, the uncertainty principle limits the ability to separate a sum of signals as distinct objects, rather than the ability to track a single signal. The bestknown distribution with optimal localization, the Wigner–Ville distribution (1), achieves optimal localization at the expense of infinitely long range in both frequency and time. Because it is bilinear, the Wigner transform of a sum of signals causes the signals to interfere or beat, no matter how far apart they are in frequency or time, seriously damaging the resolution of the transform. This nonlocality makes it unusable in practice and led to the development of Cohen’s class. In contrast, it is readily seen from Eq. 1 that the instantaneous timefrequency reassignment cannot cause a sum of signals to interfere when they are further apart than a Fourier uncertainty ellipsoid; therefore, it can resolve signals as long as they are further apart than the Fourier uncertainty ellipsoid, which is the optimal case. Thus, reassignment with instantaneous timefrequency estimates has optimal precision (unlimited) and optimal resolution (strict equality in the uncertainty relation).
We shall now derive the formula needed to implement numerically this method. First, the derivatives of the transformation defined by Eq. 2 should be carried out analytically. The Gaussian window in the STFT has a complex analytic structure; defining z = t/σ − iσω we can write the STFT as So up to the factor e ^{(σω)2/2}, the STFT is an analytic function of z (29). Defining we obtain in closed form So the mapping is a quotient of convolutions: where ∗ is the complex conjugate. Therefore, computing the instantaneous timefrequency transformation requires only twice the numerical effort of an STFT.
Any transformation F: (ω, t) → (ω _{ins} , t_{ins} ) can be used to transform a distribution in the (ω, t) plane to its corresponding distribution in the (ω _{ins} , t_{ins} ) plane. If the transformation is invertible and smooth, the usual case for a coordinate transformation, this change of coordinates is done by multiplying by the Jacobian of F the distribution evaluated at the “old coordinates” F ^{−1}(ω _{ins} , t_{ins} ). Similarly, the transformation T given by Eqs. 2 and 3 transforms a distribution f in the (ω,t) plane to the (ω _{ins} , t_{ins} ) plane, called a “reassigned f ” (28–30, 38) ^{§} . However, because T is neither invertible nor smooth, the reassignment requires an integral approach, best visualized as the numerical algorithm shown in Fig. 1: generate a fine grid in the (t, ω) plane, map every element of this grid to its estimate (t_{ins} , ω _{ins} ), and then create a twodimensional histogram of the latter. If we weight the histogrammed points by a positivedefinite distribution f(t, ω), the histogram g(t_{ins} , ω _{ins} ) is the reassigned or remapped f; if the points are unweighted (i.e., f = 1), we have reassigned the uniform (Lebesgue) measure. We call the class of distributions so generated the reassignment class. Reassignment has two essential ingredients: a signaldependent transformation of the timefrequency plane, in our case the instantaneous timefrequency mapping defined by Eq. 2 , and the distribution being reassigned. We shall for the moment consider two distributions: that obtained by reassigning the spectrogram χ ^{2} , and that obtained by reassigning 1, the Lebesgue measure. Later, we shall extend the notion of reassignment and reassign χ itself to obtain a complex reassigned transform rather than a distribution.
Neurons could implement a calculation homologous to the method shown in Fig. 1, e.g., by using varying delays (39) and the “manyareequal” logical primitive (40), which computes histograms.
Despite its highly desirable properties, the unwieldy analytical nature of the reassignment class has prevented its wide use. Useful signal estimation requires us to know what the transformation does to both signals and noise. We shall now demonstrate the usefulness of reassignment by proving some important results for white noise. Fig. 2shows the sonogram and reallocated sonogram of a discrete realization of white noise. In the discrete case, the signal is assumed to repeat periodically, and a sum replaces the integral in Eq. 1 . If the signal has N discrete values we can compute N frequencies by Fourier transformation, so the timefrequency plane has N ^{2} “pixels,” which, having been derived from only N numbers, are correlated (19). Given a discrete realization of white noise, i.e., a vector with N independent Gaussian random numbers, the STFT has exactly N zeros on the fundamental tile of the (t,ω) plane, so, on average, the area per zero is 1. These zeros are distributed with uniform density, although they are not independently distributed.
Because the zeros of the STFT are discrete, the spectrogram is almost everywhere nonzero. In Fig. 2 the reassigned distributions are mostly zero or near zero: nonzero values concentrate in a frothlike pattern covering the ridges that separate neighboring zeros of the STFT. The Weierstrass representation theorem permits us to write the STFT of white noise as a product over the zeros × the exponential of an entire function of quadratic type: where Q(z) is a quadratic polynomial and z _{i} is the zeros of the STFT. The phase φ = Im ln G and hence the instantaneous estimates in Eq. 2 become sums of magneticlike interactions where and similarly for the instantaneous time; so the slow manifolds of the transformation T, where the reassigned representation has its support, are given by equations representing equilibria of magneticlike terms.
The reassigned distributions lie on thin strips between the zeros, which occupy only a small fraction of the timefrequency plane; see Appendix for an explicit calculation of the width of the stripes in a specific case. The fraction of the timefrequency plane occupied by the support of the distribution decreases as the sequence becomes longer, as in Fig. 3; therefore, reassigned distributions are sparse in the timefrequency plane. Sparse representations are of great interest in neuroscience (41–43), particularly in auditory areas, because most neurons in the primary auditory cortex A1 are silent most of the time (44–46).
Signals superposed on noise move the zeros away from the representation of the pure signal, creating crevices. This process is shown in Fig. 4. When the signal is strong and readily detectable, its reassigned representation detaches from the underlying “froth” of noise; when the signal is weak, the reassigned representation merges into the froth, and if the signal is too weak its representation fragments into disconnected pieces.
Distributions are not explicitly invertible; i.e., they retain information on features of the original signal, but lose some information (for instance about phases) irretrievably. It would be desirable to reassign the full STFT χ rather than just its spectrogram χ ^{2} . Also the auditory system preserves accurate timing information all of the way to primary auditory cortex (47). We shall now extend the reassignment class to complexvalued functions; to do this we need to reassign phase information, which requires more care than reassigning positive values, because complex values with rapidly rotating phases can cancel through destructive interference. We build a complexvalued histogram where each occurrence of (t_{ins} , ω _{ins} ) is weighted by χ(t, ω). We must transform the phases so preimages of (t_{ins} , ω _{ins} ) add coherently. The expected phase change from (t, ω) to (t_{ins} , ω _{ins} ) is (ω + ω _{ins} )(t_{ins} − t)/2, i.e., the average frequency times the time difference. This correction is exact for linear frequency sweeps. Therefore, we reassign χ by histogramming (t_{ins} , ω _{ins} ) weighted by χ(t, ω)e ^{i(ω+ωins)(tins −t)/2}. Unlike standard reassignment, the weight for complex reassignment depends on both the point of origin and the destination.
The complex reassigned STFT now shares an important attribute of χ that neither χ ^{2} nor any other positivedefinite distribution possesses: explicit invertibility. This inversion is not exact and may diverge significantly for spectrally dense signals. However, we can reconstruct simple signals directly by integrating on vertical slices, as in Fig. 5, which analyzes a chirp (a Gaussianenveloped frequency sweep). Complex reassignment also allows us to define and compute synchrony between frequency bands: only by using the absolute phases can we check whether different components appearing to be harmonics of a single sound are actually synchronous.
We defined the transformation for a single value of the bandwidth σ. Nothing prevents us from varying this bandwidth or using many bandwidths simultaneously, and indeed the auditory system appears to do so, because bandwidth varies across auditory nerve fibers and is furthermore volumedependent. Performing reassignment as a function of time, frequency, and bandwidth we obtain a reassigned wavelet representation, which we shall not cover in this article. We shall describe a simpler method: using several bandwidths simultaneously and highlighting features that remain the same as the bandwidth is changed. When we intersect the footprints of representations for multiple bandwidths we obtain a consensus only for those features that are stable with respect to the analyzing bandwidth (31), as in Fig. 6. For spectrally more complex signals, distinct analyzing bandwidths resolve different portions of the signal. Yet the lack of predictability in the auditory stream precludes choosing the right bandwidths in advance. In Fig. 6, the analysis proceeds through many bandwidths, but only those bands that are locally optimal for the signal stand out as salient through consensus. Application of consensus to real sounds is illustrated in Fig. 7. This principle may also support robust pattern recognition in the presence of primary auditory sensors whose bandwidths depend on the intensity of the sound.
A final remark about the use of timing information in the auditory system is in order. Because G(z) (Eq. 4 ) is an analytic function of z, its logarithm is also analytic away from its zeros, and so
satisfies the Cauchy–Riemann relations, from where the derivatives of the spectrogram can be computed in terms of the derivatives of the phase and vice versa, as shown (29):
So, mathematically, timefrequency analysis can equivalently be done from phase or intensity information. In the auditory system, though, these two approaches are far from equivalent: estimates of χ and its derivatives must rely on estimating firing rates in the auditory nerve and require many cycles of the signal to have any accuracy. As argued before, estimating the derivatives of φ only requires computation of intervals between few spikes.
Our argument that timefrequency computations in hearing use reassignment or a homologous method depends on a few simple assumptions: (i) we must use simple operations from information readily available in the auditory nerve, mostly the phases of oscillations; (ii) we must make only implicit use of t and ω; (iii) phases themselves are reassigned; and (iv) perception uses a multiplicity of bandwidths. Assumptions i and ii led us to the definition of reassignment, iii led us to generalize reassignment by preserving phase information, and iv led us to define consensus. The result is interesting mathematically because reassignment has many desirable properties. Two features of the resulting representations are pertinent to auditory physiology. First, our representations make explicit information that is readily perceived yet is hidden in standard sonograms. We can perceive detail below the resolution limit imposed by the uncertainty principle that is stable across bandwidth changes, as in Fig. 5. Second, the resulting representations are sparse, which is a prominent feature of auditory responses in primary auditory cortex.
Appendix: Instantaneous Time Frequency for Some Specific Signals
Frequency Sweep.
x(t) + e ^{iαt2/2}: from where
GaussianEnveloped Tone.
x(t) = exp(−(t − t
_{0})
^{2}
/2λ
^{2}
+ iω
_{0}
(t − t
_{0}
)), then the STFT has support on an ellipsoid centered at (t
_{0}, ω
_{0}
) with temporal width
Acknowledgments
We thank Dan Margoliash, Sara Solla, David Gross, and William Bialek for discussions shaping these ideas; A. James Hudspeth, Carlos Brody, Tony Zador, Partha Mitra, and the members of our research group for spirited discussions and critical comments; and Fernando Nottebohm and Michale Fee for support of T.J.G. This work was supported in part by National Institutes of Health Grant R01DC007294 (to M.O.M.) and a Career Award at the Scientific Interface from the BurroughsWellcome Fund (to T.J.G.).
Footnotes
 ^{§}To whom correspondence should be addressed. Email: magnasco{at}rockefeller.edu

Author contributions: T.J.G. and M.O.M. designed research, performed research, and wrote the paper.

↵ ¶ Hainsworth, S. W., Macleod, M. D. & Wolfe, P. J., IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 21–24, 2001, Mohonk, NY, p. 26.

Conflict of interest statement: No conflicts declared.
 Abbreviations:
 STFT,
 shorttime Fourier transform.
Abbreviation:

Freely available online through the PNAS open access option.
 © 2006 by The National Academy of Sciences of the USA
References
 ↵
 ↵

↵
 Cohen L.
 ↵
 ↵

↵
 Cohen L.

↵
 Hogan J. A. ,
 Lakey J. D.

↵
 Greenewalt C. H.
 ↵

↵
 Chen V. C. ,
 Ling H.

↵
 Riley M. D.
 ↵
 ↵

↵
 Steeghs P. ,
 Baraniuk R. ,
 Odegard J.
 PapandreouSuppappola A.

↵
 Vasudevan K. ,
 Cook F. A.
 ↵
 ↵

↵
 Marian A. ,
 Stowe M. C. ,
 Lawall J. R. ,
 Felinto D. ,
 Ye J.

↵
 Gabor D.
 ↵

↵
 Thomson D. J.

↵
 Slepian D. ,
 Pollak H. O.
 ↵
 ↵

↵
 Huang N. E. ,
 Shen Z. ,
 Long S. R. ,
 Wu M. L. C. ,
 Shih H. H. ,
 Zheng Q. N. ,
 Yen N. C. ,
 Tung C. C. ,
 Liu H. H.

↵
 Yang Z. H. ,
 Huang D. R. ,
 Yang L. H.
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵

↵
 Cariani P. A. ,
 Delgutte B.

↵
 Cariani P. A. ,
 Delgutte B.

↵
 Julicher F. ,
 Andor D. ,
 Duke T.

↵
 Flandrin P.
 ↵

↵
 Hopfield J. J. ,
 Brody C. D.
 ↵
 ↵
 ↵

↵
 Coleman M. J. ,
 Mooney R.

↵
 DeWeese M. R. ,
 Wehr M. ,
 Zador A. M.
 ↵

↵
 Elhilali M. ,
 Fritz J. B. ,
 Klein D. J. ,
 Simon J. Z. ,
 Shamma S. A.