Independent component analysis for brain fMRI does not select for independence

Contributed by I. Daubechies, April 15, 2009 (received for review July 18, 2007)
Abstract
InfoMax and FastICA are the independent component analysis algorithms most used, and apparently most effective, for brain fMRI. We show that this is linked to their ability to deal effectively with sparse components rather than with independent components as such. The mathematical design of better analysis tools for brain fMRI should thus emphasize mathematical characteristics other than independence.
Independent component analysis (ICA), a framework for separating a mixture of different components into its constituents, has been proposed for many applications, including functional magnetic resonance imaging (fMRI) (1–3). Separating a signal mixture into its components is impossible in general; however, many cases of interest allow for special underlying assumptions that make the problem tractable. ICA algorithms decompose a mixture into a sum of signals that "optimize independence." Several algorithms and software packages are available for ICA (4, 5).
The first blind source separation in fMRI via ICA used InfoMax (1); other ICA algorithms for fMRI followed, such as FastICA (2, 5). These algorithms work well if the components have "generalized Gaussian" distributions of the form p(x) = C exp(−α|x|^{γ}), with γ ≠ 2. More general ICA algorithms that assume less about the components can separate mixtures into independent components in cases where InfoMax and FastICA fail. Nevertheless, these 2 are the most used ICA algorithms for brain fMRI.
Stochastic processes are independent if the distribution of either remains the same when the other is conditioned to any subregion of its range. Detecting deviations from independence requires large samples. In fMRI experiments, brain activity is measured in small volumetric regions or voxels v ∈ V, at times t_{n}, n = 1,…,N. (fMRI measures brain function via the associated increase of oxygen-enriched blood flow. The hemodynamic response function is the flow's time profile for 1 pulse in brain activity; from a signal analysis point of view, it blurs the signal in time.) Often the voxels outnumber N by far. Thus, one often prefers to view the voxel index v as labeling the samples over which independence is sought (spatial ICA, or SICA), rather than the t_{n} (temporal ICA, or TICA).
In the linear model for brain activation (6), the total brain activity X(t,v) is assumed to be a linear superposition of the different ongoing brain activity patterns: X(t,v) = ∑_{ℓ = 1}^{L} M_{ℓ}(t)C_{ℓ}(v), where the C_{ℓ} correspond to the brain activity patterns, and the “mixing matrix” M gives the corresponding time courses. At high signal amplitudes, saturation effects “spoil” linearity; nevertheless, the linear model is remarkably effective. We shall stick to it here.
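The linear model amounts to a simple matrix product; the sketch below illustrates this with invented sizes and random entries (the numbers of components, time points, and voxels are not taken from any experiment in this article).

```python
import numpy as np

rng = np.random.default_rng(0)

L, T, V = 3, 40, 500             # components, time points, voxels (illustrative)
M = rng.normal(size=(T, L))      # "mixing matrix": time course M_l(t) per component
C = rng.normal(size=(L, V))      # spatial activity patterns C_l(v)

# X(t, v) = sum_l M_l(t) C_l(v): the linear superposition, as a matrix product
X = M @ C
```

Each row of X is then one acquired volume, and each column is one voxel's time series.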
Typically, the brain function under study is turned "off" and "on" by having subjects perform a task during defined periods, punctuated by either resting states or other tasks. The activation map of interest, C_{act}(v), associated with a time course M_{act}(t) related to the task paradigm, is then identified via a statistical analysis. When a strict paradigm is not possible, or to capture more complex task-related time dependence, "blind" decomposition techniques are of interest; they decompose X(t,v) without reference to the task paradigm. The task paradigm (convolved with the hemodynamic response) is then used only to identify, among the C_{ℓ}(v) found by the algorithm, the one with the most closely matching time course.
ICA (or another source separation algorithm) is thus used to identify the different components C_{ℓ} by decomposing X(t,v). In brain fMRI, there is no physical reason for the spatial samples to correspond to different activity patterns C_{ℓ}(v) with independent distributions. Several of the seminal papers suggest that ICA decomposition is particularly effective when the brain patterns one seeks are spatially sparse, with negligible or no overlap (1). Such components are "near" to independence in the following sense. Consider 2 binary-valued components C_{1} and C_{2}. Define V_{1} (V_{2}) as the collection of all v where C_{1}(v) [C_{2}(v)] equals 1, and V_{12} as V_{1} ∩ V_{2}. Then the random processes that consist of evaluating C_{1}(v) and C_{2}(v) for a voxel v, picked randomly (uniformly distributed) in V, are independent if and only if

#(V_{12})/#(V) = [#(V_{1})/#(V)] · [#(V_{2})/#(V)].
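For binary-valued components this product condition can be checked by plain voxel counting. A minimal sketch, using the rectangular supports of Example 1 of the simulations below (converted to 0-based indices):

```python
import numpy as np

# 100 x 100 voxel grid; V1, V2 are the rectangular supports of 2 binary components
shape = (100, 100)
V1 = np.zeros(shape, dtype=bool); V1[10:40, 20:70] = True
V2 = np.zeros(shape, dtype=bool); V2[30:80, 40:80] = True

nV = V1.size
p1 = V1.sum() / nV            # fraction of voxels where C1(v) = 1
p2 = V2.sum() / nV
p12 = (V1 & V2).sum() / nV    # fraction of voxels where both components equal 1

# the 2 binary components are independent iff p12 equals p1 * p2
print(p1, p2, p12, p1 * p2)
```

Here p1 = 0.15, p2 = 0.20, and the overlap fraction is exactly 0.03 = p1·p2, so the condition holds.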
Spatial Variation Captured by ICA Algorithms
We first investigate the ability of InfoMax and FastICA to extract fine-grained spatially varying brain function maps. To this end, we carried out an ICA analysis of the data described in ref. 9, which demonstrated distributed patterns of activation, to find out to what extent ICA analysis could reproduce these findings. We first briefly recall the setup and results of this experiment (9).
Overlapping Representations of Faces and Objects in Ventral Temporal Cortex.
The subjects in the original experiment were shown visual stimuli during the "on" intervals of a simple on/off block time paradigm. The stimuli were images that, depending on their content, belonged to 1 of 8 categories: Faces, Houses, Cats, Bottles, Scissors, Shoes, Chairs, or a control category of Scrambles (random noise patches). Images viewed in each 24-s stimulus block belonged to the same category. Each run consisted of 8 stimulus blocks (1 for each category), separated by 12-s rest intervals. The blocks visited the categories in a different random order for each run. To keep their attention on the images, the subjects were assigned a 1-back memory task. The object of the study was, however, not linked to this task but to the representation in the ventral temporal cortex of the different broad classes of objects.
For each subject, 12 fMRI time series were recorded. The object-selective cortex was determined (for each subject separately) by selecting the voxels whose responses differed significantly across categories, including voxels that showed no or weak activation for 1 or several categories. The ventral temporal object-selective cortex S was defined as the intersection of the anatomically defined ventral temporal cortex and the functionally defined object-selective cortex. The data, reduced to 12 time series for each voxel in S, were then split into training and test sets (even- vs. odd-numbered time series).
Based on the training dataset, S was partitioned into 8 subsets S_{c} by determining for each voxel v ∈ S the category c for which the response deviated most from the average response (over categories) of that voxel. This partition was used in the second, most important, step of the study. In the first step, the activation patterns (restricted to S) extracted for each stimulus time block in the test dataset were correlated with each of the category-specific activation patterns on S observed in the training dataset. The correlation scores were systematically and significantly higher "within category" than "between category," so that the activation pattern in S for any particular stimulus block in the test set allows one to identify, with high confidence, the category the subject viewed during that time block. (Detailed data are in ref. 9; see also Fig. 2 below.)
In the second step, similar pairwise correlations were computed, except that, for every pair c,c′, the computation of the correlation between the 2 activation patterns was carried out summing over only voxels in S\(S_{c} ∪ S_{c′}), i.e., excluding the voxels that were most active for either c or c′. Although the correlation scores for these amputated object-selective zones were smaller than before, they still permitted correct identification, with high confidence, of the categories viewed during the test time blocks. The information about the identity of the categories was thus stored not just in a small specialized zone but also, almost as importantly, in a more distributed spatially varying pattern, with lower amplitudes.
ICA Analysis of This Dataset.
Because N was too large for the ICA algorithms, we first performed a dimensionality reduction via PCA, retaining only the L PCA components with largest singular value, with L < N. Choosing L appropriately is nontrivial (10); we used an automatic estimation method for each dataset, maximizing the evidence of the model (11), as implemented in the FSL package (12) [FSL (and its subtools BET, FLIRT, FEAT, MELODIC) can be downloaded from www.fmrib.ox.ac.uk/fsl]. (We also checked that augmenting L beyond our cutoff dimension did not impact the contrast of the components isolated by the ICA algorithm.) The output of this PCA was used as input for spatial ICA, to obtain L "spatially independent" components. We used 2 algorithms: InfoMax, as implemented in the NIS package (nisica, available at http://kraepelin.wpic.pitt.edu/nis), and FastICA, as implemented in the FSL package (12); in both cases the standard nonlinearities were used, maximizing the non-Gaussianity of the spatial sources. (Note: the FSL package outputs Z-score maps for the components isolated by the FastICA algorithm, whereas the NIS package provides the ICA maps themselves. To make the results comparable, the raw ICA components computed within FSL were used, not the corresponding Z-score maps.)
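The dimension-reducing PCA step can be sketched with a plain SVD. This illustrative version simply keeps the L components with largest singular value for a fixed L, rather than choosing L by maximizing the model evidence as MELODIC does; the data matrix is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(1)
T, V, L = 60, 400, 5          # time points, voxels, retained PCA dimension

X = rng.normal(size=(T, V))   # stand-in for the data matrix X(t, v)
X = X - X.mean(axis=0)        # demean each voxel's time series

# SVD; numpy returns the singular values in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# keep only the L components with the largest singular values;
# the resulting L x V matrix is what is handed to spatial ICA
X_red = np.diag(s[:L]) @ Vt[:L]
```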
As shown by the original data analysis in ref. 9, the patterns of activation associated with the different separate categories are highly overlapping; InfoMax and FastICA lacked sufficient sensitivity to distinguish the responses across categories. When either ICA algorithm was applied to 1 of the original 8-block functional time series (containing 1 block of trials for each of the 8 categories), the time series of the resulting ICA components did not correlate strongly with any of the category-specific reference functions (which consisted of single 24-s "on" blocks). Instead, a single consistently task-related (CTR) component was produced systematically, with a time series that correlated strongly with the full 8-block time paradigm. This result persisted when several original time series were concatenated ("creating" a signal in which the category-specific paradigms had several active blocks).
To identify category-specific activation maps by ICA, we reorganized the data. Only voxels in the ventral temporal cortex were studied, as in the original data analysis in ref. 9. For these voxels, new time series were constructed by concatenating blocks (consisting of 24 s of task plus the following 12 s of rest) of images corresponding to the same category, adjusting the mean of each time series, and high-pass filtering to avoid baseline drifts. This was done separately for training and test datasets, thus creating 16 new composite time series (2 for each of the 8 categories), each containing 6 blocks of stimulus of a unique category (see Fig. 1). For each of these, we identified the component of interest generated by ICA; because the analysis used a dataset with trials corresponding to 1 category only, we expected these components to contain category-specific information. From the resulting 16 maps, we computed the correlations between C_{c,training} and C_{c′,test} within and between category, i.e., for c = c′ and c ≠ c′. The results are given in Fig. 2 Left, for both InfoMax and FastICA; the within-category correlations are in most cases significantly higher than the between-category scores, leading to an identification accuracy of 82% for FastICA and 89% for InfoMax, significantly better than chance level, albeit not as high as the 96% identification accuracy obtained in ref. 9. This high identification accuracy confirmed that category-specific information was indeed present in the CTR components estimated from the composite runs; it also confirmed that the preliminary dimension-reducing PCA step had not removed the correlations observed in ref. 9.
Next, we concentrated on the "off-peak" information in these ICA components. Following the lead of ref. 9, we partitioned S into 8 subsets S_{c} and examined the correlations computed over the amputated zones S\(S_{c} ∪ S_{c′}).
It is standard practice, in the use of ICA for fMRI, to threshold the CTR component obtained. If a binary ground truth of activation is available, this can be done by means of ROC curves. Typically such "ground truth" maps are not available, and one thresholds based on the deviation from the mean in the distribution of the voxel amplitudes (see, e.g., refs. 13 and 14). The thresholds determined by standard practice are fairly high, retaining only the voxels with the very highest response. A post hoc analysis of the ICA CTR maps in our case shows that this would have eliminated, on average, >60% of the voxels in the off-peak regions; restricting the correlation computations to only the remaining voxels reduced the identification accuracy to much smaller values, for both FastICA and InfoMax. This shows the importance of voxels that experience lower-amplitude activation, and of the spatial variation of this lower-amplitude activation, in encoding category-specific information; it also shows that the ICA algorithms were sufficiently sensitive to pick up the lower-amplitude spatial variation in the distributed category-specific components that enabled good identification accuracy from the off-peak regions. (As stated above, the ICA algorithms did not pick out the category-specific activation patterns from the original data; we refer here to their effectiveness in capturing the correct spatial variation in amplitude of the activation patterns, even among amplitude levels that are normally discounted.)
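The effect of such a high amplitude threshold on a distributed low-amplitude pattern can be illustrated on synthetic data (all numbers below are invented for illustration, not taken from the study):

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic CTR map: background noise, 50 "peak" voxels, and a distributed
# low-amplitude pattern over 2,000 voxels
cmap = rng.normal(0.0, 1.0, size=10_000)
cmap[:50] += 8.0        # strongly activated peak voxels
cmap[50:2050] += 1.5    # low-amplitude distributed activation

# a fairly high threshold: k standard deviations above the mean of the map
k = 3.0
kept = cmap > cmap.mean() + k * cmap.std()

# the peaks survive the threshold; most low-amplitude voxels are discarded
print(kept[:50].mean(), kept[50:2050].mean())
```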
This ICA analysis of the data of ref. 9 shows that InfoMax and FastICA "work": they capture fine-grained spatially varying information, even though the independence hypothesis is surely violated. We next examine "independence" more closely.
Mathematical Independence: Generalities
Two stochastic variables X and Y are independent if their joint distribution is the product of their marginal distributions. More precisely, X and Y are independent if, for all possible choices of the 4 numbers a ≤ b and c ≤ d, we have Prob(a ≤ X ≤ b | c ≤ Y ≤ d) = Prob(a ≤ X ≤ b), i.e., the probability of observing X in an interval is the same whether or not we condition the values of Y to a subrange. For X and Y to be independent, it is necessary and sufficient that their mutual information equal zero: ∑_{k,ℓ}P_{k,ℓ}^{X,Y} log(P_{k,ℓ}^{X,Y}) − ∑_{k}P_{k}^{X} log(P_{k}^{X}) − ∑_{ℓ}P_{ℓ}^{Y} log(P_{ℓ}^{Y}) = 0, where P_{1}^{A},…,P_{ℓ}^{A},… are the probabilities with which a random variable A takes its different values a_{1},…,a_{ℓ},…. Requiring independence of 2 random variables is a very strong condition. Independent random variables are uncorrelated; the converse is not true.
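For discrete variables, this mutual-information criterion can be computed directly from the joint probability table; a minimal sketch with invented tables:

```python
import numpy as np

def mutual_information(P):
    """Mutual information (in nats) of a discrete joint probability table P[k, l]."""
    Px = P.sum(axis=1)                 # marginal distribution of X
    Py = P.sum(axis=0)                 # marginal distribution of Y
    nz = P > 0                         # use the 0 log 0 = 0 convention
    return float(np.sum(P[nz] * np.log(P[nz] / np.outer(Px, Py)[nz])))

# independent case: the joint table is the product of the marginals, so MI = 0
Px = np.array([0.2, 0.8])
Py = np.array([0.5, 0.3, 0.2])
mi_indep = mutual_information(np.outer(Px, Py))

# dependent case: MI > 0
P_dep = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
mi_dep = mutual_information(P_dep)
print(mi_indep, mi_dep)
```

This form, ∑ P log[P/(P^X P^Y)], is algebraically identical to the three-sum expression in the text.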
Given linear mixtures Z_{r} = ∑_{s = 1}^{S}M_{r,s}X_{s}, r = 1,…,R of independent random variables X_{s}, s = 1,…,S (with R ≥ S), one can recover the X_{s} by identifying the S × R matrix W for which the combinations ∑_{r = 1}^{R}W_{s,r}Z_{r} are independent. Often this matrix W is identified via an iterative algorithm that seeks to minimize the mutual information

∑_{s = 1}^{S}H(Y_{s}) − H(Y_{1},…,Y_{S}),  [1]

where Y_{s} = ∑_{r = 1}^{R}W_{s,r}Z_{r} and H denotes the entropy, H(Y) = −∑_{k}P_{k}^{Y} log(P_{k}^{Y}) for a discrete variable Y.
Many ICA algorithms minimize a proxy functional instead of Eq. 1 or simplify the optimization by making assumptions about the distributions of the components. This is the case for InfoMax and FastICA: Both algorithms perform better if the components have distributions of type p(x) = C exp(−α|x|^{γ}), with γ ≠ 2 (1, 2, 5).
Mathematical Independence: Some Simulations
In this numerical study of InfoMax and FastICA (15), we examine the respective roles of independence, sparseness, and separatedness in the success of the algorithms. We experiment on simple models where it is easy to change each of these characteristics separately.
In the InfoMax and FastICA algorithms, as adapted to brain-fMRI studies, the data are first prewhitened via a PCA analysis, leading to X^{WH.}(t,v) = ∑_{ℓ = 1}^{L}M_{ℓ}^{WH.}(t)C_{ℓ}^{WH.}(v). Next, the algorithms identify an (orthogonal) L × L matrix W such that the C_{ℓ}(v) = ∑_{k = 1}^{L}W_{ℓ,k}C_{k}^{WH.}(v) are as "independent" as possible. FastICA determines W such that the values (C_{ℓ}(v))_{v ∈ V} are distributed as non-Gaussianly as possible, as measured by the kurtosis or the negentropy. In InfoMax, the different components are fed into a neural network optimizing the mutual information of the "output components." For details, see refs. 5 and 16. These ICA algorithms fail when the components to be separated have Gaussian distributions.
We construct input mixtures (of independent or dependent components) for InfoMax and FastICA, and run the algorithms to produce their estimate of the “unmixed” components.
Both algorithms come in several forms; we selected the forms corresponding to the implementations applied to real fMRI data, described above. The selected InfoMax algorithm is adapted to heavy-tailed components; the nonlinear function characterizing its neural network is 1/(1 + e^{−x}). For the FastICA algorithm, the nonlinear function approximating the negentropy is (in the notation of ref. 5) g(y) = y^{3}; the iteration follows the "symmetric approach".
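A minimal, self-contained version of the symmetric fixed-point FastICA iteration with g(y) = y^{3}, run on a toy mixture of 2 heavy-tailed sources; this is a sketch of the textbook algorithm under invented data, not of the FSL implementation:

```python
import numpy as np

def symmetric_fastica(Z, n_iter=200, seed=0):
    """Symmetric fixed-point FastICA with g(y) = y^3 on whitened data Z (L x N)."""
    L, N = Z.shape
    rng = np.random.default_rng(seed)
    W = np.linalg.qr(rng.normal(size=(L, L)))[0]    # random orthogonal start
    for _ in range(n_iter):
        Y = W @ Z
        # fixed-point update: w <- E[z g(w.z)] - E[g'(w.z)] w, with g(y) = y^3
        W_new = (Y ** 3) @ Z.T / N - 3.0 * (Y ** 2).mean(axis=1)[:, None] * W
        U, _, Vt = np.linalg.svd(W_new)             # symmetric decorrelation:
        W = U @ Vt                                  # nearest orthogonal matrix
    return W

rng = np.random.default_rng(3)
# 2 independent heavy-tailed (Laplace) sources, scaled to unit variance
S = rng.laplace(size=(2, 20_000)) / np.sqrt(2.0)
A = np.array([[0.5, 0.5],
              [0.3, 0.7]])                          # mixing matrix of the examples
X = A @ S

# whiten the mixtures before running the iteration
E, V = np.linalg.eigh(X @ X.T / X.shape[1])
whiten = np.diag(E ** -0.5) @ V.T
W = symmetric_fastica(whiten @ X)

# overall source-to-output transform; should be close to a signed permutation
P = W @ whiten @ A
print(np.round(np.abs(P), 2))
```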
Our examples are deliberately chosen to be simple, with few parameters. We consider only 2 components C_{1} and C_{2}, and 2 "observations," at times t_{1}, t_{2}. Each component is a realization of a stochastic process that is itself a composite of 2 processes: one "activation process," restricted to a subset of V, and a "background process" on its complement; the components are both of the type C_{i}(v) = χ_{Vi}(v)x_{v}^{i} + [1 − χ_{Vi}(v)]y_{v}^{i}, i = 1,2, where the V_{i}, i = 1,2 are different subsets of V (and can be picked differently for different examples), where χ_{A} stands for the indicator function of A ⊂ V (i.e., χ_{A}(v) = 1 if v ∈ A, χ_{A}(v) = 0 otherwise), and where x^{1}, x^{2}, y^{1}, and y^{2} are 4 independent random variables, of which, as v ranges over V, the x_{v}^{i} and y_{v}^{i} are independent realizations. The "background" random variables y^{1} and y^{2} have the same cumulative density function (cdf) Φ_{y}(u) = [1 + e^{−1−u}]^{−1}; the "activation" random variables x^{1} and x^{2} also have identical cdf, equal to either Φ_{x}(u) = [1 + e^{2−u}]^{−1} or Φ_{x}(u) = [1 + e^{2(2−u)}]^{−1}, depending on the example. (For the first choice, the parameters of our ICA implementations provide optimal "detectability," in the sense that the nonlinear function defined by the parameter setting of the algorithm coincides with the cdf of the signal source; for the second, there is a slight mismatch, as can be expected in realistic applications.) Finally, the mixtures of C_{1} and C_{2} given as input to the algorithms are X(t_{1},v) = 0.5C_{1}(v) + 0.5C_{2}(v) and X(t_{2},v) = 0.3C_{1}(v) + 0.7C_{2}(v).
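The construction of such a component can be sketched by inverse-transform sampling of the logistic cdfs given above (a minimal sketch for C_{1} only, with an illustrative rectangular support):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_from_cdf(loc, size):
    """Inverse-transform sampling for the logistic cdf [1 + exp(loc - u)]^(-1)."""
    p = rng.uniform(size=size)
    return loc + np.log(p / (1.0 - p))    # u = F^{-1}(p)

shape = (100, 100)                        # the voxel grid V
chi1 = np.zeros(shape, dtype=bool)        # indicator of the activation support V1
chi1[10:40, 20:70] = True

x1 = sample_from_cdf(2.0, shape)          # activation process, cdf [1+e^(2-u)]^-1
y1 = sample_from_cdf(-1.0, shape)         # background process, cdf [1+e^(-1-u)]^-1

# C1(v) = chi_V1(v) x_v + (1 - chi_V1(v)) y_v
C1 = np.where(chi1, x1, y1)
```

Inside the support the values are centered near 2 (the location of Φ_{x}); outside, near −1 (the location of Φ_{y}).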
The joint distribution function of the components C_{1} and C_{2} is easy to compute. The probability density functions (pdf) of x and y are, respectively, φ_{x}(u) = Φ_{x}′(u) and φ_{y}(u) = Φ_{y}′(u) = (1/4)(cosh[(u + 1)/2])^{−2}. The pdfs ψ_{1} and ψ_{2} of C_{1} and C_{2}, respectively, are then given by ψ_{1}(u) = (#V_{1})/(#V)φ_{x}(u) + [1 − (#V_{1})/(#V)]φ_{y}(u), ψ_{2}(u) = (#V_{2})/(#V)φ_{x}(u) + [1 − (#V_{2})/(#V)]φ_{y}(u). Likewise, the joint pdf ψ_{{1,2}} of C_{1} and C_{2} is

ψ_{{1,2}}(u_{1},u_{2}) = [(#V_{12})/(#V)]φ_{x}(u_{1})φ_{x}(u_{2}) + [(#V_{1} − #V_{12})/(#V)]φ_{x}(u_{1})φ_{y}(u_{2}) + [(#V_{2} − #V_{12})/(#V)]φ_{y}(u_{1})φ_{x}(u_{2}) + [1 − (#V_{1} + #V_{2} − #V_{12})/(#V)]φ_{y}(u_{1})φ_{y}(u_{2}),  [2]

where V_{12} = V_{1} ∩ V_{2}.
In each example below, the algorithms start from the 2 mixtures and are asked to “unmix” them; in all examples, a simple visual inspection of the mixtures already clearly indicates that there are several different components; the success of the algorithms is judged by the extent to which the “unmixed” output components they provide are close to the original 2 components, i.e., show a contrast boundary at only the edges of V_{1} for one of the components and only at the edges of V_{2} for the other.
Consider now the following 4 examples, each specified by the choices of V_{1}, V_{2}, and the cdf Φ_{x}. These examples are illustrated in Fig. 3, with V = {1,…,100}×{1,…,100}.
Example 1. In this example, V_{1} = {11,…,40}×{21,…,70}, and V_{2} = {31,…,80}×{41,…,80}. By Eq. 2, C_{1} and C_{2} are independent. For the cdf Φ_{x} we choose the first of the 2 options above, Φ_{x}(u) = [1 + e^{2−u}]^{−1}, for which the parameter settings of the algorithms are optimally matched.
Example 2. All choices are identical to Example 1, except that the cdf Φ_{x} is picked differently: Φ_{x}(u) = [1 + e^{2(2−u)}]^{−1}, the second, slightly mismatched, option above.
Example 3. A different variant of Example 1: All choices are identical, except the location of V_{2}, now shifted to {43,…,92}×{53,…,92}. The pdfs of C_{1} and C_{2} are unaffected. The new V_{2} and V_{1} do not intersect, i.e., C_{1} and C_{2} are spatially separated, not independent. Fig. 3, Case 3 shows that both InfoMax and FastICA successfully identify the 2 original components C_{1} and C_{2}.
Example 4. The sets V_{1} and V_{2} are as in Example 3, but Φ_{x} is as in Example 2; C_{1}, C_{2} are not independent. Fig. 3, Case 4 shows that both algorithms successfully separate C_{1} and C_{2}.
Examples 3 and 4 show that InfoMax and FastICA can identify different but not truly independent components; this is not surprising, because these components, even if not independent, may give the “least dependent” decomposition.
In Examples 2 and 4, the components are very similar—they differ only by a shift of V_{2}, resulting in truly independent C_{1}, C_{2} in Example 2, but some dependence in Example 4. Yet the 2 algorithms fail to identify the correct components in Example 2 (the independent case) and succeed in Example 4. The failure of both algorithms to identify these truly independent components is due to the fairly strong assumptions about the pdfs of the individual components. For comparison purposes, we also analyzed the mixtures from Example 2 with a more sophisticated ICA algorithm [see supporting information (SI)] that has fewer underlying assumptions on the pdfs of the constituent components, at the price of taking longer to converge. This alternative ICA algorithm "learns" not only the components but also their pdfs, modeling them as a mixture of Gaussians whose parameters are "learned." Fig. 3 Right shows that this more complex algorithm identifies the correct independent components in Example 2, and does less well in the separated case (Example 4), more consistent with what one expects from an ICA algorithm. ICA algorithms of this type do not seem to be used in fMRI studies; in our trials, they did not perform well on fMRI data.
To further investigate the relative influence of separation and independence on both InfoMax and FastICA, we repeated the experiment for more gradual shifts of V_{2}: for α ranging from −15 to 15, we pick V_{2}^{α} = {31+α,…,80+α}×{41+α,…,80+α}, leaving all the other settings the same as in Examples 2 and 4. In this family, α = 0 corresponds to mathematical independence (= Example 2), whereas we have complete separation (no overlap of V_{1}, V_{2}) for α ≥ 10. For each α, we find the unmixing matrix W_{α}, with both FastICA and InfoMax, and quantify the quality of the decomposition by the norm n_{α} = ∥Id − W_{α} · M∥ of the difference between the 2 × 2 identity matrix and the product of the unmixing matrix W_{α} and the mixing matrix M. For an accurate decomposition, this product is close to the identity, and n_{α} is close to 0. The less accurate the decomposition, the larger n_{α}. The value of n_{α} depends on the particular realizations of C_{1}, C_{2}; we computed (for each α) 99 different realizations, and we show the median, 1st and 3rd quartiles, and the highest and lowest values of n_{α} in Fig. 4 (leftmost images). "Success of separation" can also be judged by visual inspection; by this criterion, the components were visually perfectly separated whenever n_{α} < 0.2; they were never separated when n_{α} > 0.3.
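The quality metric n_{α} = ∥Id − W_{α}·M∥ is straightforward to compute; the sketch below evaluates it for a perfect unmixing and for one with the components swapped, using the mixing matrix of the examples:

```python
import numpy as np

M = np.array([[0.5, 0.5],
              [0.3, 0.7]])                 # mixing matrix of the examples

def quality(W, M):
    """n = || Id - W M || (Frobenius norm); close to 0 for an accurate unmixing."""
    return float(np.linalg.norm(np.eye(2) - W @ M))

W_exact = np.linalg.inv(M)                 # perfect unmixing: n ~ 0
W_swapped = W_exact[::-1]                  # rows permuted, components swapped: n = 2
print(quality(W_exact, M), quality(W_swapped, M))
```

Note that this raw norm also penalizes an otherwise-correct decomposition whose components come out permuted or sign-flipped, so in practice the output components would be matched to the sources before computing n_{α}.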
In addition to this "medium boxes" family of examples, where the components occupy, respectively, 15% and 20% of the 100 × 100 voxels in V, we repeated the numerical experiment with both sparser and less sparse choices for V_{1}, V_{2}. In the "small boxes" family, V_{1} = {41,…,60}×{31,…,50}, V_{2,α} = {57+α,…,81+α}×{46+α,…,65+α}, occupying 4% and 5%, respectively, of V; in the "large boxes" family, V_{1} = {1,…,48}×{1,…,100}, V_{2,α} = {25+α,…,74+α}×{1,…,100}, occupying 48% and 50% (see top of Fig. 4).
Within each family, Fig. 4 shows the relative importance of independence. For medium and large boxes, the dip at α = 0 shows that FastICA is most successful when the components are indeed independent; InfoMax becomes successful only for larger α, when there is (almost) no overlap. For small boxes, both algorithms identify the components for all α, even when the overlap is large. Comparison across the 3 families shows that sparsity of the components (measured by #V_{2}/#V, #V_{1}/#V) affects the success rate of either algorithm much more than (in)dependence (measured by the deviation from the independence condition).
Mathematical Independence: fMRI Experiments
The fMRI experiments discussed here parallel the simulations above; they are inspired by ref. 7. We analyze the results of experiments with 2 components, depending on 1 parameter, 1 value of which gives 2 independent components; for other values there is either more or less overlap. We apply InfoMax and FastICA and discuss their rates of success for different parameter values (17).
Experimental Design.
The experimental paradigm consisted of stimulating the right and left visual hemifields using a pair of 8-Hz flashing checkerboard wedges. The subjects were asked to focus on a bright dot at the center of a circular field; the visual stimulus consisted of flashing checkerboards in large wedges of this field. Two parameters, θ and α, characterized each run, quantifying the spatial and temporal overlaps in the paradigm.
The value of θ determined the positions of the 2 flashing checkerboard wedges. Both wedges spanned an angle of 120°. One wedge, S_{1}, remained in the same position; S_{2}(θ) was either completely separated from or overlapped with S_{1}. In the overlap, the contrast of the flashing checkerboard was more pronounced than in regions covered by only 1 wedge. (See Fig. 5.) The contrast levels of the photic stimulations (5% for nonoverlapping wedges, 100% for overlapping wedges) ensured the linearity of the model.
The values of θ and the spanning angles were selected so that the experiment mirrored the setup of the numerical experiments above: for θ = 100, the overlap of the wedges satisfies

area[S_{1} ∩ S_{2}(100)]/area(D) = [area(S_{1})/area(D)] · [area(S_{2}(100))/area(D)],  [3]

where D denotes the full disk: each 120° wedge covers 1/3 of the disk, so independence requires an overlap covering 1/9 of the disk, i.e., a 40° overlap. Within the full disk, the indicator functions of the wedges S_{1} and S_{2}(100) are thus independent. For θ = 140, the overlap is larger; for θ = 0, the wedges are completely separated. Because retinotopic mapping preserves the relative importance of spatial areas, spatial independence of the visual stimuli translates (in first approximation) into spatial independence of the corresponding retinotopic activity patterns. Within brain areas that preserve retinotopic organization, we thus have the same (or similar) spatial relationships between the activity patterns for the wedges as for the wedge stimuli themselves.
A second parameter, α, indexes shifts in the time paradigms for the 2 wedges. (Even though we studied only SICA, we wanted to test dependence on shifts in the time paradigm seen in ref. 7.) For each α, both wedges are "on" ("off") half the time and follow identical time patterns; the only difference is a time lag for S_{2}. Three values of α were used, corresponding to different time lags for S_{2}.
Image Acquisition and Preprocessing Stages.
Whole brain images were acquired for 3 healthy subjects (2 males, 1 female) with a 3T Siemens Allegra scanner. A T1-weighted structural image was acquired to localize the anatomy of the activated areas. The functional images were acquired by using a gradient-echo EPI sequence (TR = 1,000 ms; flip angle = 60°; field of view 192 mm × 192 mm; matrix 64 × 64; slice thickness 7 mm; distance factor of 13%); each run contained 244 volumes of 15 axial slices acquired in the anterior-to-posterior phase encode direction. Preprocessing stages included the application of the FSL FLIRT motion correction algorithm (12) on all EPI images after discarding the first 6 volumes of each run; in order to include only brain voxels in the analysis, we applied a brain extraction algorithm [FSL BET (12)] to the motion-corrected data.
Data Analysis of the Preruns.
We performed a general linear model (GLM) regression (6) on each of the preruns using the FSL FEAT package (12), to establish a “ground truth” activation map for each of the individual components. Voxels for which the time series, restricted to stimulus periods, showed a departure of at least 2 standard deviations from their average rest behavior were designated as true positives. The activation map resulting from the stimulation of the full flashing checkerboard was used for a second analysis (see below). The β maps B_{1}, B_{2,θ} from the GLM analysis on the preruns were used to define ground truth maps for the brain patterns created by visualization of the individual wedges S_{1}, S_{2}(θ).
Using standard GLM statistical analysis, we also defined (separately, for each of the 3 preruns, for each session) binary ground truth maps G(v). We first extracted from (F(v,t_{n}))_{n=1,…,N} the component that most distinguished between “on” and “off” paradigms, and evaluated its significance by comparing it with the remainder of the demeaned (F(v,t_{n}))_{n=1,…,N}; when the distinguishing component was significant at the P = 0.01 level for both v and a contiguous neighbor v′ ≠ v we set G(v) = 1, otherwise G(v) = 0.
Application of SICA.
We carried out both InfoMax and FastICA analyses on the experimental runs for all (α,θ) pairs, using the same methods and software packages as for the ICA analysis of the Haxby data (see above); this included, in particular, a dimensional reduction via PCA, with a dimensional cutoff chosen so that the output component strength did not change when the dimension was increased. For each of the separated components produced by the ICA algorithm, we computed correlation scores r_{1}(ℓ) and r_{2,α}(ℓ) between its associated time course M_{ℓ}(t) and the 2 time-paradigm functions φ_{1}(t) and φ_{2,α}(t). The component with the highest value of r_{1} (respectively r_{2,α}) was identified as the CTR component map C_{1}(v) [C_{2,θ}(v)] corresponding to S_{1} (S_{2,θ}). (This classification is equivalent to classification based on the GLM threshold τ.) To assess and compare the "quality" of the activity maps produced by the ICA algorithm for the different experimental designs, receiver operating characteristic (ROC) curves were used.
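An ROC curve for a component map against a binary ground-truth map can be computed by sweeping the detection threshold over the map values; a numpy-only sketch on invented toy data:

```python
import numpy as np

rng = np.random.default_rng(5)

# invented toy data: 10% of voxels truly active, and a component map whose
# values are higher (on average) over the truly active voxels
truth = rng.uniform(size=5_000) < 0.10
cmap = rng.normal(size=truth.size) + 3.0 * truth

def roc(scores, truth):
    """False and true positive rates as the detection threshold is swept."""
    order = np.argsort(-scores)            # voxels by decreasing map value
    tp = np.cumsum(truth[order])           # true positives above each threshold
    fp = np.cumsum(~truth[order])          # false positives above each threshold
    return fp / (~truth).sum(), tp / truth.sum()

fpr, tpr = roc(cmap, truth)
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))   # trapezoid rule
print(round(auc, 3))
```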
Results for the Different Experimental Conditions.
For each α, i.e., for each of the 3 time paradigms, we compare the quality of the component separation by ICA for the 3 values θ = 0, 100, and 140. The spatial paradigm (and thus also the retinotopic brain activity) has 2 independent components when θ = 100°, because

area(S_{1} ∩ S_{2,100})/area(D) = [area(S_{1})/area(D)] · [area(S_{2,100})/area(D)],  [4]

where D is the whole disk; this translates to a similar relation for the corresponding activation patterns,

#(V_{1} ∩ V_{2,100})/#(V_{D}) = [#(V_{1})/#(V_{D})] · [#(V_{2,100})/#(V_{D})],  [5]

where V_{1}, V_{2,100} are the sets of voxels activated by the wedges and V_{D} is the set of voxels activated by the full disk.
In the numerical simulations leading up to Fig. 4, we explored, for each of the small, medium, and large boxes families, the deviation from "true" independence in a fine-grained manner (varying the overlap in small increments) that is not possible in our experimental setting. The distinction among the 3 families can be mimicked, however. Eq. 4 explicitly refers to the disk D as the reference with respect to which independence of S_{1} and S_{2,100} is assessed (i.e., D plays here the role of the square V in the numerical study). This means that the corresponding components C_{1} and C_{2,100} can be expected to be independent if we compare them within the region of voxels activated by the full disk, rather than within the whole brain volume.
We carried out 3 different ICA analyses for each (α,θ) pair, varying the “horizon region” with respect to which the separated components are compared: the full square field of view V, an intermediate region I, and the disk D.
Fig. 7 shows the results of our experiments, for both InfoMax and FastICA, for the 9 different (α,θ) pairs and the 3 different choices of horizon region.
For both algorithms we observe that the quality of the separation depends on the choice of horizon region; these differences are quantified by the AUC comparisons below.
The ROC curves in Fig. 7 were for the data from 1 subject only. The results for the other 2 subjects were similar. Fig. S10 gives average AUCs over all 3 subjects; because the region 0 ≤ false positive rate ≤ 0.05 is the one of most interest, we restricted the AUC computation to this region only.
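A restricted AUC of this kind can be computed directly; the interpolation grid below is our choice, not from the paper.

```python
import numpy as np

def partial_auc(fpr, tpr, fpr_max=0.05):
    """Area under the ROC curve restricted to 0 <= FPR <= fpr_max,
    with linear interpolation between the given operating points."""
    grid = np.linspace(0.0, fpr_max, 201)
    y = np.interp(grid, fpr, tpr)   # fpr must be increasing for np.interp
    # trapezoid rule over the restricted region only
    return float(np.sum((grid[1:] - grid[:-1]) * (y[1:] + y[:-1]) / 2.0))

# sanity check: the chance diagonal TPR = FPR has restricted area fpr_max**2 / 2
print(partial_auc([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]))
```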
The averaged AUCs for the different cases illustrate the relative importance of independence, separation, and sparsity.
Role of independence.
When the reference region is the disk D, the components S_{1} and S_{2,100} are truly independent (Eq. 4); nevertheless, the θ = 100 case does not perform better than θ = 0 or θ = 140, and AUC_{D} is systematically the lowest of the 3 horizon-region choices (see Fig. S10). Independence with respect to the reference region thus brings no visible advantage.
Role of separation.
For the parameter choices for which the error bars in Fig. S10 are not too wide (i.e., the cases where the mean AUC is most meaningful), there is virtually no difference between the cases θ = 0 (separated activity patterns) and θ = 140 (substantial overlap). The influence of spatial separation of the components, very noticeable in the numerical simulations in the previous section, is thus not so apparent here. However, the θ = 140 case shows substantial differences in effectiveness over the 3 test subjects, so it is not clear how much we can rely on it.
Role of sparsity.
For most (θ,α) [including all (θ,α) with small error bars], AUC_{V} (black) > AUC_{I} (red) > AUC_{D} (blue), for both InfoMax and FastICA. Switching to a larger horizon region makes each component occupy a smaller fraction of that region, i.e., it makes the components effectively sparser; the AUC increases accordingly.
In this experiment, the factor that most influences the success rate of the ICA algorithms is thus sparsity of the components; moreover, this is slightly more marked for InfoMax than for FastICA. Independence of the components plays a marginal role, at best.
Remark. The mention of sparsity in the introduction was within the framework of justifying the use of Independent Component Analysis in brain fMRI analysis: even if components were not truly independent, the heuristic argument went, they were “close to independent” if very sparse. The experiment described above is important in that it teases apart independence and sparsity; it points to sparsity (in its own right, not as promoting independence) as the crucial factor.
Independence Versus Sparsity: Discussion
Summary of Our Findings So Far.

InfoMax and FastICA can capture, with considerable accuracy, fine-grained spatial variability in activity patterns.

In our fMRI brain experiments, InfoMax systematically gave better results than FastICA.

In the purely numerical simulations with the small, medium, and large boxes, InfoMax is barely selective for independence; FastICA shows more selectivity towards independence. For both algorithms, sparsity of the components is more important.

Our second fMRI experiment confirms that sparsity affects the success of InfoMax and FastICA more than independence.

A more sophisticated ICA algorithm that performs better than InfoMax and FastICA on independent medium boxes, and is more sensitive to independence, performs less well on the fMRI experiments.
This strongly suggests that when InfoMax and FastICA are effective in brain fMRI, the underlying reason may be other than their striving for independence. Let us now examine sparsity more closely.
Sparsity.
We call a vector v = (v_{1},v_{2},…,v_{N}) B-sparse (B ≪ N) if at most B of its entries differ from 0. This notion is basis-dependent; the basis with respect to which the B-sparseness is defined must be specified. One can also define sparse random vectors, i.e., sparse vectors whose components are random variables (r.v.). Fig. 8 Left shows realizations of noisy 2-dimensional 1-sparse random vectors, generated as follows: r := γr_{1} + (1−γ)r_{2}, where γ is an r.v. that takes values 1 and 0 only; r_{1},r_{2} are the 2-dimensional random vectors r_{ℓ} := α_{ℓ}a + β_{ℓ}b, where α_{1},β_{2} have the same pdf p(t), and α_{2},β_{1} have the rescaled pdf p′(t) = 10 × p(10t). It follows that r is mostly aligned either with a or with b.
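This construction is easy to sample; in the sketch below the Gaussian choice for p(t), the particular directions a and b, and the sample size are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
a = np.array([1.0, 0.0])                      # two non-orthogonal unit vectors
b = np.array([0.5, np.sqrt(3.0) / 2.0])       # 60 degrees apart (our choice)

p  = lambda size: rng.normal(0.0, 1.0, size)  # pdf p(t): here a unit Gaussian
pp = lambda size: rng.normal(0.0, 0.1, size)  # p'(t) = 10 p(10t): scale / 10

# r_1 lies mostly along a, r_2 mostly along b; gamma selects one (M_1: 50%)
gamma = (rng.random(n) < 0.5).astype(float)
r1 = np.outer(p(n), a) + np.outer(pp(n), b)
r2 = np.outer(pp(n), a) + np.outer(p(n), b)
r = gamma[:, None] * r1 + (1.0 - gamma)[:, None] * r2

# check: most realizations are nearly aligned with either a or b
u = r / np.linalg.norm(r, axis=1, keepdims=True)
align = np.maximum(np.abs(u @ a), np.abs(u @ b))
print(round((align > 0.9).mean(), 2))
```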
Very similar figures are often used as “promotional material” for ICA: In this example it is clearly desirable to identify (from the data) the 2 vectors a and b; because these are not orthogonal, they cannot possibly be the outcome of a PCA. Hence (the argument goes) the need for ICA. Yet Fig. 8 depicts processes with 2 sparse rather than independent components: If one conditions r on having a large inner product with b, the distribution of its component along a will be affected. The 2 images of Fig. 8 differ only in that γ equals 1 in 50% of the realizations for M_{1}, but in only 30% for M_{2}. When given this input, both InfoMax and FastICA identify the 2 special directions a and b correctly as the components. However, in the example given here, p(t) is Gaussian; because ICA methods cannot separate mixtures of independent Gaussian processes, the successful separation of components by InfoMax and FastICA underscores again their ability to identify sparse components.
Algorithms for Brain fMRI Adapted to Sparsity.
We conclude that, rather than decomposition methods that search for independent components, one should develop alternative decomposition methods, still striving to write X(t,v) = ∑_{ℓ = 1}^{L} M_{ℓ}(t)C_{ℓ}(v), but targeting decompositions into components that are optimally sparse (only a small number of voxels play a role in each C_{ℓ}) and/or separated (the number of voxels playing a role in more than one C_{ℓ} is small).
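A toy instance of this target decomposition (with hypothetical sizes and supports) makes the sparsity and separation criteria concrete:

```python
import numpy as np

rng = np.random.default_rng(2)
T, Vox, L = 50, 400, 2                      # time points, voxels, components

# sparse component maps C_l(v): few active voxels, small overlap
C = np.zeros((L, Vox))
C[0, 10:30] = rng.random(20) + 0.5          # component 0 active on voxels 10..29
C[1, 25:45] = rng.random(20) + 0.5          # component 1 active on voxels 25..44

M = rng.normal(size=(T, L))                 # time courses M_l(t)
X = M @ C                                   # the model X(t,v) = sum_l M_l(t) C_l(v)

n_active = [int((c != 0).sum()) for c in C]       # sparsity: active voxels per map
n_shared = int(((C[0] != 0) & (C[1] != 0)).sum()) # separation: shared voxels
print(n_active, n_shared)                   # prints [20, 20] 5
```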
It is a good time to look for algorithms involving sparsity. Important progress has been made recently on the problem of recovering a sparse vector from underdetermined linear information. More precisely, if A is an L × N matrix, with L ≪ N, then it is typically impossible to recover an N-dimensional vector u from its image w = Au, or to recover a close approximation to u from a perturbed image w with [∑_{ℓ = 1}^{L}[w_{ℓ} − (Au)_{ℓ}]^{2}]^{1/2} ≤ ɛ. Often, one introduces constraints or penalizations to this type of problem to make it well-defined and well-posed. The knowledge that u is B-sparse, or is close to a B-sparse vector (with B < L), is now known (18–21) to be a sufficient constraint to achieve this, even if the identity of the nonvanishing entries of u is unknown. (There are, of course, technical conditions on A, involving L, N, and B, and satisfied by large classes of matrices, that we shall not discuss here.) The mathematical study of this problem uses insights from computer science, statistics, nonlinear approximation, and the geometry of finite-dimensional Banach spaces. Several approaches have been proposed; of special interest are the ultrafast (sublinear in N) methods developed, e.g., in ref. 22, and the ℓ^{1}-optimization methods in refs. 18–21, 23–25, equivalent to ℓ^{1}-penalization methods (26). ℓ^{1} penalization can be connected with InfoMax and FastICA. Both implicitly assume that the independent components have pdfs of the form p(u) = Ce^{−α|u|^{γ}}. Using this same assumption as a prior for independently distributed components in a Bayesian framework, in which X(v,t) is viewed as ∑_{ℓ}M_{ℓ}(t)C_{ℓ}(v) corrupted by Gaussian noise, leads to the search for C_{ℓ} that maximize

C exp(−(1/2σ^{2})∑_{v,t}[X(v,t) − ∑_{ℓ}M_{ℓ}(t)C_{ℓ}(v)]^{2} − α∑_{ℓ}∑_{v}|C_{ℓ}(v)|^{γ}),

where C contains all the appropriate normalization constants, and we assume the M_{ℓ} are normalized. Maximizing this amounts to maximizing the exponent; for γ = 1, this is an ℓ^{1}-penalized functional, also used to obtain sparse decompositions.
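A minimal sketch of sparse recovery by ℓ^{1} penalization, via iterative soft-thresholding (ISTA); the dimensions (written m here for the number of measurements), the Gaussian matrix, the penalty weight, and the iteration count are our illustrative choices, not taken from refs. 18–26.

```python
import numpy as np

rng = np.random.default_rng(3)
m, N, B = 40, 120, 4                           # m measurements, N unknowns, m << N
A = rng.normal(size=(m, N)) / np.sqrt(m)       # Gaussian measurement matrix

u = np.zeros(N)                                # B-sparse ground truth
support = rng.choice(N, size=B, replace=False)
u[support] = rng.choice([-1.0, 1.0], size=B) * (1.0 + rng.random(B))
w = A @ u                                      # noiseless underdetermined data

# ISTA: gradient step on 0.5*||w - Ax||^2, then soft-thresholding,
# which is the proximal map of the l1 penalty lam*||x||_1
lam = 0.01
step = 1.0 / np.linalg.norm(A, 2) ** 2         # 1 / (spectral norm squared)
x = np.zeros(N)
for _ in range(10000):
    x = x + step * (A.T @ (w - A @ x))
    x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)

rel_err = np.linalg.norm(x - u) / np.linalg.norm(u)
print(round(rel_err, 4))
```

With far fewer measurements than unknowns, the ℓ^{1} penalty alone drives the iterates to the sparse ground truth, illustrating the recovery guarantee discussed above.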
This may explain why InfoMax and FastICA are good at identifying sparse components.
In the quest for algorithms that are effective for fMRI, based on sparsity of either the components or their intersections, it will be important to determine the basis with respect to which one expects/hopes for sparsity. In fMRI experiments subjects have as few distractions from the task at hand as possible, in the hope of minimizing brain activity unrelated to the task, so that functional activation maps are easier to isolate. If this means the signal of interest is confined to a small region only, with little overlap with other activation maps, then sparsity in the voxel domain is likely. For an experiment like the one in ref. 9, where 8 different components all live in the same region of the ventral temporal cortex, sparsity in the voxel domain may be elusive. The original experimental data were not registered so as to make reduction to a common cortical surface model possible for all the subjects. Recently, the experiment was repeated with such extra registration;† the corresponding new 2-dimensional data on the cortical surface remain to be analyzed, aiming for sparsity not in the purely spatial domain (on the cortical surface), but in, e.g., a corresponding wavelet domain, appropriate for components that have strong peaks in some small areas but behave more smoothly elsewhere.
Acknowledgments
We thank C. Beckman, D. Donoho, and R. Nowak for their comments. This work was supported in part by National Science Foundation Grants DMS0421608 and DMS0245566, National Institute of Mental Health (NIMH) Grant MH067204, and NIMH Conte Center Grant MH62196.
Footnotes
 ^{1}To whom correspondence should be addressed. E-mail: ingrid@math.princeton.edu

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected in 1998.

Author contributions: I.D., S.T., W.R., and J.D.C. designed research; I.D., E.R., S.T., M.B., C.G., and K.D. performed research; I.D., E.R., S.T., W.R., and J.H. contributed new reagents/analytic tools; E.R., S.T., M.B., C.G., and K.D. analyzed data; and I.D. and J.D.C. wrote the paper.

The authors declare no conflict of interest.

*Benharrosh M, Takerkart S, Cohen JD, Daubechies I, Richter W, Annual Meeting of the Organization for Human Brain Mapping, June 18–22, 2003, New York.

†Sabuncu MR, Singer BD, Bruan RE, Ramadge PJ, Haxby JV, Annual Meeting of the Organization for Human Brain Mapping, June 11–15, 2006, Florence, Italy.
References

Lee TW, Ziehe A, Orglmeister R, Sejnowski T
Hyvärinen A, Karhunen J, Oja E
Haxby JV, et al.
Golden C
Bell AJ, Sejnowski TJ
Takerkart S, Benharrosh M, Haxby J, Daubechies I
Candès E
Gilbert A, et al.
Gribonval R, Nielsen M
Cohen A, Dahmen W, DeVore R