Neural representations of the content and production of human vocalization

Edited by Emery Brown, Massachusetts General Hospital, Boston, MA; received November 11, 2022; accepted May 10, 2023
May 30, 2023
120 (23) e2219310120

Significance

Speech research is largely bound to the study of humans and, thus, suffers from methodological limitations. Therefore, key mechanisms of speech production remain unclear. One central question is whether neural representations of speech content and motor production can be dissociated. We addressed this with an innovative paradigm combining magnetoencephalography, advanced multivariate pattern analysis, and a rule-based vocalization task. We were able to dissociate neural representations of content and production and to further describe their temporal and spatial dynamics. Our results suggest that content has an abstract representation that allows generalization across different production forms. With this study, we answer some essential questions of speech research and provide a fruitful framework for further noninvasive research.

Abstract

Speech, as the spoken form of language, is fundamental for human communication. The phenomenon of covert inner speech implies functional independence of speech content and motor production. However, it remains unclear how a flexible mapping between speech content and production is achieved on the neural level. To address this, we recorded magnetoencephalography in humans performing a rule-based vocalization task. On each trial, vocalization content (one of two vowels) and production form (overt or covert) were instructed independently. Using multivariate pattern analysis, we found robust neural information about vocalization content and production, mostly originating from speech areas of the left hemisphere. Production signals dynamically transformed upon presentation of the content cue, whereas content signals remained largely stable throughout the trial. In sum, our results show dissociable neural representations of vocalization content and production in the human brain and provide insights into the neural dynamics underlying human vocalization.
Vocal behavior is an essential component of human communication. In particular, speech, the spoken form of language, is a highly sophisticated skill exclusive to humans. Through speech, we can encode information not only in sound (overt speech) but also in thought (covert speech). These different speech forms imply functional independence of speech content and motor production. However, it remains an open question how content and production are represented neuronally and how the brain achieves a flexible mapping between the two.
In speech, two levels need to be distinguished: the lexical level, which refers to entire words, and the sublexical level, which refers to parts of words, such as phonemes (a single vowel or consonant) or syllables (more than one vowel or consonant) (1). On both levels, speech engages a broad cortical network comprising the primary motor cortex (M1), premotor and supplementary motor cortices (PMC and SMA), as well as sensory and auditory cortices (2–8). Compared to phonemes, more complex sublexical speech leads to stronger activation in parts of the network (8), whereas lexical speech recruits additional areas for high-level speech processes like word selection and combination (6, 9, 10).
Several findings suggest that there is a neural representation of speech content which is to some degree independent of its motor production. This independence is intuitive on the lexical level. Indeed, on the lexical level, overt and covert speech were found to share a common cerebral network with similar activation patterns (11–17), which differ primarily in activation magnitudes due to a different degree of executive motor control (15, 18). Furthermore, Broca’s area was found to act as a supramodal hub, exhibiting language-specific activation independent of the production form (19).
The motor-independent representation of speech content is less intuitive on the sublexical level, where content may be expected to be more tightly bound to motor production. Still, initial evidence also suggests production-independent representations of sublexical entities like syllables and phonemes. Efference copies in overt and covert speech were found to have similar patterns not only on the lexical but also on the sublexical syllable level, except for M1 recruitment exclusive to overt speech (20–23). Furthermore, specifically on the syllable level, covert speech was found to yield articulatory representations in premotor regions, as well as acoustic representations in sensory and auditory cortices similar to overt speech (24). On the phoneme level, evidence remains sparse as most studies have been related to motor production (25–28) and efference copies were so far only described in overt speech (29). Yet, indirect evidence from phoneme-related speech errors in covert speech suggests a motor-independent phoneme representation (30).
In sum, past work suggests that neural activity underlying lexical and sublexical vocalizations represents both speech content and motor production. However, the degree of overlap between content and motor representations, and the extent to which these aspects can be dissociated, remain unclear. Furthermore, the dynamic interplay between emerging representations of content and motor production is not known, as most previous studies used neural data with inherently poor temporal resolution. These questions are particularly open on the phoneme level, where direct neuronal evidence is missing.
To address these questions, here we aimed, first, to independently manipulate and decode neuronal content and motor components of human phoneme vocalization and, second, to investigate their dynamic interplay across time. We recorded magnetoencephalography (MEG) while subjects performed a rule-based vocalization task dissociating the content and motor aspects of sublexical speech. Content (one of two vowels) and production (overt or covert) were instructed sequentially and in random order. Multivariate pattern analysis (MVPA) of time-resolved MEG data allowed us to characterize the format, overlap, and temporal dynamics of neural content and production representations.
With this approach, we were able to read out content and motor information several seconds before speech onset. The strength of neural information correlated with the degree of motor involvement. At the beginning of the trial, when only one variable was known, the isolated representations of content and production were similar. The production representation transformed once the content was known, whereas the content representation remained stable until the onset of vocalization.

Results

The Components of Vocalization Can Be Decoded Independently.

We recorded MEG while subjects performed a rule-based vocalization task. Participants had to overtly vocalize or covertly imagine the vocalization of two different vowels. During each trial, content (/u/ or /ə/) and production (vocalized or imagined) were instructed sequentially with visual cues (Fig. 1A). Each cue lasted 100 ms and was followed by a 2-s delay. At the end of the trial, a brief dimming of the fixation point served as a go cue for the onset of vocalization or imagination. The order of instruction was randomized, as was the assignment of the instructed content or production to the visual cues (Fig. 1B). Participants performed the correct production type (vocalized vs. imagined) in 97.98% of the trials. In vocalized trials, the correct vowel (/u/ vs. /ə/) was produced in 100% of the trials. We checked vocalized trials for their onset latency (possible for 37 sessions). In 98.48% of the trials, the vocal onset occurred after the go cue, as instructed. The mean vocal latency of these trials was 0.58 s (±0.12 SD).
Fig. 1.
Rule-based vocalization task. Participants imagined or vocalized different contents (phonemes /u/ and /ə/). (A) Production and the content were instructed successively with visual cues, according to the respective rule of the trial block. (B) Rules for four trial blocks per recording session. In two blocks, the content was instructed first; in the other two blocks, the production was instructed first.
For each subject and recording session, we computed neural information about content and production. We applied cross-validated multivariate analysis of variance (cvMANOVA) on preprocessed single-trial MEG data from all sensors (see Methods; see SI Appendix, Fig. S1 for example single-trial data) (31, 32). The resulting measure of neural information can be interpreted analogously to classifier performance from multivariate decoding analyses. To enable robust statistical tests, we averaged information in the time windows between cues 1 and 2 (delay 1), as well as between cue 2 and the go cue (delay 2).
We observed significant neural information about both variables (Fig. 2 and SI Appendix, Fig. S2 with individual data points). For both orders of cue presentation, we found information about the variables shortly after their respective instruction. When content was instructed first, there was significant content information in delay 1 (P = 0.003; corrected) and significant information about content and production in delay 2 (content: P = 3.4 × 10−4, production: P = 1.6 × 10−9; corrected). Conversely, when production was instructed first, content information was only present in delay 2, whereas production information was highly significant in both delays (content: pdel2 = 7.8 × 10−5; production: pdel1 = 2.8 × 10−6, pdel2 = 8.5 × 10−9; corrected). Thus, both the content of a vocalization and its production form were represented neuronally, several seconds before the actual execution.
Fig. 2.
Neural information about content and production. (A) Information in trials with content instructed first. (B) Information in trials with production instructed first. Shaded regions and error bars indicate SEM. Bar plots show average information in delays 1 and 2. Asterisks indicate significance (n = 24, P < 0.05 corrected; t test, one tailed).

The Components of Vocalization Are Modulated by Effort.

Both experimental dimensions entailed differences in motor effort. Imagined vowels lacked actual vocalization, just as /ə/, as a nonarticulated vowel, lacked the strong articulation of /u/. Therefore, we wanted to investigate whether the neural information about content was equally strong for both production types and whether the neural information about production type was identical for both vowels. To test this, we decoded content separately for both production types and production separately for both vowels (Fig. 3 and SI Appendix, Fig. S3 with individual data points).
Fig. 3.
Neural information about content and production in split conditions. (A) Production information in both orders. (B) Content information in both orders. Shaded regions and error bars indicate SEM. Bar plots show averaged information of delays 1 and 2. Asterisks above individual bars indicate significant information (n = 24, P < 0.05 corrected; t test, one tailed). Horizontal lines with asterisks on top indicate a significant difference (n = 24, P < 0.05 corrected; paired t test, two tailed).
In all conditions, information about both variables could be decoded after their respective instruction. When content was instructed first, production information was significant for both vowels, in delay 2 (/ə/: P = 7.8 × 10−9, /u/: P = 6 × 10−9; corrected). When production was instructed first, it could be read out in both delays and for both vowels (/ə/: pdel1 = 4.1 × 10−6, pdel2 = 4.8 × 10−7; /u/: pdel1 = 1.4 × 10−4, pdel2 = 1.6 × 10−9; corrected). In the second delay, production information was higher in /u/ than in /ə/ trials, but only significantly so when content was instructed first (P = 0.026, paired t test; corrected).
Content information was significant in both delays and production types when content was instructed first (imagined: pdel1 = 0.044, pdel2 = 0.044; vocalized: pdel1 = 0.004, pdel2 = 2.7 × 10−5; corrected). When production was instructed first, content information was only significant in the second delay, again in both production types (imagined: P = 0.007, vocalized: P = 1.5 × 10−4; corrected). Content information was higher during vocalized than imagined trials, in both orders and all relevant delays. This difference was significant in the second delay when content was instructed first (P = 0.007, paired t test; corrected).
In sum, both production and content information were present in all individual conditions and, furthermore, higher in those conditions with stronger motor involvement.

The Components of Vocalization Are Represented in Cortical Speech Areas.

We found that vocalization content and production were represented in cortical areas typically associated with speech. To characterize the cortical distribution of content and production information, we repeated the cvMANOVA analysis on the source level using a searchlight approach. We then averaged neural information within four 500 ms time windows per delay. This analysis revealed spatially stable representations of both variables (Fig. 4) with similar and broad frontocentral distributions that included M1, SMA, and Broca’s area (see SI Appendix, Fig. S4 for information in all areas according to the automated anatomical labeling (AAL) atlas).
Fig. 4.
Spatial dynamics of neural information about content and production. (A) Production information in both orders. (B) Content information in both orders. Left, right, and top views of the cortical distribution of information. Information was averaged in 500-ms intervals, except for the intervals directly after the cues, for which the first 250 ms was excluded. (C and D) Averaged information over all three delays where information was available for the respective variable. Information strength is color coded, as indicated by the color bar (white: zero or very low information, red: high information).
For higher-order language processes, neural activity is known to be left-lateralized, which is debated for speech on the sublexical level (28, 33). Our source patterns suggested a clear left lateralization. We tested this by computing a lateralization index (LI) for both variables (Fig. 5 and SI Appendix, Fig. S5 with individual data points). Both variables were left-lateralized in all delays after the respective instruction. This was significant for content information in the first delay when it was instructed first (P = 0.037; corrected) and in the second delay when production was instructed first (P = 0.006; corrected). Production information was significantly left-lateralized in both delays when production was instructed first (pdel1 = 0.048, pdel2 = 0.009; corrected) and in the second delay when it was instructed second (P = 0.003; corrected).
Fig. 5.
LI of content and production information. The LI (left – right hemisphere information) was computed for both delays and both orders. Asterisks indicate significance (n = 24, P < 0.05 corrected; t test, one tailed).
Taken together, source analysis showed that both content and production had stable left-lateralized representations on the cortical level. This not only identified the origin of neural information within well-known speech-associated areas but also excluded confounds due to inherently nonlateralized effects of visual cues or electromagnetic artifacts.

The Components of Vocalization Have Different Representational Formats.

The searchlight analysis indicated spatial stability of the coarse cortical distribution of content and production information across time. Nonetheless, the representational format, i.e., the fine-grained cortical pattern underlying each representation, may be dynamic. To test whether neural representations of content and production transformed across time, we cross-decoded both variables on the sensor level across time (34) (Fig. 6). Stable or dynamic representations would yield high or low cross-time decoding, respectively. We performed this analysis both on the content-first condition and on the production-first condition, as well as training on all timepoints of one condition and testing on those of the other (Fig. 6, mixed orders).
Fig. 6.
Cross-time information about content and production. (A) Cross-time production information. (B) Cross-time content information. Bar plots show the averaged cross-temporal information for delays 1, 2, and 1/2, as well as the expected values for stable representations. The diagonals ± 100 ms were excluded from calculations. Dashed squares indicate the averaged time windows. Asterisks indicate significant information (n = 24, P < 0.05 corrected; t test, one tailed), and horizontal bars with asterisks on top indicate significantly smaller cross-information than the expected value (n = 24, P < 0.05 corrected; paired t test, one tailed).
For statistical testing, we averaged neural cross-information per variable and delay such that neural information about content or production was in principle accessible at all training and test timepoints included in the statistical analysis (Fig. 6, dashed squares and bar plots). Within both delays 1 and 2, there was significant cross-temporal information about both variables (production: pdel1 = 6.3 × 10−5, pdel2 = 2.9 × 10−9; content: pdel1 = 0.027, pdel2 = 1.1 × 10−5; corrected). There was also significant cross-temporal information between delays 1 and 2 for both variables (production: P = 6.3 × 10−5; content: P = 0.012; corrected).
To test whether representations significantly differed between time points, we computed an estimate of the expected cross-time information in case of perfectly stable representational formats but potentially different information magnitudes (32). The cross-temporal stability of production representations was lower than expected in all time windows and significantly so between delays 1 and 2 and within delay 2 (pdel1/2 = 2 × 10−4, pdel2 = 7.3 × 10−6; corrected). In contrast, cross-temporal content information was never significantly smaller than expected for a temporally stable representation. Thus, while we found evidence for a partially dynamic representation of production type, this was not the case for content, which appeared stable over time.
To what extent are the neural representations of content and production similar? Our cross-time decoding analysis showed that both representations evolve differently across time. We took this as an indication that these representations are not identical. To rigorously estimate the extent of representational overlap, we implemented a cross-variable analysis, training the algorithm on one variable and testing it on the other. We computed cross-variable information in both cue orders and in mixed cue orders where the relevant cue was either first or second. Again, we compared the observed cross-information to the cross-information expected under the assumption of identical representations (Fig. 7).
Fig. 7.
Cross-variable information of content and production. Cross-information in both orders and mixed orders with the relevant cue first or second. (A) Cross-information in trials with content instructed first. (B) Cross-information across trials with relevant cue first. (C) Cross-information across trials with relevant cue second. (D) Cross-information in trials with production instructed first. Solid lines indicate cross-information, and dashed lines indicate the respective expected values for identical representations. Bar plots show the averaged cross-information and the averaged expected values. Asterisks indicate significant cross-information (n = 24, P < 0.05 corrected; t test, one tailed), and horizontal bars with asterisks on top indicate significantly smaller cross-information than the expected value (n = 24, P < 0.05 corrected; paired t test, one tailed).
We found significant cross-information only in delay 1 in the order with the relevant cue first (P = 0.046; corrected). In the other orders, as information about both variables could only be present after the second cue, no cross-information was expected in delay 1. While we observed a small amount of cross-information in delay 2 of the mixed orders, this was not significant. On the other hand, cross-information was significantly smaller than its expected value in the second delay of all orders (content first: P = 0.004; both first: P = 0.02; both second: P = 0.01; production first: P = 0.002; corrected). Taken together, these results show that content and production representations overlap but are not identical. In our data, isolated content representations (before knowledge of the production type) cannot be distinguished from isolated production representations (before knowledge of the content). Thus, content and production representations are indistinguishable as long as the respective other variable is unknown to the participant. However, as soon as both aspects become available, their representations show strong differences.

Information about the Components of Vocalization Remains Stable across Sessions.

Electromagnetic artifacts and other, usually visually driven confounders often pollute the data in speech studies (35). Our source-level analysis provided evidence that content and production information were indeed speech related, originating from well-known speech-associated areas. In addition, we implemented a control analysis to test whether a possible visual cue confound had an impact on our results. Because the order of trial blocks was reversed in each participant’s second recording session, it was possible to cancel out the visual cue effect by decoding across sessions. To this end, we trained the cvMANOVA on one session and tested it on the other (Fig. 8).
Fig. 8.
Cross-session decoding of content and production. Information was averaged for both orders and delays. Asterisks indicate significant information (n = 24, P < 0.05 corrected; t test, one tailed).
Decoding across sessions was expected to be more challenging than decoding within a session, as the signal-to-noise ratio was impacted by additional variability due to head movement between the sessions. Nevertheless, we found significant information about both content and production in all relevant delays (Fig. 8). There was significant content information in both delays when it was instructed first (pdel1 = 0.004, pdel2 = 5.1 × 10−5; corrected) and in delay 2 when it was instructed second (P = 0.006; corrected). Production information was significant in delay 2 when content was instructed first (P = 3.4 × 10−8; corrected) and in both delays when production was instructed first (pdel1 = 4.8 × 10−5, pdel2 = 3.4 × 10−6; corrected). Thus, content and production information were stable across recording sessions and therefore also not driven by a visual, cue-related confound due to the sequential order of rule blocks within each session.

Discussion

Our results shed light on the neural mechanisms underlying the flexible mapping between the content and motor production of human speech. Combining MEG, MVPA, and a factorial task design allowed us to dissociate content and production in the pre-execution phase, where key processing stages take place (7, 36–38) and electromagnetic artifacts of the motor production itself are ruled out (35).
Significant content information directly after the first cue suggests that content can be represented independently of a specific motor plan, which falls in line with implications from previous studies on the lexical and the sublexical syllable level (11–17, 19–24, 30). Therefore, the actual motor production is not necessary to form a neural representation of content even on the phoneme level. However, this does not imply that the content representation is completely independent of motor planning.
Content information was higher in vocalized trials, and production information was higher for the vowel /u/, implying a dependence of information strength on the degree of motor involvement. As previous studies have shown, overt and covert speech differ in terms of executive motor control, including M1 recruitment for the efference copy (15, 18, 23), which could account for the higher content information in vocalized trials. In addition, the stronger phonological code retrieval and encoding in overt speech could also contribute to the observed difference of content information between the production types (18). The two vowels also differ in motor involvement, as /ə/ is a nonarticulated innate-like vowel, whereas /u/ is strongly articulated and learned (39). Therefore, the stronger motor involvement could account for higher production information in the vowel /u/.
We located representations of content and production in the frontal and central cortex consistent with well-known speech-associated areas (2–7). The extension to temporal cortices may be due to efference copies in sensory regions (20–23, 29). Although one may not expect the higher-order language network to be recruited in our paradigm (33), we found neural representations of both content and production to be stronger in the left hemisphere. This could either indicate that language capacities beyond low-level speech were recruited or that low-level speech processes can already be lateralized under specific circumstances. Independently of these alternatives, our finding of lateralization in an early pre-execution phase suggests that the uncovered neural representations were indeed speech or language specific and did not reflect general working memory processes.
Our cross-decoding approach provided insights into the temporal dynamics and similarities between the neural representations of vocalization components. Our results suggest an overlap of neural representations of content and production during the first delay, when information about only one of the components was available to the subjects. One possible explanation for such an overlap could be an effort effect where conditions with a higher degree of motor involvement elicit higher neural activity than those with a lesser degree. Concretely, |u vs. ə| and |vocalized vs. imagined| could both correspond to contrasts of |high effort vs. low effort|. This effect could reflect priming motor signals preceding vocalizations (7, 40) and falls in line with studies finding stronger motor-related activations in overt than covert speech (15, 18). Alternatively, it could also reflect the firing patterns of one or more speech-specific neural populations that encode several content- and production-related features. Here, effort could drive either the firing rates of individual neurons or the number of recruited cells. Further invasive research is required to determine whether the same population or spatially close and therefore indistinguishable neurons are modulated by both content- and production-related effort.
In the second delay, cross-information between content and production was similar to that in the first delay. However, as both content and production information were much higher, the representations were now clearly distinguishable. Moreover, cross-temporal decoding between both delays revealed that the content representation remained stable over time, whereas the production representation transformed once the content was known. Consequently, the divergence of the representations in the second delay was likely driven by the transformation of production. Taken together, this implies that the production representation during the second delay was a combination of the initial format during the first delay and an additional component. This additional component, building up once the content was known, may reflect the specific motor program used to prepare the articulation of the respective vowel, which complies with a description of distinct phoneme representations in the SMA (28) and syllable representations in the SMA, PMC, and M1 (2) in overt speech. In sum, our results show that, when isolated, both representations overlap and correlate with the degree of motor involvement. This overlap of the representations could be caused by premotor-like activity that is modulated by effort but independent of the actual execution. While the content representation remains stable, the production representation changes once the content is known, likely reflecting the addition of a specific motor program.
While natural speech can functionally be decomposed into content and motor production, there is little evidence for a content dimension on the neural level and even less is known about its dynamic interplay with motor production. Most implications for a neural content dimension come from studies focusing on the lexical or the sublexical syllable level (11–17, 19–24). Temporal dynamics have so far only been studied on the lexical level (36–38). Yet, the elementary building blocks of speech are phonemes, and to our knowledge, all previous research on this level is related to motor production (25–29). Our results uncover a neural content dimension for phonemes that was present independently of motor production and could therefore allow for a generalization between production forms. These results accord well with and expand the larger body of work on the higher sublexical and lexical levels.
Our findings set the stage for future research to investigate how neural codes of isolated phonemes and their motor production translate to those embedded in speech. For this, our combined approach of MEG, MVPA, and a rule-based paradigm provides a fruitful framework, thus opening a window for noninvasive speech research in health and disease.

Methods

Subjects.

Twenty-four healthy humans with normal or corrected-to-normal vision participated in the study (14 male; 21 right handed; mean age: 29 y; 5 y SD). All participants gave written informed consent before participation and received a monetary reward afterward. The study was conducted in accordance with the Declaration of Helsinki and approved by the ethics committee of the University of Tübingen.

Behavioral Task and Stimuli.

Participants performed a rule-based vocalization task. In each trial, one of two vowels (/u/ or /ə/) had to be either overtly or covertly vocalized. Vowel and production type were instructed sequentially by visual cues. The corresponding rule, i.e., the assignment of the visual cues to their instructed content, changed across recording blocks and was indicated before the beginning of each block.
Participants self-paced the trials using closed-loop eye movement control. Each trial started with an initiation phase of 1,000 ms, during which a white fixation spot (diameter: 0.1° of the visual angle) appeared at the center of the screen. Once fixation was acquired, the first visual cue (a forward or backward white slash, length: 2°, width: 0.25° of the visual angle) appeared for 100 ms, instructing either content or production. Then, the fixation spot appeared again for a delay period of 2,000 ms. The second visual cue appeared for 100 ms, instructing the respective missing variable, followed by a second delay period displaying the fixation spot. Dimming the spot for 100 ms served as the go cue for the participants’ response. After the go cue, the fixation spot remained on screen for 1,500 ms, which gave time for the response of the participants. The intertrial interval was 1,000 ms long, indicated by dimming of the fixation spot. If fixation was broken after the onset of the first visual cue, the trial was aborted, which was indicated by a color change of the fixation spot to red for 500 ms. The cue configuration of the aborted trial was repeated at a random position later within this block.
Within each recording block, the order of instruction and the meaning of each instruction cue were fixed. However, in half of the blocks, the content was instructed first, whereas in the other half of the blocks, the production was instructed first. Moreover, the assignment of each visual cue to its meaning (content or production) was different in half of the blocks for each order. These four different rules led to four blocks of trials. Each block contained 80 trials, with 20 per condition (1: /u/ vocalized, 2: /u/ imagined, 3: /ə/ vocalized, and 4: /ə/ imagined). The order of the conditions was randomized per block. In total, there were 16 different conditions, including the different assignments of the visual cues to their instructed content. The 24 possible orders of the four blocks were randomly assigned to the participants.
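To make the block structure concrete, the following MATLAB sketch (hypothetical variable names, not the authors' code) generates one balanced, randomized block of 80 trials and one random assignment of the four rule blocks to a participant:

```matlab
% Sketch: one block of 80 trials, 20 per condition, in randomized order.
% Condition codes: 1 = /u/ vocalized, 2 = /u/ imagined, 3 = /ə/ vocalized, 4 = /ə/ imagined.
nPerCondition = 20;
conditionList = repmat(1:4, 1, nPerCondition);                    % 80 balanced condition labels
trialSequence = conditionList(randperm(numel(conditionList)));    % shuffled order within the block

% One of the 24 possible orders of the four rule blocks, randomly assigned per participant.
allBlockOrders = perms(1:4);                                      % 24 x 4 matrix of block orders
blockOrder     = allBlockOrders(randi(size(allBlockOrders, 1)), :);
```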
Before the experiment, participants were instructed on how to articulate the vowels correctly. In doing so, we made sure that /u/ was clearly articulated, whereas /ə/ was produced with minimal involvement of the vocal tract. We also made sure that there was no involuntary articulatory movement visible for imagined vowels. During the experiment, the participants memorized the respective rule before each block. The rule was presented on the screen for as long as necessary. Eight training trials preceded each block to ensure correct performance. If necessary, the rule was shown again during the block while the sequence of trials was paused. All participants performed two MEG sessions with 320 trials each. For each participant, the order of blocks from the first session was reversed for the second session. During both sessions, the performance of the participants was monitored with a microphone and a camera. After the experiments, each trial was labeled for production type and, in the case of vocalized trials, for the vowel.

Data Acquisition.

We recorded MEG (Omega 2000, CTF Systems Inc.) with 273 sensors at a sampling rate of 2,342.75 Hz in a magnetically shielded chamber. Participants sat upright with a screen at a 65-cm viewing distance. Stimuli were projected onto the screen by an LCD projector (Sanyo PLC-XP41, Moriguchi, Japan) with a refresh rate of 60 Hz. The projection was the only source of light in the chamber. Continuous head movement was monitored with three coils attached to fiducial points. In two participants, head movement could not be measured due to technical issues. Eye movements were recorded using an infrared eye tracker (EyeLink CL-OC, SR Research Ltd.) at a sampling rate of 1,000 Hz. For labeling of trials and vocal onset detection in vocalized trials, the participants’ responses were recorded with a microphone integrated into the MEG system, at a sampling rate of 2,343.75 Hz. An additional microphone (MD 419, Sennheiser electronic GmbH & Co KG) with a sampling rate of 44.1 kHz was used to record a higher-quality audio trace for the labeling of production types and vowels. In a separate session, we acquired structural T1-weighted MRIs (3 Tesla MAGNETOM, Siemens Healthcare GmbH) for source reconstruction based on each participant’s individual anatomy (resolution: 1 mm3, MPRAGE).

Data Preprocessing.

Technically caused channel jumps were detected and corrected, and time lags between digital triggers and actual stimulus presentation were corrected based on a photodiode signal. For one subject, we excluded a noisy channel from the analysis. We low-pass filtered the MEG data at 30 Hz (sixth order, zero-phase Butterworth infinite impulse response (IIR) filter), downsampled the data to 300 Hz, and low-pass filtered them again at 10 Hz. Each trial was baseline corrected using the 500 ms preceding the onset of the first visual cue. To reduce electromagnetic artifacts, we ran independent component analysis (ICA) on the data. To ensure convergence of the ICA algorithm, the data were high-pass filtered at 0.05 Hz. However, we applied the resulting unmixing matrix to the original data without any high-pass filter and removed artifact components like heartbeats and other small muscle activities. Vocal-onset detection in vocalized trials was possible for 37 sessions. Audio traces were missing due to technical issues in the remaining sessions. However, performance was closely monitored during the recordings and immediately corrected if participants vocalized before the go cue appeared. The available audio traces were smoothed with a median-based sliding window (window size 42.66 ms). A participant-wise threshold served for onset detection. For that, the root mean square (rms) of a 1,000-ms period of one of the imagined trials was calculated. The respective SD was multiplied by eight and added to the rms. In case of low vocalization amplitudes due to very soft voices of a few participants, the threshold was manually adjusted by slightly decreasing it until the onsets could reliably be detected. The vocalization was correctly performed, with an onset after the go cue, in 99% of the trials.
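As an illustration of the onset-detection step, here is a minimal MATLAB sketch under assumed variable names (audioTrace, imaginedTrialAudio) and an assumed rectification of the audio trace before smoothing; it is not the authors' implementation:

```matlab
% Sketch of threshold-based vocal onset detection (assumptions: variable names,
% rectification of the trace before smoothing).
fsAudio  = 2343.75;                                  % sampling rate of the integrated microphone (Hz)
winSamp  = round(0.04266 * fsAudio);                 % ~42.66-ms median smoothing window
smoothed = movmedian(abs(audioTrace), winSamp);      % median-based sliding-window smoothing

% Participant-wise threshold: rms of a 1,000-ms stretch of an imagined trial plus 8 SDs.
baseline  = imaginedTrialAudio(1:round(1.0 * fsAudio));
threshold = sqrt(mean(baseline.^2)) + 8 * std(baseline);

onsetSample = find(smoothed > threshold, 1, 'first');  % first supra-threshold sample
onsetTime   = (onsetSample - 1) / fsAudio;             % vocal onset in seconds from trace start
```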
For all subsequent analyses, only correct trials were used. Those were trials with the correct production type and, in case of vocalized trials, with the correct vowel and a vocal onset after the go cue. For the nine sessions with a missing audio trace, only the correct vowel was considered.

Cross-Validated MANOVA.

We estimated the amount of neural information about the variables of interest in the MEG data with cross-validated MANOVA (31, 32). As an extension of the commonly used cross-validated Mahalanobis distance, cvMANOVA allows for the simultaneous quantification of the variability in neural data due to several variables of interest. We performed 20 repetitions of cvMANOVA for each session with fivefold cross-validation. All folds and repetitions were subsequently averaged. We first estimated a noise covariance matrix using trials from all conditions. Next, we estimated contrasts of beta weights of each condition in a cross-validation fold’s training set, which accounted for the “training” of the model. The “testing” was done by estimating contrasts of beta weights in the respective fold’s test set. The dot product of these contrasts, normalized by the noise covariance, served as an estimate of the true pattern distinctness:
D = \mathrm{trace}\left(\frac{1}{n}\, B_{\mathrm{train}}^{\top} C_{\mathrm{train}} \left(C_{\mathrm{train}}^{\top} C_{\mathrm{train}}\right)^{-1} C_{\mathrm{train}}^{\top}\, X_{\mathrm{test}}^{\top} X_{\mathrm{test}}\, C_{\mathrm{test}} \left(C_{\mathrm{test}}^{\top} C_{\mathrm{test}}\right)^{-1} C_{\mathrm{test}}^{\top}\, B_{\mathrm{test}}\, \Sigma^{-1}\right),
where Σ−1 is the inverted noise covariance matrix, Ctrain is the contrast vector the model is trained on, Ctest is the test contrast vector, and Xtest is the design matrix indicating the unique condition of each trial in the test set. Btrain and Btest contain the regression parameters of a multivariate general linear model:
B_{\mathrm{train}} = \left(X_{\mathrm{train}}^{\top} X_{\mathrm{train}}\right)^{-1} X_{\mathrm{train}}^{\top} Y_{\mathrm{train}},
B_{\mathrm{test}} = \left(X_{\mathrm{test}}^{\top} X_{\mathrm{test}}\right)^{-1} X_{\mathrm{test}}^{\top} Y_{\mathrm{test}},
where Ytrain and Ytest are the training and test data sets. The inverted noise covariance matrix was estimated with the mean of the time window from cue 1 offset to go cue onset:
B_{\mathrm{train}}^{tw} = \left(X_{\mathrm{train}}^{\top} X_{\mathrm{train}}\right)^{-1} X_{\mathrm{train}}^{\top} Y_{\mathrm{train}}^{tw},
\Xi = Y_{\mathrm{train}}^{tw} - X_{\mathrm{train}} B_{\mathrm{train}}^{tw},
\Sigma^{-1} = \left(f_{E} - p - 1\right) \cdot \left(\Xi^{\top} \Xi\right)^{-1},
with the superscript tw denoting data averaged within this time window, fE as the degrees of freedom, and p as the number of sources.
Technically, cvMANOVA is a multivariate information-based cross-validated encoding approach. However, it shares many similarities with common multivariate decoding methods (41). The measure of neural information about the variables of interest can, theoretically, also be used to decode these variables on individual trials. Therefore, we refer to our results as decoding results.
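To illustrate the estimator, the following MATLAB sketch computes D for a single cross-validation fold under assumed variable shapes (Ytrain/Ytest: trials × sensors, Xtrain/Xtest: trials × conditions, Ctrain/Ctest: conditions × 1) and an assumed normalization by the number of test trials; it is a minimal sketch, not the authors' implementation. In the actual analysis, this computation would be repeated over folds and repetitions and averaged.

```matlab
% Sketch of cross-validated pattern distinctness for one fold (assumed shapes and names).
Btrain = pinv(Xtrain) * Ytrain;                 % regression weights, training set (conditions x sensors)
Btest  = pinv(Xtest)  * Ytest;                  % regression weights, test set

% Noise covariance from training-set residuals (here on time-window-averaged data).
Xi       = Ytrain - Xtrain * Btrain;            % residuals
fE       = size(Ytrain, 1) - size(Xtrain, 2);   % degrees of freedom
p        = size(Ytrain, 2);                     % number of sensors/sources
SigmaInv = (fE - p - 1) * pinv(Xi' * Xi);       % inverted noise covariance (assumes fE > p + 1)

% Project betas onto the train/test contrasts and take the covariance-normalized dot product.
projTrain = Btrain' * Ctrain * pinv(Ctrain' * Ctrain) * Ctrain';   % sensors x conditions
projTest  = Ctest * pinv(Ctest' * Ctest) * Ctest' * Btest;         % conditions x sensors
nTest     = size(Ytest, 1);                                        % number of test trials (assumed n)
D         = trace(projTrain * (Xtest' * Xtest) * projTest * SigmaInv) / nTest;
```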
For all sensor-level analyses, a subset of 137 approximately equally distributed sensors was included. This was to ensure a sufficient number of trials in relation to the degrees of freedom of the dataset.

Cross-Decoding.

With cvMANOVA, we were able to decode across conditions by training and testing on different time points, variables, and levels of the variables. Therefore, the contrast vectors Ctrain and Ctest were constructed to only contain the respective conditions to be trained or tested on. With this approach, we decoded across variables by using content (/u/ vs. /ə/) for the construction of the training contrast Ctrain and production type (vocalized vs. imagined) for the construction of the test contrast Ctest. To estimate content information for both production types separately, we constructed Ctrain based on trials including both production types, but Ctest based on trials with only one production type, respectively. The same principle was applied for separately decoding the production type from both vowels. For decoding across time, we used the regression parameters Btrain from one time point and Btest from another. We applied this to all pairs of time points. For cross-session decoding, we implemented a twofold cross-validation such that trials from session 1 and 2 served as training and test sets alternately.
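As an illustration of how such contrasts could be set up, consider the following MATLAB sketch with a hypothetical ordering of the four conditions (/u/ vocalized, /u/ imagined, /ə/ vocalized, /ə/ imagined); the actual contrast coding in the analysis code may differ:

```matlab
% Sketch: contrast vectors over the four conditions (hypothetical ordering, see lead-in).
Ccontent    = [ 1;  1; -1; -1];    % /u/ vs. /ə/, pooled over production types
Cproduction = [ 1; -1;  1; -1];    % vocalized vs. imagined, pooled over vowels

% Cross-variable decoding: train on the content contrast, test on the production contrast.
Ctrain = Ccontent;
Ctest  = Cproduction;

% Content information restricted to vocalized trials: test contrast limited to those conditions.
CtestVocalized = [ 1;  0; -1;  0];
```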

Expected Cross-Decoding.

To estimate a benchmark for the overlap between representations, we computed the expected cross-information (32). As the maximally possible amount of shared information between contexts depends on the information available in each individual context, the strength of the shared representation must be compared to the strength of both representations. Therefore, we estimated the expected cross-decoding
E_{12} = \sqrt{\left|D_{1} \cdot D_{2}\right|} \cdot \mathrm{sign}\left(D_{1}\right) \cdot \mathrm{sign}\left(D_{2}\right),
where D1 and D2 denote the pattern distinctness in the two contexts. If the representations were identical, the cross-decoding D12 would approach E12. Consequently, cross-decoding values smaller than E12 indicate that the representations are not identical and, therefore, not fully overlapping.
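In code, this benchmark and its comparison to the observed cross-decoding amount to a few lines (MATLAB sketch with assumed variable names):

```matlab
% Sketch: expected cross-decoding under identical representations.
% D1, D2 : pattern distinctness in the two contexts; D12 : observed cross-decoding.
E12 = sqrt(abs(D1 * D2)) * sign(D1) * sign(D2);   % geometric-mean benchmark with sign handling

% Cross-decoding clearly below the benchmark indicates non-identical representations.
notIdentical = D12 < E12;
```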

Source Estimation.

We generated individual single-shell head models (42) based on each subject’s structural T1-weighted MRI. Using linear spatial filtering (43), we estimated three-dimensional MEG source activity at 457 equally spaced locations ~7 mm beneath the skull. For searchlight analysis, we used the three dipole directions of each source and the respective immediate neighbors. The LI was computed by averaging the searchlight results for each hemisphere and subtracting right from left. Cortical areas were mapped according to the AAL atlas (44), and neural information was averaged for each area of the left hemisphere (SI Appendix, Fig. S4). For cross-session decoding, the three orientations were added, and a subset of 229 equally spaced sources was used for decoding.
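A minimal MATLAB sketch of the LI computation, assuming a vector of searchlight information values and a logical index of left-hemisphere sources (both hypothetical names):

```matlab
% Sketch: lateralization index from searchlight information values.
% infoSource : information at each source location; isLeft : true for left-hemisphere sources.
LI = mean(infoSource(isLeft)) - mean(infoSource(~isLeft));   % left minus right hemisphere
```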

Statistical Analysis.

Neural information was averaged within two different time windows (delay 1: from 250 ms after cue 1 offset to cue 2 onset; delay 2: from 250 ms after cue 2 offset to go cue onset). Cross-temporal information within delay 1 was averaged over all conditions in which the relevant cue was instructed first, as information could only be present in these conditions. To estimate cross-temporal generalization between delays, we averaged an off-diagonal time window over those conditions where the relevant cue was presented first, as well as those mixed-order conditions where the relevant cue was presented first in one condition but second in the other. To estimate cross-temporal information within delay 2, we averaged data from all cue orders. To test whether neural information and cross-information were larger than 0, we employed one-tailed one-sample t tests. One-tailed paired t tests were applied to test whether cross-time and cross-variable information was smaller than the expected cross-information. For the comparison of content information in both production types and vice versa, we used two-tailed paired t tests. All P-values were false discovery rate (FDR) corrected for the number of tested time intervals (45).
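For reference, the Benjamini–Hochberg adjustment over the tested time intervals can be sketched in a few lines of MATLAB (assumed input: a vector of uncorrected P-values, here pvals); this is a generic implementation, not the authors' code:

```matlab
% Sketch: Benjamini-Hochberg FDR adjustment of a vector of P-values.
m            = numel(pvals);
[pSort, idx] = sort(pvals(:));                          % ascending P-values
pAdj         = pSort .* m ./ (1:m)';                    % raw BH-adjusted values
pAdj         = flipud(min(cummin(flipud(pAdj)), 1));    % enforce monotonicity, cap at 1
pCorrected      = zeros(m, 1);
pCorrected(idx) = pAdj;                                 % restore original order
significant     = pCorrected < 0.05;                    % FDR-corrected significance at alpha = 0.05
```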

Potential Cue Confound.

We identified the source of a potential confound: due to the blocked design, in combination with possible nonstationarities in the dataset, content and production information could theoretically be influenced by representations of the physical cue itself, even though the cue appearance and its meaning were counterbalanced. Briefly, if noise led to independent shifts of the neural activity patterns of both cue options, this could mistakenly be identified as information about the variable indicated by the cue. To make sure that our results were not dependent on this potential confound, we took two measures. First, we excluded the 250 ms after cue onset in both time windows used for statistical analysis, as this was the time period that would likely be affected. Second, we performed cross-session decoding to confirm that content and production information were present if the possible cue confound was accounted for.

Visualization.

For all line plots, data were smoothed with a 100-ms Hanning window (full width at half maximum).
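For completeness, a MATLAB sketch of the smoothing step, assuming a 300-Hz sampled information time course (hypothetical name infoTimecourse) and, for simplicity, using 100 ms as the window length rather than the full width at half maximum:

```matlab
% Sketch: smoothing a time course with a ~100-ms Hanning window (assumed 300-Hz sampling).
fs  = 300;                                            % sampling rate after downsampling (Hz)
N   = round(0.1 * fs);                                % window length in samples
win = 0.5 * (1 - cos(2 * pi * (0:N-1)' / (N-1)));     % Hanning window
win = win / sum(win);                                 % normalize to unit sum
smoothedInfo = conv(infoTimecourse, win, 'same');     % smoothed information time course
```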

Software.

All analyses were performed using the Fieldtrip toolbox (46) and custom code in MATLAB.

Data, Materials, and Software Availability

Preprocessed MEG data and analysis code to reproduce all reported results are publicly available at https://osf.io/5c43h/ (47).

Acknowledgments

We thank Gabi Walker-Dietrich, Jürgen Dax, and Christoph Braun for their help with data acquisition. This research was supported by the European Research Council Starting Grant 335880 (M.S.) and Consolidator Grant 864491 (M.S.) and by the Evangelisches Studienwerk Villigst (V.A.V.).

Author contributions

V.A.V., F.S., D.J.H., S.R.H., and M.S. designed research; V.A.V. performed research; F.S., D.J.H., and M.S. contributed new reagents/analytic tools; V.A.V. and F.S. analyzed data; F.S. and M.S. supervised research; V.A.V. and M.S. acquired funding; and V.A.V., F.S., and M.S. wrote the paper.

Competing interests

The authors declare no competing interest.

Supporting Information

Appendix 01 (PDF)

References

1. M. S. Vitevitch, The influence of sublexical and lexical representations on the processing of spoken words in English. Clin. Linguist. Phon. 17, 487–499 (2003).
2. J. W. Bohland, F. H. Guenther, An fMRI investigation of syllable sequence production. NeuroImage 32, 821–841 (2006).
3. S. B. Eickhoff, S. Heim, K. Zilles, K. Amunts, A systems perspective on the effective connectivity of overt speech production. Philos. Trans. R. Soc. Math. Phys. Eng. Sci. 367, 2399–2421 (2009).
4. S. R. Hage, A. Nieder, Dual neural network model for the evolution of speech and language. Trends Neurosci. 39, 813–829 (2016).
5. U. Jürgens, Neural pathways underlying vocal control. Neurosci. Biobehav. Rev. 26, 235–258 (2002).
6. C. J. Price, A review and synthesis of the first 20 years of PET and fMRI studies of heard speech, spoken language and reading. NeuroImage 62, 816–847 (2012).
7. G. M. Schulz, M. Varga, K. Jeffires, C. L. Ludlow, A. R. Braun, Functional neuroanatomy of human vocalization: An H215O PET study. Cereb. Cortex 15, 1835–1847 (2005).
8. P. Sörös et al., Clustered functional MRI of overt speech production. NeuroImage 32, 376–387 (2006).
9. P. Indefrey, W. J. M. Levelt, The spatial and temporal signatures of word production components. Cognition 92, 101–144 (2004).
10. L. Pylkkänen, D. K. Bemis, E. Blanco Elorrieta, Building phrases in language production: An MEG study of simple composition. Cognition 133, 371–384 (2014).
11. S. Basho, E. D. Palmer, M. A. Rubio, B. Wulfeck, R.-A. Müller, Effects of generation mode in fMRI adaptations of semantic fluency: Paced production and overt speech. Neuropsychologia 45, 1697–1706 (2007).
12. M. Corley, P. H. Brocklehurst, H. S. Moat, Error biases in inner and overt speech: Evidence from tongue twisters. J. Exp. Psychol. Learn. Mem. Cogn. 37, 162–175 (2011).
13. J. Huang, T. H. Carr, Y. Cao, Comparing cortical activations for silent and overt speech using event-related fMRI. Hum. Brain Mapp. 15, 39–53 (2002).
14. P. K. McGuire et al., Functional anatomy of inner speech and auditory verbal imagery. Psychol. Med. 26, 29–38 (1996).
15. E. D. Palmer et al., An event-related fMRI study of overt and covert word stem completion. NeuroImage 14, 182–193 (2001).
16. S. Partovi et al., Effects of covert and overt paradigms in clinical language fMRI. Acad. Radiol. 19, 518–525 (2012).
17. S. S. Shergill et al., A functional study of auditory verbal imagery. Psychol. Med. 31, 241–253 (2001).
18. F. Stephan, H. Saalbach, S. Rossi, The brain differentially prepares inner and overt speech production: Electrophysiological and vascular evidence. Brain Sci. 10, 148 (2020).
19. P. C. Trettenbrein, G. Papitto, A. D. Friederici, E. Zaccarella, Functional neuroanatomy of language without speech: An ALE meta-analysis of sign language. Hum. Brain Mapp. 42, 699–712 (2021).
20. J. F. Houde, S. S. Nagarajan, K. Sekihara, M. M. Merzenich, Modulation of the auditory cortex during speech: An MEG study. J. Cogn. Neurosci. 14, 1125–1138 (2002).
21. S. S. Shergill et al., Modulation of activity in temporal cortex during generation of inner speech. Hum. Brain Mapp. 16, 219–227 (2002).
22. X. Tian, Mental imagery of speech and movement implicates the dynamics of internal forward models. Front. Psychol. 1, 166 (2010).
23. T. J. Whitford et al., Neurophysiological evidence of efference copies to inner speech. eLife 6, e28197 (2017).
24. W. Zhang, Y. Liu, X. Wang, X. Tian, The dynamic and task-dependent representational transformation between the motor and sensory systems during speech production. Cogn. Neurosci. 11, 194–204 (2020).
25. D. F. Conant, K. E. Bouchard, M. K. Leonard, E. F. Chang, Human sensorimotor cortex control of directly measured vocal tract movements during vowel production. J. Neurosci. 38, 2955–2966 (2018).
26. J. M. Correia, C. Caballero-Gaudes, S. Guediche, M. Carreiras, Phonatory and articulatory representations of speech production in cortical and subcortical fMRI responses. Sci. Rep. 10, 4529 (2020).
27. M. Papoutsi et al., From phonemes to articulatory codes: An fMRI study of the role of Broca’s area in speech production. Cereb. Cortex 19, 2156–2165 (2009).
28. M. G. Peeva et al., Distinct representations of phonemes, syllables, and supra-syllabic sequences in the speech production network. NeuroImage 50, 626–638 (2010).
29. T. H. Heinks-Maldonado, D. H. Mathalon, M. Gray, J. M. Ford, Fine-tuning of auditory cortex during speech production. Psychophysiology 42, 180–190 (2005).
30. G. M. Oppenheim, G. S. Dell, Motor movement matters: The flexible abstractness of inner speech. Mem. Cognit. 38, 1147–1160 (2010).
31. C. Allefeld, J.-D. Haynes, Searchlight-based multi-voxel pattern analysis of fMRI by cross-validated MANOVA. NeuroImage 89, 345–357 (2014).
32. F. Sandhaeger, N. Omejc, A.-A. Pape, M. Siegel, Abstract neuronal choice signals during embodied decisions. bioRxiv [Preprint] (2020). https://doi.org/10.1101/2020.10.02.323832 (Accessed 2 October 2020).
33. G. B. Cogan et al., Sensory–motor transformations for speech occur bilaterally. Nature 507, 94–98 (2014).
34. J.-R. King, S. Dehaene, Characterizing the dynamics of mental representations: The temporal generalization method. Trends Cogn. Sci. 18, 203–210 (2014).
35. A. Ewald, S. Aristei, G. Nolte, R. A. Rahman, Brain oscillations and functional connectivity during overt language production. Front. Psychol. 3, 166 (2012).
36. A. Flinker et al., Redefining the role of Broca’s area in speech. Proc. Natl. Acad. Sci. U.S.A. 112, 2871–2875 (2015).
37. M. A. Long et al., Functional segregation of cortical regions underlying speech timing and articulation. Neuron 89, 1187–1193 (2016).
38. R. Salmelin, A. Schnitzler, F. Schmitz, H.-J. Freund, Single word reading in developmental stutterers and fluent speakers. Brain 123, 1184–1202 (2000).
39. O. C. Irwin, Infant speech: Development of vowel sounds. J. Speech Hear. Disord. 13, 31–34 (1948).
40. N. Gavrilov, S. R. Hage, A. Nieder, Functional specialization of the primate frontal lobe during cognitive control of vocalizations. Cell Rep. 21, 2393–2406 (2017).
41. M. N. Hebart, C. I. Baker, Deconstructing multivariate decoding for the study of brain function. NeuroImage 180, 4–18 (2018).
42. G. Nolte, The magnetic lead field theorem in the quasi-static approximation and its use for magnetoencephalography forward calculation in realistic volume conductors. Phys. Med. Biol. 48, 3637–3652 (2003).
43. B. D. Van Veen, W. Van Drongelen, M. Yuchtman, A. Suzuki, Localization of brain electrical activity via linearly constrained minimum variance spatial filtering. IEEE Trans. Biomed. Eng. 44, 867–880 (1997).
44. N. Tzourio-Mazoyer et al., Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage 15, 273–289 (2002).
45. Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 57, 289–300 (1995).
46. R. Oostenveld, P. Fries, E. Maris, J.-M. Schoffelen, FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput. Intell. Neurosci. 2011, 1–9 (2011).
47. V. A. Voigtlaender, F. Sandhaeger, D. J. Hawellek, M. Siegel, Neural representations of the content and production of human vocalization. OSF. https://osf.io/5c43h/. Deposited 12 May 2023.

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences
Vol. 120 | No. 23
June 6, 2023
PubMed: 37253014

Submission history

Received: November 11, 2022
Accepted: May 10, 2023
Published online: May 30, 2023
Published in issue: June 6, 2023

Keywords

human speech, vocalization, MEG, MVPA, neural information

Notes

This article is a PNAS Direct Submission.

Authors

Affiliations

Department of Neural Dynamics and Magnetoencephalography, Hertie Institute for Clinical Brain Research, University of Tübingen, 72076 Tübingen, Germany
Centre for Integrative Neuroscience, University of Tübingen, 72076 Tübingen, Germany
Magnetoencephalography (MEG) Center, University of Tübingen, 72076 Tübingen, Germany
Graduate Training Centre of Neuroscience, International Max Planck Research School, University of Tübingen, 72076 Tübingen, Germany
Department of Neural Dynamics and Magnetoencephalography, Hertie Institute for Clinical Brain Research, University of Tübingen, 72076 Tübingen, Germany
Centre for Integrative Neuroscience, University of Tübingen, 72076 Tübingen, Germany
Magnetoencephalography (MEG) Center, University of Tübingen, 72076 Tübingen, Germany
Graduate Training Centre of Neuroscience, International Max Planck Research School, University of Tübingen, 72076 Tübingen, Germany
David J. Hawellek
Department of Neural Dynamics and Magnetoencephalography, Hertie Institute for Clinical Brain Research, University of Tübingen, 72076 Tübingen, Germany
Centre for Integrative Neuroscience, University of Tübingen, 72076 Tübingen, Germany
Magnetoencephalography (MEG) Center, University of Tübingen, 72076 Tübingen, Germany
F. Hoffmann-La Roche, Pharmaceutical Research and Early Development, Roche Innovation Center Basel, 4051 Basel, Switzerland
Centre for Integrative Neuroscience, University of Tübingen, 72076 Tübingen, Germany
Neurobiology of Social Communication, Department of Otolaryngology - Head and Neck Surgery, Hearing Research Centre, University of Tübingen, 72076 Tübingen, Germany
Markus Siegel1 [email protected]
Department of Neural Dynamics and Magnetoencephalography, Hertie Institute for Clinical Brain Research, University of Tübingen, 72076 Tübingen, Germany
Centre for Integrative Neuroscience, University of Tübingen, 72076 Tübingen, Germany
Magnetoencephalography (MEG) Center, University of Tübingen, 72076 Tübingen, Germany

Notes

1
To whom correspondence may be addressed. Email: [email protected] or [email protected].
