Cognitive control of orofacial motor and vocal responses in the ventrolateral and dorsomedial human frontal cortex

Significance
Across primates, a set of ventrolateral frontal (VLF) and dorsomedial frontal (DMF) brain areas are critical for voluntary vocalizations. Determining their individual roles in vocal control and how they might have changed is crucial to understanding how the complex vocal control in human speech emerged during primate brain evolution. The present work demonstrated key functional dissociations in Broca’s region of the VLF (i.e., between dorsal and ventral area 44, and area 45) and in the DMF (i.e., between the presupplementary motor area [pre-SMA] and the midcingulate cortex [MCC]) during the cognitive control of orofacial, nonspeech, and speech vocal responses.

The start of each run was synchronized to the fifth TTL pulse from the MRI scanner following scan initiation. Each run began with a fixation task followed by two motor mapping task blocks involving either hand and eye movements (visuo-manual session), speech and nonspeech vocal responses (visuo-vocal session), or mouth and tongue movements (visuo-orofacial session). The details of the fixation and motor mapping tasks are described in Loh et al. (1). After the three fixation/motor mapping task blocks (total duration = 28.5 s), subjects were presented with the six learning task blocks (3 x 2 feedback types) and six control task blocks (3 motor responses x 2 feedback types), which were randomly interleaved in each run. The total length of each run was limited to a maximum of 14.6 min, yielding 400 T2*-weighted gradient-echo echo-planar imaging (EPI) volumes (40 descending oblique slices, voxel resolution = 2.7 mm x 2.7 mm x 2.7 mm, TR = 2.2 s, TE = 30.0 ms, flip angle = 90°). The initial five volumes of each acquisition were discarded to avoid confounds of unsteady magnetization.
High-resolution T1 structural images (MPRAGE, 0.9 mm³  http://www.nitrc.org/projects/artifact_detect/), motion outliers were computed from the realigned images and realignment parameters. The realignment parameters and detected motion outliers were saved as covariates to model potential nonlinear head motion artefacts in subsequent statistical analyses. Slice-timing correction was applied with the time centre of the volume as reference. The subject-mean functional images were co-registered with the corresponding structural images using mutual information optimization. Functional and structural images were then spatially normalized into standard MNI space.
Experimental tasks. Two main experimental tasks were implemented in the current study: 1) a visuo-motor conditional learning task (Fig. 1A; Movie S3), and 2) a visuo-motor control task (Fig. 1B; Movies S1 and S2). The sequence of events was comparable for the two tasks. An instruction screen lasting 2 s informed the participant of the type of task to be performed: "Find the correct associations" for the conditional learning task (indicating that the response linked to the particular visual cue presented on each trial had to be selected) or "Select the X response" for the control task (indicating the particular response to be performed on each trial). Following the instructions, a black screen with a central fixation cross was presented during a jittered inter-trial interval of 0.5-8 s (mean = 3.5 s). One of three possible visual stimuli (abstract grayscale images) was then presented at the center of the screen for 2 s, during which the subject had to select and perform one of three possible motor responses (learning task) or perform an instructed motor response (control task).
Stimulus presentation was ordered as randomly permuted blocks of the three possible images (e.g., abc, bca, acb). After a jittered delay (0.5-6 s, mean = 2 s), nonspeech or speech vocal feedback (1 s) was provided to inform the subject whether the response selected and performed was the correct one for the presented image (during the learning task) or the correct instructed response (during the control task). After a jittered inter-trial interval (0.5-8 s, mean = 3.5 s), another trial started with the presentation of one of the three possible stimuli. In half of the task blocks, speech vocal feedback was provided (positive: "Correct"; negative: "Error"). For the remaining blocks, nonspeech vocal feedback was provided (positive: "Aha"; negative: "Boo"). The type of feedback that would follow each response was signaled by the text color of the instruction screen at the start of each task block (speech vocal: yellow; nonspeech vocal: red). The four different types of vocal feedback were recorded by the same male voice and processed using Audacity software.
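The permuted-block ordering described above can be sketched as follows. This is a minimal illustration only, not the presentation code used in the study: the function name `permuted_block_sequence`, the stimulus labels, the number of blocks, and the fixed seed are all assumptions introduced for the example.

```python
import random


def permuted_block_sequence(stimuli, n_blocks, rng):
    """Build a trial order from consecutive random permutations of `stimuli`.

    Every run of len(stimuli) trials contains each stimulus exactly once,
    which is what "randomly permuted blocks" (abc, bca, acb, ...) implies.
    """
    order = []
    for _ in range(n_blocks):
        block = list(stimuli)
        rng.shuffle(block)  # independent permutation for each block
        order.extend(block)
    return order


# Hypothetical labels and seed, chosen only to make the sketch reproducible.
rng = random.Random(0)
sequence = permuted_block_sequence(["a", "b", "c"], n_blocks=4, rng=rng)
print(sequence)
```

A consequence of this ordering is that the same image can repeat at most twice in a row (at a block boundary), and each image is shown equally often.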
In each visuo-motor conditional learning task block (Fig. 1A; Movie S3), the subjects had to acquire three visuo-motor conditional associations (if visual stimulus A, then motor response X; if visual stimulus B, then motor response Y; and if visual stimulus C, then motor response Z) via trial and error.
On every trial, one of the three possible stimuli was presented and the subject had to select one of the three possible responses to perform (Response Selection). Auditory vocal feedback was then provided to inform the subject whether the selected response was the correct one for the presented stimulus.
This trial-and-error learning period continued until the subject selected and performed the correct response for each stimulus. When a correct response had been performed once for each of the three visual stimuli, the task proceeded to the "post-learning" period during which the subject had to repeat, in response to the appropriate cues, each of the learnt conditional associations twice (6 trials). Note that in each visuo-motor conditional learning task block, a novel set of 3 stimuli was presented and, therefore, the subject had to learn the new stimulus-response relations.
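The structure of a learning block (trial-and-error until each stimulus has been answered correctly once, followed by six post-learning trials) can be sketched with a simulated participant. This is an illustrative sketch under stated assumptions, not the task code: the learner below is hypothetical (it never repeats a response already signaled as wrong for a stimulus and always repeats a confirmed one), the stimuli are assumed to arrive in permuted blocks, and post-learning performance is assumed to be error-free.

```python
import random

STIMULI = ["A", "B", "C"]
RESPONSES = ["X", "Y", "Z"]


def simulate_learning_block(correct, rng, n_post_repeats=2):
    """Simulate one block; return (learning_trials, post_learning_trials).

    Each trial is a tuple (stimulus, response, was_correct).
    """
    ruled_out = {s: set() for s in STIMULI}  # responses known wrong per stimulus
    learned = {}                             # stimulus -> confirmed response
    learning_trials = []

    # Learning period: ends once every stimulus has one correct response.
    while len(learned) < len(STIMULI):
        block = STIMULI[:]
        rng.shuffle(block)
        for stim in block:
            if len(learned) == len(STIMULI):
                break
            if stim in learned:
                response = learned[stim]
            else:
                options = [r for r in RESPONSES if r not in ruled_out[stim]]
                response = rng.choice(options)
            was_correct = response == correct[stim]
            if was_correct:
                learned[stim] = response
            else:
                ruled_out[stim].add(response)
            learning_trials.append((stim, response, was_correct))

    # Post-learning period: each learnt association is repeated twice (6 trials).
    post_learning_trials = []
    for _ in range(n_post_repeats):
        block = STIMULI[:]
        rng.shuffle(block)
        post_learning_trials.extend((s, learned[s], True) for s in block)
    return learning_trials, post_learning_trials


rng = random.Random(1)  # arbitrary seed for reproducibility
mapping = {"A": "X", "B": "Y", "C": "Z"}  # hypothetical correct associations
learning, post = simulate_learning_block(mapping, rng)
print(len(learning), len(post))
```

Under these assumptions the learning period lasts between three and nine trials, and its final trial is always a correct one, since that trial is what completes the set of three associations.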
In each visuo-motor control task block (Fig. 1C; Movies S1 and S2), the subjects were informed of the specific response to perform on the instruction screen. Subsequently, they had to perform that particular instructed response to all of the three different visual stimuli over five trials. Thus, in contrast to the learning task, during the control task the subjects did not have to select an appropriate motor response based on learning the arbitrary stimulus-to-response relations, or to adjust their selections based on the feedback provided. Different sets of abstract visual images were used in all learning and control task blocks. Note that in each control task block, a novel set of three stimuli was used.

Supplementary Results
Behavioral task performance. Three main performance measures were used from the learning and  Fig. S1A) and feedback type (χ² = 0.05, df = 1, p = 0.83, Fig. S1A) on completion rates. This finding indicated that subjects performed equally well in both the control and learning tasks regardless of the response type involved and the type of feedback provided.
As expected, we observed a significant effect of task type (χ² = 82.2, df = 1, p < 2 × 10⁻¹⁶, Fig. S1A): completion rates were higher in control than in learning blocks. As a further indication that response type had no influence on conditional-associative learning performance, we found that the mean learning period (number of trials taken to acquire the conditional associations in correctly performed blocks) did not differ across response types (χ² = 3.05, df = 2, p = 0.218, Fig. S1B).
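The relation between the reported χ² statistics, their degrees of freedom, and the p-values can be checked with a short calculation. The sketch below assumes that each reported p-value is the upper-tail probability of a χ² distribution with the stated df; the helper `chi2_sf` is a name introduced here and uses the closed forms that exist for df = 1 and df = 2 only.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability P(X >= stat) for X ~ chi-square(df), df in {1, 2}."""
    if df == 1:
        # A chi-square(1) variable is the square of a standard normal.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # A chi-square(2) variable is exponential with mean 2.
        return math.exp(-stat / 2.0)
    raise ValueError("closed forms implemented only for df = 1 and df = 2")


# Statistics quoted in this section (rounded as printed in the text).
print(chi2_sf(3.05, 2))   # learning period across response types
print(chi2_sf(0.05, 1))   # feedback type on completion rates
print(chi2_sf(82.2, 1))   # task type on completion rates
```

Because the printed statistics are rounded, the recomputed values agree with the reported p-values only approximately (for example, about 0.218 for χ² = 3.05 with df = 2).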
In terms of response selection RTs, regression analyses revealed no significant effect of the type of feedback provided (χ² = 2.43, df = 1, p = 0.12, Fig. S1C). There was a significant main effect of response type (χ² = 2.36 × 10³, df = 3, p < 2 × 10⁻¹⁶): subjects were fastest with manual responses, followed by orofacial responses and then nonspeech and speech vocal responses, reflecting the increasing motor complexity from finger presses to mouth movements and vocal productions. Note that RTs in the visuo-nonspeech and visuo-speech vocal conditional learning tasks were not significantly different (p = 1.00). As expected, there was a main effect of trial type on RTs (χ² = 1.18 × 10³, df = 2, p < 2 × 10⁻¹⁶): RTs were faster in control than learning trials, and in learning than post-learning trials. This result reflected the fact that response selection was qualitatively different among the three trial types. Finally, there was a significant interaction between response modality and trial type on RTs (χ² = 551, df = 6, p < 2 × 10⁻¹⁶). Post-hoc analyses (pairwise comparisons using the lsmeans package with Bonferroni correction for multiple comparisons) revealed that, for all modalities, control trial RTs were consistently faster than post-learning and learning RTs (p < 0.0001). This finding was expected since no cognitive selection was involved in control trials, as opposed to the learning and post-learning trials. Differentiating between the various modalities, learning RTs were significantly faster than post-learning RTs in the orofacial (p < 0.0001) and manual (p = 0.016) response conditions, but learning and post-learning RTs did not differ for nonspeech (p = 0.570) and speech (p = 0.872) vocal responses. This result indicated that different cognitive mechanisms may be involved during the learning and performance of manual and orofacial conditional associations versus vocal (speech and nonspeech) conditional associations.
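The Bonferroni correction applied to the pairwise comparisons can be illustrated with a short sketch. It is not a reproduction of the lsmeans analysis: the raw p-values below are hypothetical, and the number of comparisons in the family is taken simply as the length of the list.

```python
def bonferroni(p_values):
    """Bonferroni-adjust a family of p-values: multiply by the family size, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical unadjusted p-values for four pairwise comparisons.
raw = [0.0004, 0.02, 0.3, 0.6]
adjusted = bonferroni(raw)
print(adjusted)
```

The cap at 1 is why an adjusted value of exactly 1.00 can be reported for a comparison whose unadjusted p-value is large.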
Furthermore, the recruitment of the dorsal area 44, area 45 and pre-SMA during the learning of orofacial and vocal conditional associations is also left-lateralized (see Tables S1 and S3). These findings are generally consistent with the existing literature demonstrating left-dominant recruitment of the ventral prefrontal cortex and the pre-SMA during orofacial and verbal productions (8,9). Extending this body of work, the present results demonstrate that the acquisition and performance of basic visuo-orofacial and visuo-vocal conditional associations are also left-lateralized. It would be interesting to explore whether this left-lateralization of visuo-orofacial/vocal conditional associative learning and performance is also present in non-human primates.

The precentral gyrus of the insula is involved in the cognitive selection of manual, orofacial and vocal responses. An investigation by Dronkers
By contrast, activations associated with the processing of vocal feedback during learning in the ventral area 44, as well as the vocal feedback-driven conditional associative learning in the MCC, appeared to be bilateral (Table S3). These results echo previous findings that voice processing recruits the inferior frontal gyrus bilaterally (10), and that adaptive feedback processing recruits the MCC bilaterally (11).

Table S4. Increased activations in the medial frontal cortex during nonspeech and speech vocal feedback analysis in the visuo-manual, visuo-orofacial, visuo-nonspeech vocal, and visuo-speech vocal conditional learning periods compared to control in hemispheres with and without a pcgs. X, Y, and Z coordinates correspond to the coordinates of the increased activities in the MNI stereotaxic space. T-statistics are significant at corrected p < 0.05.