Diverging neural dynamics for syntactic structure building in naturalistic speaking and listening

Significance Neuroimaging studies of language processing usually focus on language comprehension, because language production is affected by increased motion artifacts and is challenging to control experimentally. Sentence production studies typically rely on task designs that impose strong constraints on speaking. Here, we studied brain responses to syntactic structure building during spontaneous production and naturalistic comprehension. We found brain responses to be sensitive to structure building in both production and comprehension, but with different temporal profiles in each modality. In production, structure was built early in a sentence, in an anticipatory way, while in comprehension structure building followed the input and was thus integratory. These results highlight the different dynamics of syntactic structure building during speaking and listening.


Dependency parsing
We used dependency parsing to chunk each sentence into relational units. We used the dependency parser provided by the Stanford parser via CoreNLP. We identified the heads of the sentence as the words that have a dependency relation attached to them. For example, in the sentence "He's examining one of the bodies", "examining" is the first head, followed by "one" and by "bodies" (Fig. S1). These three words are the only ones that have a dependency relation attached to them, while the other words are all dependents of one head. The chunked parsing strategy counted the nodes intervening between all heads, to model a less incremental strategy for syntactic structure building, following the idea that speakers plan the structure of a few words at a time (e.g. always planning the structure of the verb at the start of the sentence).
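As a minimal sketch of the head-identification step, the parse of the example sentence can be represented as dependent-to-head arcs; the arcs below are hand-coded to mirror the Fig. S1 example rather than produced by CoreNLP:

```python
def find_heads(arcs):
    """Return token indices that head at least one dependency relation."""
    return sorted({h for h in arcs.values() if h >= 0})

# "He 's examining one of the bodies": dependent index -> head index
# (0-indexed; the root "examining" has no head and appears only as a value).
tokens = ["He", "'s", "examining", "one", "of", "the", "bodies"]
arcs = {0: 2, 1: 2, 3: 2, 4: 6, 5: 6, 6: 3}

heads = find_heads(arcs)
print([tokens[i] for i in heads])  # -> ['examining', 'one', 'bodies']
```

All remaining tokens appear only as dependents, matching the three heads described above.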

Temporal derivative
Methods. Since the results of the production-specific parsers indicated that the LpMTG may have had a later response to top-down operations than BA45, we examined how the temporal derivatives of the top-down and bottom-up parsers modelled the data. The temporal derivative of the haemodynamic response function (HRF) is commonly used in fMRI analysis to account for small differences in the latency of the BOLD response. An increase in the temporal derivative means that the BOLD response peaks earlier, while a decrease indicates a later peak. We ran the same linear mixed-effects model used before with the addition of the temporal derivatives of all predictors of interest (Table S5).
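As a rough illustration (not the analysis code used in the study), the latency shift captured by a derivative term can be sketched with a double-gamma HRF; the specific gamma parameters below are a common SPM-like assumption, not necessarily the HRF used here:

```python
import numpy as np
from scipy.stats import gamma

def hrf(t):
    """Double-gamma HRF: a positive peak near 5 s minus a late undershoot."""
    return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

t = np.arange(0, 30, 0.1)
h = hrf(t)
dh = np.gradient(h, t)  # temporal derivative of the HRF

# A positive weight on the derivative shifts the modelled peak earlier;
# a negative weight shifts it later (to first order, h + c*h' ≈ h(t + c)).
earlier = h + 0.5 * dh
later = h - 0.5 * dh
print(t[earlier.argmax()], t[h.argmax()], t[later.argmax()])
```

This is the sense in which a positive derivative estimate indicates an earlier BOLD peak than the canonical HRF assumes.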
Results. We found a significant three-way interaction between the top-down derivative, modality and ROI (χ² = 16.6, p < 0.0003) (Fig. S2); the surprisal derivative, modality and ROI (χ² = 16.9, p < 0.0003); and the open nodes derivative, modality and ROI (χ² = 10.7, p < 0.005), and a two-way interaction between the sentence offset derivative and ROI (χ² = 9.3, p < 0.01). The top-down temporal derivative was significantly positive in BA45 in production (estimate = 0.39, SE = 0.16, p = 0.02), indicating an earlier BOLD peak than assumed by the canonical HRF, while it was significantly negative in LpMTG (estimate = -0.33, SE = 0.17, p < 0.05), indicating a later peak. It was not significantly different from zero in comprehension, nor did it differ between ROIs. These results suggest that the LpMTG may have been active after BA45 in response to more top-down node counts.
Sentence offset elicited later responses in comprehension in the LpMTG (estimate = 0.6, SE = 0.16, p = 0.001), suggesting that the effect of sentence offset was sustained for some time after the end of the sentence. Word surprisal elicited later BOLD peaks in comprehension and earlier BOLD peaks in production, relative to the canonical HRF (difference estimate = 0.8, SE = 0.24, p = 0.007). In comprehension, BA45 and LpMTG were both significantly related to a decrease in activity (BA45: estimate = 0.43, SE = 0.16, p = 0.007; LpMTG: estimate = 0.61, SE = 0.16, p = 0.0001). In production, activity in both BA44 and BA45 increased with the temporal derivative for surprisal (BA44: estimate = 0.49, SE = 0.2, p = 0.023; BA45: estimate = 0.96, SE = 0.2, p < 0.0001). These results suggest that word surprisal elicited earlier activity increases in production than in comprehension, which likely relates to the timing of lexical access (before word onset in production, after word onset in comprehension). The open nodes measure showed earlier BOLD responses in the LpMTG in production (estimate = 1.03, SE = 0.5, p = 0.035), and later responses in BA44 and BA45 in production (BA44: estimate = 1.3, SE = 0.5, p = 0.008; BA45: estimate = 1.3, SE = 0.5, p = 0.02). Again, BA45 and the LpMTG had different BOLD peak latencies, suggesting that BA45 responded earlier to top-down nodes but later to open nodes, while the LpMTG responded earlier to open nodes and later to top-down nodes.

Additional ROI analysis
For comparability with previous studies, we ran the same analysis presented in "Distinct dynamics for phrase-structure building in language production vs. comprehension" in a few additional regions often found to respond to language processing, but not known to be strongly involved in syntactic processing: left anterior temporal lobe (LATL), right anterior temporal lobe (RATL), left inferior parietal lobule (LIPL), and left middle frontal gyrus (LMFG) (1-3).
Methods. As done for the main regions of interest, we made anatomical masks of these regions using the Harvard-Oxford atlas (maximum probability 25%). The LATL and RATL masks included the labels for the temporal pole and the anterior superior, middle and inferior temporal gyri. The LIPL mask included the left posterior supramarginal gyrus and the angular gyrus. The LMFG mask included the left middle frontal gyrus. We then used only the grey-matter voxels in functional space for each participant, using FreeSurfer's aparc.a2009s grey-matter mask. We averaged the timeseries for each of these regions and ran the analysis with linear mixed-effects models as specified in the fMRI analysis methods section. Instead of using ROI as a fixed effect, we ran separate analyses for each region (the values reported below are not corrected for the comparison of four ROIs; thus p < 0.0125 indicates a significant effect after correction).
Activity in the RATL increased as a function of words (χ² = 67.9, p < 0.0001), syllables (χ² = 4.2, p < 0.05), the top-down parser (χ² = 8.5, p < 0.004) and sentence offset (χ² = 7.2, p < 0.008), and decreased for sentence onset (χ² = 4.4, p < 0.04). There were also interactions between modality and bottom-up (χ² = 4.3, p < 0.04), modality and word surprisal (χ² = 4.2, p < 0.04), and modality and sentence offset (χ² = 6.1, p < 0.02). Pairwise comparisons indicated a higher response to bottom-up counts in production than comprehension (estimate = 0.5, p = 0.038), a higher response to word surprisal in comprehension than production (estimate = 0.11, p = 0.039), and a higher response to sentence offset in comprehension than production (estimate = 3.5, p = 0.013). Therefore, the RATL showed a different pattern of responses to the syntactic predictors than the other regions, suggesting that the computations taking place in the right ATL may differ in timing pressures from the left-lateralized regions of interest.
Overall, all regions except the LIPL showed some sensitivity to syntactic predictors that broadly matched the responses in the LpMTG and LIFG, with differences in sensitivity to top-down and bottom-up counts between production and comprehension. The response of these regions to the syntactic predictors may reflect their contribution to overall sentence-level compositional processes.

Speech fluency
Methods. We analysed whether word and syntactic predictors also explained variance in word duration and in the length of the pause preceding each word in the production data. Pauses were defined as the interval between the offset of the previous word and the onset of the current word. We used linear mixed-effects models with pause length or word duration as the dependent variable. We used word frequency, word surprisal, number of syllables, top-down, bottom-up and open nodes predictors as fixed effects and as by-participant random slopes (Fig. S3).
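For illustration, a model of this form can be sketched with Python's statsmodels (a sketch on synthetic data; the study's actual software, predictors, and variable names may differ, and the simulated effect size is arbitrary):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_words = 10, 200
rows = []
for s in range(n_subj):
    top_down = rng.poisson(2, n_words)
    # Synthetic pause lengths (s): longer pauses for larger top-down counts.
    pause = 0.2 + 0.09 * top_down + rng.normal(0, 0.1, n_words)
    rows.append(pd.DataFrame({"subject": s, "top_down": top_down,
                              "pause": pause}))
data = pd.concat(rows, ignore_index=True)

# Fixed effect of top_down plus a by-participant random slope,
# analogous to the reported mixed-effects structure.
model = smf.mixedlm("pause ~ top_down", data, groups=data["subject"],
                    re_formula="~top_down")
fit = model.fit()
print(fit.params["top_down"])  # close to the simulated slope of 0.09
```

The fixed-effect estimate recovers the simulated relation between top-down counts and pause length.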
We also determined how predictors of syntactic complexity related to speech fluency. Top-down node counts predicted the largest decrease in word duration (β = -0.045, SE = 0.002, t = 20.1, χ² = 404.1, p < 0.0001), suggesting that when phrases are opened, information can be conveyed faster, possibly to offload working memory. They also predicted the largest increase in pause length before the word in question is uttered (β = 0.09, SE = 0.008, t = 10.9, χ² = 119.7, p < 0.0001), suggesting that grammatical encoding related to a word is performed before word articulation, and that nodes are built in an anticipatory way. Bottom-up parser operations predicted the opposite pattern. Larger bottom-up counts increased word duration (β = 0.012, SE = 0.0009, t = 12.6, χ² = 159.5, p < 0.0001), but decreased pause length (β = 0.021, SE = 0.002, t = 11.6, χ² = 135.7, p < 0.0001). The shorter pauses suggest that at phrase closing the structure is already computed. Finally, open nodes predicted a significant but very small decrease in word duration (β = -0.002, SE = 0.0004, t = 5.3, χ² = 28.5, p < 0.0001), and a larger decrease in pause length (β = -0.024, SE = 0.002, t = 14.6, χ² = 213.9, p < 0.0001), suggesting easier processing further along in the sentence. In line with the neuroimaging results, this pattern suggests that phrase-structure building happens before word articulation and at phrase opening, with pauses decreasing further along in the sentence.
This was the first study to show an increase in neural activity for words associated with higher surprisal, not only in comprehension but also in production. Many studies have shown sensitivity of brain activity to surprisal in language comprehension, computed with several models (4, 5). The neural results are in line with the behavioural results, which show an increase in pause length before less probable words and a small increase in their duration, as found previously (6). The results thus converge in demonstrating the sensitivity of the production system to the statistical probabilities of the linguistic input and output, in both behavioural and neural patterns. This finding is in line with accounts of efficient language production that propose a uniform distribution of information in discourse (Uniform Information Density; 7-10). More informative units (in information-theoretic terms, i.e. larger surprisal in the current study) take more time in discourse, while redundant units can be uttered faster or eliminated (e.g. optional words like the complementizer "that"; 7).
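Surprisal, the information-theoretic quantity behind these effects, is the negative log probability of a word given its context. The study derived it from language models; the toy bigram estimate below merely illustrates the definition on a tiny hand-made corpus:

```python
import math
from collections import Counter

corpus = "the man saw the dog the dog saw the man".split()
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])  # how often each word occurs as a context

def surprisal(prev, word):
    """Bigram surprisal in bits: -log2 P(word | prev)."""
    return -math.log2(bigrams[(prev, word)] / contexts[prev])

# "man" follows "the" on 2 of 4 occurrences, so P = 0.5 -> 1 bit.
print(surprisal("the", "man"))  # -> 1.0
```

Less probable continuations yield larger surprisal values and, per the results above, longer pauses before the word.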

Data Acquisition
The acquisition parameters were identical in the two datasets. MRI data were collected on a 3T full-body scanner (Siemens Skyra) with a 20-channel head coil. Functional images were acquired using a T2*-weighted echo planar imaging pulse sequence (TR 1500 ms, TE 28 ms, flip angle 64, whole-brain coverage with 27 slices of 4 mm thickness, in-plane resolution 3 × 3 mm², FOV 192 × 192 mm²). Anatomical images were acquired using a T1-weighted MPRAGE pulse sequence (0.89 mm³ resolution).

fMRI preprocessing
Preprocessing was performed using fMRIPrep 20.2.6 (11), which is based on Nipype 1.7.0 (12, 13).

Anatomical data preprocessing
The T1-weighted (T1w) image was corrected for intensity non-uniformity (INU) with N4BiasFieldCorrection (14), distributed with ANTs 2.3.3 (15), and used as T1w-reference throughout the workflow. The T1w-reference was then skull-stripped with a Nipype implementation of the antsBrainExtraction.sh workflow (from ANTs), using OASIS30ANTs as the target template. Brain tissue segmentation of cerebrospinal fluid (CSF), white matter (WM) and gray matter (GM) was performed on the brain-extracted T1w using fast (FSL 5.0.9; 16). Brain surfaces were reconstructed using recon-all (FreeSurfer 6.0.1; 17), and the brain mask estimated previously was refined with a custom variation of Mindboggle's method to reconcile ANTs-derived and FreeSurfer-derived segmentations of the cortical gray matter (18).

Functional data preprocessing
For each BOLD run, the following preprocessing was performed. First, a reference volume and its skull-stripped version were generated using a custom methodology of fMRIPrep. Susceptibility distortion correction (SDC) was omitted because no fieldmap was acquired. The BOLD reference was then co-registered to the T1w reference using bbregister (FreeSurfer), which implements boundary-based registration (19). Co-registration was configured with six degrees of freedom. Head-motion parameters with respect to the BOLD reference (transformation matrices, and six corresponding rotation and translation parameters) were estimated before any spatiotemporal filtering using mcflirt (FSL 5.0.9; 20). The BOLD time-series (including slice-timing correction when applied) were resampled onto their original, native space by applying the transforms to correct for head motion. These resampled BOLD time-series will be referred to as preprocessed BOLD in original space, or just preprocessed BOLD. The BOLD time-series were resampled into standard space, generating a preprocessed BOLD run in MNI152NLin2009cAsym space. Automatic removal of motion artifacts using independent component analysis (ICA-AROMA; 21) was performed on the preprocessed BOLD time-series in MNI space after removal of non-steady-state volumes and spatial smoothing with an isotropic Gaussian kernel of 6 mm FWHM (full-width at half-maximum). The "aggressive" noise regressors were collected and placed in the corresponding confounds file. Several confounding time-series were calculated based on the preprocessed BOLD: framewise displacement (FD), the derivative of the relative (frame-to-frame) bulk head motion variance (DVARS) and three region-wise global signals. FD was computed using two formulations, following Power (absolute sum of relative motions; 22) and Jenkinson (relative root mean square displacement between affines; 20). FD and DVARS were calculated for each functional run, both using their implementations in Nipype (following the definitions by 22). Additionally, a set of physiological regressors were extracted to allow for component-based noise correction (CompCor; 23). Principal components were estimated after high-pass filtering the preprocessed BOLD time-series (using a discrete cosine filter with a 128 s cut-off) for anatomical CompCor (aCompCor). For aCompCor, three probabilistic masks (cerebrospinal fluid (CSF), white matter (WM) and combined CSF+WM) were generated in anatomical space.
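The Power formulation of FD mentioned above can be sketched as follows (a simplified illustration, not fMRIPrep's implementation; the conventional 50 mm head radius for converting rotations to displacements is an assumption):

```python
import numpy as np

def framewise_displacement(motion, radius=50.0):
    """FD following Power et al.: absolute sum of frame-to-frame motion.

    motion: (T, 6) array of 3 translations (mm) and 3 rotations (radians);
    rotations are converted to arc length on a sphere of `radius` mm.
    """
    deltas = np.abs(np.diff(motion, axis=0))
    deltas[:, 3:] *= radius  # radians -> mm of displacement
    return np.concatenate([[0.0], deltas.sum(axis=1)])  # FD = 0 at frame 0

motion = np.zeros((3, 6))
motion[1, 0] = 0.5   # 0.5 mm translation at frame 1
motion[2, 3] = 0.01  # 0.01 rad rotation at frame 2
print(framewise_displacement(motion))  # -> [0.  0.5 1. ]
```

Frame 2 combines the undone 0.5 mm translation with 0.01 rad × 50 mm = 0.5 mm of rotational displacement, giving FD = 1.0 mm.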

Fig. S1.
Fig. S1. Dependency parse of the sentence. Left relations are in orange, right relations in green, and heads are in purple. Heads are words on which a dependency relation is attached (i.e. from which an arrow starts).

Fig. S2.
Fig. S2. Beta estimates for the effect of the temporal derivative of each predictor on BOLD activity in the regions of interest. Error bars represent the standard error of the mean. Positive estimates indicate an earlier BOLD response; negative estimates indicate a later BOLD response. Note that the y-axis range differs between plots.

Fig. S3.
Fig. S3. Beta estimates for the effect of each predictor on BOLD activity in additional regions. Error bars represent the standard error of the mean. Note that the y-axis range differs between plots.

Fig. S4.
Fig. S4. Estimates in seconds of the effect of each predictor of word characteristics and phrase-structure building on word durations and pause length before word articulation. Error bars represent the standard error of the mean. Individual points represent each participant's estimate, as given by the random slopes. The model estimated identical random slopes of number of syllables on pause length for each participant.

Fig. S5.
Fig. S5. Correlation matrix showing Pearson's r correlation among all predictors after they were convolved with the haemodynamic response function (corresponding to 16338 individual time points). Note that not all predictors were used in the same model.

Fig. S6.
Fig. S6. Correlation matrix showing Pearson's r correlation among all predictors before they were convolved with the haemodynamic response function (51606 words across production and comprehension). Note that the word rate predictor is not present because it is a vector of 1s (after convolution it captures how often words are said/heard, based on their onset times).

Table S1.
Summary of the model output of BOLD activity in production and comprehension. ROI1 refers to the contrast BA44 vs. BA45; ROI2 refers to the contrast BA44 & BA45 vs. pMTG. Mod stands for modality. AIC stands for Akaike Information Criterion, used for the production-only models to determine model fit.

Table S2.
Summary of the model output of BOLD activity in production with the top-down predictor. ROI1 refers to the contrast BA44 vs. BA45; ROI2 refers to the contrast BA44 & BA45 vs. pMTG. Mod stands for modality. AIC stands for Akaike Information Criterion, used for the production-only models to determine model fit.

Table S3.
Summary of the model output of BOLD activity in production with the early top-down predictor. ROI1 refers to the contrast BA44 vs. BA45; ROI2 refers to the contrast BA44 & BA45 vs. pMTG. Mod stands for modality. AIC stands for Akaike Information Criterion, used for the production-only models to determine model fit.

Table S4.
Summary of the model output of BOLD activity in production with the chunked top-down predictor. ROI1 refers to the contrast BA44 vs. BA45; ROI2 refers to the contrast BA44 & BA45 vs. pMTG. Mod stands for modality. AIC stands for Akaike Information Criterion, used for the production-only models to determine model fit.

Table S5.
Summary of the model output of BOLD activity in production and comprehension, including the temporal derivative (der) of all predictors of interest. ROI1 refers to the contrast BA44 vs. BA45; ROI2 refers to the contrast BA44 & BA45 vs. pMTG. Mod stands for modality. AIC stands for Akaike Information Criterion, used for the production-only models to determine model fit.

Table S6.
Summary of the model output of the pause length preceding each word's production. AIC stands for Akaike Information Criterion, used for the production-only models to determine model fit.

Table S7.
Summary of the model output of word duration. AIC stands for Akaike Information Criterion, used for the production-only models to determine model fit.