Mapping visual symbols onto spoken language along the ventral visual stream

Significance Learning to read is the most important milestone in a child’s education. However, controversies remain regarding how readers’ brains transform written words into sounds and meanings. We address these by combining artificial language learning with neuroimaging to reveal how the brain represents written words. Participants learned to read new words written in 2 different alphabets. Following 2 wk of training, we found a hierarchy of brain areas that support reading. Letter position is represented more flexibly from lower to higher visual regions. Furthermore, higher visual regions encode information about word sounds and meanings. These findings advance our understanding of how the brain comprehends language from arbitrary visual symbols.

Symbols varied in width and were left and bottom aligned when constructing written forms, ensuring a similar gap between symbols. Words therefore varied in width, but were centered on a white background of 320 × 112 pixels.
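For illustration, the following is a minimal sketch (in Python with Pillow) of composing variable-width symbol images left and bottom aligned with a fixed gap and centring the result on a 320 × 112 white canvas. The GAP value, the compose_word helper, and centring on both axes are illustrative assumptions, not the actual stimulus-generation code.

```python
from PIL import Image

CANVAS_SIZE = (320, 112)  # white background size given above
GAP = 4                   # assumed inter-symbol gap in pixels (not specified)

def compose_word(symbol_imgs):
    """Lay symbol images out left-to-right, bottom aligned, with a fixed gap,
    then centre the composed word on a white canvas (centring on both axes
    is an assumption here)."""
    word_w = sum(im.width for im in symbol_imgs) + GAP * (len(symbol_imgs) - 1)
    word_h = max(im.height for im in symbol_imgs)
    word = Image.new("RGB", (word_w, word_h), "white")
    x = 0
    for im in symbol_imgs:
        word.paste(im, (x, word_h - im.height))  # bottom aligned
        x += im.width + GAP
    canvas = Image.new("RGB", CANVAS_SIZE, "white")
    canvas.paste(word, ((CANVAS_SIZE[0] - word_w) // 2,
                        (CANVAS_SIZE[1] - word_h) // 2))
    return canvas
```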
Semantic forms. Two sets of 24 familiar objects were selected (pictures and English names).
Each set comprised 6 fruits or vegetables, 6 vehicles, 6 animals, and 6 tools. For each participant, for one language (henceforth the systematic language), the semantic categories were assigned to the trained items systematically according to the final symbol (e.g., animals were assigned to items ending in one symbol, tools to another symbol, etc.). For the other language (henceforth the arbitrary language), meanings were assigned to trained items arbitrarily, such that each final symbol occurred with equal probability in each semantic category. The assignment of orthography to the systematic or arbitrary language was counterbalanced across participants. Note that the findings reported in the current manuscript were part of a larger behavioural study concerned with the learning and generalisation of spelling-to-sound and spelling-to-meaning regularities; comparisons of systematic versus arbitrary orthography-semantic mappings are therefore not reported here.¹

Generalisation items. For each trained item, an untrained item was created that differed in either the vowel or the final consonant, as well as in the final silent symbol. These were used to assess extraction of symbol-sound mappings at the end of training.

¹ A reviewer asked whether the orthography-semantic systematicity of the final letter impacted the results reported in the current manuscript. At the end of training, saying the meanings of the artificial written words, a task similar to that used in the scanner, was equivalent in accuracy, t(22) = 1.10, ns, and response times, t(22) < 1, ns, for the systematic (mean accuracy = 93%, RT = 2182 ms) and the arbitrary (mean accuracy = 91%, RT = 2133 ms) language (note that one participant misunderstood the test task and so is excluded from these analyses). Furthermore, there was no evidence from exploratory analyses of the neuroimaging data that the systematicity of the final letter enhanced the representation of letter identity or position information (correlations between the neural data and the position-specific and spatial coding models were no greater for the systematic than for the arbitrary language).
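As an illustration of the two mapping schemes, here is a minimal Python sketch. The item names, symbol labels, and the per-group cycling used to balance the arbitrary assignment are hypothetical; the actual stimulus lists and counterbalancing procedure are not reproduced here.

```python
import random

def systematic_assignment(items_by_symbol, categories, rng):
    """Systematic language: every item sharing a final symbol receives the
    same semantic category (e.g. all items ending in one symbol are animals)."""
    symbols = list(items_by_symbol)
    rng.shuffle(symbols)
    return {item: cat
            for sym, cat in zip(symbols, categories)
            for item in items_by_symbol[sym]}

def arbitrary_assignment(items_by_symbol, categories, rng):
    """Arbitrary language: categories are cycled within each final-symbol
    group, with a per-group offset so that each final symbol co-occurs (as
    evenly as possible) with every category and category sizes stay balanced."""
    assignment = {}
    for offset, (sym, items) in enumerate(items_by_symbol.items()):
        shuffled = items[:]
        rng.shuffle(shuffled)
        for i, item in enumerate(shuffled):
            assignment[item] = categories[(i + offset) % len(categories)]
    return assignment

# Hypothetical inventory: 4 final symbols x 6 items each, 4 categories.
rng = random.Random(0)
items = {sym: [f"{sym}_word{i}" for i in range(6)] for sym in "wxyz"}
cats = ["fruit/veg", "vehicle", "animal", "tool"]
print(systematic_assignment(items, cats, rng))
print(arbitrary_assignment(items, cats, rng))
```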

Behavioural training and testing procedure
Spoken responses for all tasks were recorded and manually coded for accuracy and RT.
Keyboard and mouse accuracy and RTs were recorded by E-Prime.
Phonology-to-Semantic pre-training. Before beginning the orthography training, participants learned the spoken-form-to-meaning associations for the 24 pseudowords from each language, using the procedure described in (1). At the end of three pre-training runs for each language, they achieved good accuracy both for saying the novel word to match the picture (Language 1 = 60% (SD = 22%), Language 2 = 64% (SD = 21%)) and for saying the meaning to match the heard novel word (Language 1 = 63% (SD = 22%), Language 2 = 62% (SD = 20%)).

Orthography training.
Participants learned about the two orthographies for ~1.5 hours per day, on nine consecutive days (excluding weekends). Four tasks were completed each day for each orthography, and the order in which the tasks and orthographies were presented was varied across days.

i) Reading aloud (24 trials, 4 repetitions). The orthographic forms of each of the 24 trained items were presented in a randomised order. Participants read them aloud, i.e., said their pronunciation in the new language, and then pressed spacebar to hear the correct answer.
ii) Saying the meaning (24 trials, 4 repetitions). As for reading aloud, but participants said the English meaning of each item aloud, and then pressed spacebar to move on to the next item.
iii) Generalisation (24 trials). As for reading aloud, but participants were presented with untrained items and said their pronunciation.

MRI acquisition

[...] time. Acquisition was transverse oblique, angled to avoid the eyes and to achieve whole-brain coverage including the cerebellum. In a few cases the top of the parietal lobe was not covered. In each scan session a T1-weighted structural volume was also acquired using a magnetization prepared rapid acquisition gradient echo protocol (TR = 2250 ms, TE = 2.99 ms, flip angle = 9 degrees, 1 mm slice thickness, 256 × 240 × 192 matrix, resolution = 1 mm isotropic).
Two runs were collected on each day, and 438 images were acquired in each run.
Image processing and statistical analyses were performed using SPM8 (Wellcome Trust Centre for Neuroimaging, London, UK). The first 6 volumes of each run were discarded to allow for equilibration effects. Slice timing correction was applied, referenced to the middle slice. Images for each participant were realigned to the first image in the run (2). For univariate analyses, images were coregistered to the structural image collected on the same day as scanning, prior to normalization. For multivariate analyses, all functional images were coregistered to the structural image collected on the first scan day, since subsequent analyses of these data were conducted in native space (3). For both uni- and multivariate analyses, the origins of all functional and structural images were then manually registered to the anterior commissure. The transformation required to bring a participant's structural T1 image into standard Montreal Neurological Institute (MNI) space was calculated using tissue probability maps (4). For univariate analyses, these warping parameters were applied to all functional images for that participant. Normalised functional images were then re-sampled to 2 mm isotropic voxels, and the data were spatially smoothed with an 8 mm full-width at half-maximum isotropic Gaussian kernel prior to model estimation. For multivariate analyses, we used unsmoothed native-space images.
Data from each participant were entered into general linear models for event-related analysis (5). In all models, events were convolved with the SPM8 canonical hemodynamic response function. Movement parameters estimated at the realignment stage of pre-processing were added as regressors of no interest, in addition to the session mean. Low-frequency drifts were removed with a high-pass filter (128 s), and an AR(1) correction for serial autocorrelation was applied.
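As a sketch of this modelling step, the following Python code builds one event regressor by convolving onset sticks with a double-gamma canonical HRF and stacks it with movement regressors and a session mean. The HRF parameters follow common SPM-style defaults, and the TR, onsets, and motion values are placeholders, not the actual acquisition parameters.

```python
import numpy as np
from scipy.stats import gamma

TR = 2.0        # placeholder repetition time in seconds (not stated above)
N_SCANS = 432   # 438 images per run minus the 6 discarded volumes

def canonical_hrf(tr, duration=32.0):
    """Double-gamma HRF sampled at the TR, using common SPM-style defaults
    (response peak ~6 s, undershoot ~16 s, undershoot ratio 1/6)."""
    t = np.arange(0.0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return hrf / hrf.sum()

def event_regressor(onsets_s, tr, n_scans):
    """Delta (stick) functions at event onsets, convolved with the HRF."""
    sticks = np.zeros(n_scans)
    sticks[(np.asarray(onsets_s) / tr).astype(int)] = 1.0
    return np.convolve(sticks, canonical_hrf(tr))[:n_scans]

# Design matrix: one column per event type, six realignment (movement)
# parameters as regressors of no interest, and the session mean.
onsets = [12.0, 48.0, 96.0]          # illustrative onsets in seconds
motion = np.zeros((N_SCANS, 6))      # stand-in for realignment output
X = np.column_stack([event_regressor(onsets, TR, N_SCANS),
                     motion,
                     np.ones(N_SCANS)])
```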

Justification of multiple regression method
We used multiple regression rather than partial correlation because we sought to determine the variance in the neural response patterns independently explained by each DSM across each region. Multicollinearity diagnostics indicated that the correlation between the similarity values for the two coding schemes was not high enough to preclude multiple regression, Spearman r(552) = .86, VIF = 4.07 (i.e., VIF < 10).
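A minimal Python sketch of these diagnostics and of the regression itself is shown below; with exactly two predictors the VIF reduces to 1/(1 − r²), and the rank transform mirrors the Spearman correlations used here. The helper names are ours, not part of any analysis package.

```python
import numpy as np
from scipy.stats import spearmanr, rankdata

def lower_triangle(dsm):
    """Vectorize the off-diagonal lower triangle of a square DSM."""
    i, j = np.tril_indices_from(dsm, k=-1)
    return dsm[i, j]

def vif_two_predictors(x1, x2):
    """With exactly two predictors, VIF = 1 / (1 - r^2)."""
    r, _ = spearmanr(x1, x2)
    return 1.0 / (1.0 - r**2)

def regression_rsa(neural_dsm, predictor_dsms):
    """Rank-transform the DSM vectors, then fit the predictors jointly by
    least squares, returning one beta per predictor (intercept last)."""
    y = rankdata(lower_triangle(neural_dsm))
    X = np.column_stack([rankdata(lower_triangle(p)) for p in predictor_dsms])
    X = np.column_stack([X, np.ones_like(y)])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    return betas
```

For example, with r(552) = .61 this formula gives VIF = 1/(1 − .61²) ≈ 1.59, close to the open-bigram diagnostic reported below.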

Open-bigram coding analysis
We conducted a similar analysis using a predicted DSM derived from open-bigram coding (6), in which similarity between items depends on shared same-order letter pairs, whether contiguous or non-contiguous (values generated using Match Calculator). Multicollinearity diagnostics for the position-specific and open-bigram coding DSMs were r(552) = .61, VIF = 1.58. Results (SI Appendix, Table S6, Figures S1 and S2) were broadly similar to those of the analysis including the position-specific and spatial coding predicted DSMs (although open-bigram coding accounted for independent variance in fewer of the right-hemisphere vOT ROIs).
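For concreteness, here is a minimal Python sketch of open-bigram similarity. The Dice-style normalization is one common choice and is an assumption on our part; Match Calculator's exact formula may differ.

```python
from itertools import combinations
from collections import Counter

def open_bigrams(word):
    """All same-order letter pairs, contiguous or not (as a multiset,
    so repeated letters are handled)."""
    return Counter(word[i] + word[j]
                   for i, j in combinations(range(len(word)), 2))

def open_bigram_match(w1, w2):
    """Dice-style overlap of shared open bigrams."""
    b1, b2 = open_bigrams(w1), open_bigrams(w2)
    shared = sum((b1 & b2).values())  # multiset intersection
    return 2 * shared / (sum(b1.values()) + sum(b2.values()))

# A predicted DSM entry is 1 - match; transposed-letter pairs stay similar:
print(1 - open_bigram_match("abcd", "abdc"))  # ~0.17
```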

SI Discussion
The present experiment did not attempt to dissociate sensitivity to within-word position from sensitivity to retinal location since written words were always presented at fixation.
However, these factors were not entirely confounded, because letters varied in width and words were constructed to ensure a similar gap between letters. Words therefore also varied in width, so retinal location was not exactly the same even for the same letter in the same within-word position. Thus, even correlations between the position-specific letter DSM and the neural response patterns may indicate a degree of tolerance to shifts in retinal location.

Table S1
Brain regions active when viewing learned words, relative to an un-modelled resting baseline, for 24 participants. The top 20 peaks > 8 mm apart are reported at a threshold of p < .001 uncorrected and p < .05 FWE cluster corrected. Bold text denotes the first peak within a cluster. Anatomical labels in this and all subsequent tables were generated using the automated anatomical labelling template (7).

Table S5
Results of second-level one-sample t-tests on Fisher-transformed Spearman rank correlations between predicted and neural DSMs. For each participant, correlations were extracted from left- and right-hemisphere vOT ROIs following whole-brain searchlight analyses. The predicted DSMs are a visual model computed using the simple cell representations from the HMAX model (1 − correlation between S1 layer representations of item pairs), a position-specific letter model (1 − proportion of same-position letters shared between items), and a more position-invariant letter model (1 − spatial code similarity). Correlations that are significantly greater than zero (one-tailed t-test) are shown in bold.
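As an illustration of how such predicted DSMs are built, the sketch below computes the position-specific letter model entry (1 − proportion of same-position letters). Normalizing by the longer word is our assumption; the spatial coding similarity is more involved and is not reproduced here.

```python
import numpy as np

def position_specific_dissim(w1, w2):
    """1 minus the proportion of positions at which the two words share the
    same letter (normalized by the longer word, an assumption here)."""
    matches = sum(a == b for a, b in zip(w1, w2))
    return 1.0 - matches / max(len(w1), len(w2))

def predicted_dsm(words):
    """Square predicted DSM over a word list."""
    n = len(words)
    dsm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dsm[i, j] = position_specific_dissim(words[i], words[j])
    return dsm
```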