BIOLOGICAL SCIENCES / NEUROSCIENCE
Musical intervals in speech
Center for Cognitive Neuroscience and Department of Neurobiology, Duke University, Durham, NC 27708
Contributed by Dale Purves, April 5, 2007 (received for review January 29, 2007)
| Abstract |
|---|
language | music | formants | scales | perception
Intuitively, the most obvious place to look for musical intervals in human vocalizations would be in vocal prosody, i.e., the rising and falling pitches that characterize normal speech. When we examined recorded speech from this perspective, however, we failed to find any definitive evidence of musical intervals [see supporting information (SI) Text]. We thus turned to the possibility that the intervals of the chromatic scale are embedded in the spectral relationships within speech sound stimuli (called phones) that differentiate the phonemes perceived (4).
The periodicity in speech sound stimuli is generated primarily by the repeating peaks of energy in the vocal air stream produced by oscillations of the vocal folds in the larynx. The intensity carried by the harmonic series produced in this way is altered, however, by the resonance frequencies of the rest of the vocal tract, which change dynamically in response to neurally controlled movements of the soft palate, tongue, lips and other articulators (Fig. 1A). These variable vocal tract resonances, called formants, modulate the harmonic series generated by the laryngeal oscillations by suppressing some harmonics more than others (4, 5, 7, 8).
When coupled with unvoiced speech sounds (consonants), this modulation by the formants creates the different voiced speech sounds that give rise to the semantic content in all human languages. With respect to vowel phones, only the first two formants have a major influence on the vowel perceived: artificially removing them from vowel phones makes vowel phonemes largely indistinguishable, whereas removing the higher formants has little effect on the perception of speech sounds (see SI Text). Indeed, the first and second formants of vowel sounds of all languages fall within well defined frequency ranges (4, 7–12). The resonances of the first two formants are typically between 200–1,000 Hz and 800–3,000 Hz, respectively, their central values approximating the odd harmonics of the resonances of a tube 17 cm in length open at one end, the usual physical model of the adult vocal tract in a relaxed state (4, 5, 7, 8).
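The tube model mentioned above can be illustrated with a short calculation. The following is a minimal sketch, assuming the textbook quarter-wave formula for a uniform tube closed at one end and open at the other, and assuming a speed of sound of 350 m/s (a value not given in the text). The function name `tube_resonances` is illustrative only.

```python
# Resonances of a uniform tube closed at one end (glottis) and open at the
# other (lips): f_n = (2n - 1) * c / (4 * L), i.e. odd multiples of c / (4L).

SPEED_OF_SOUND_M_PER_S = 350.0  # assumption: warm, humid air in the vocal tract
TUBE_LENGTH_M = 0.17            # tube length given in the text


def tube_resonances(length_m, count, speed=SPEED_OF_SOUND_M_PER_S):
    """Return the first `count` resonance frequencies (Hz) of the tube."""
    base = speed / (4.0 * length_m)  # lowest (quarter-wave) resonance
    return [(2 * n - 1) * base for n in range(1, count + 1)]


if __name__ == "__main__":
    for i, freq in enumerate(tube_resonances(TUBE_LENGTH_M, 3), start=1):
        print(f"resonance {i}: {freq:.0f} Hz")
```

Under these assumptions the first two resonances come out near 515 Hz and 1,544 Hz, which lie inside the first and second formant ranges quoted above.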
| Results |
|---|
Fig. 2 shows representative examples from the database for the three "point vowels" in English, i.e., the vowels whose formants are furthest apart in the F1 x F2 plot (vowel space) typically used in psycholinguistic studies (7); the most intense harmonic in the first and second formants of each utterance is indicated. The inset keyboards show that when the harmonic peak of the first formant of any vowel utterance in the database is set to a note represented on a piano tuned in just intonation, the peaks of intensity in the second formant often, but not always, fall on another note on the keyboard. Thus the ratio of the second to the first formant often represents one of the ratios that define chromatic scale intervals.
We next assessed whether the bias toward a representation of chromatic scale ratios is specific to the modulation of the speech signal by the first two formants, or whether a similar bias would arise from an analysis of any harmonic series. This possibility arises because the ratios of the overtones of a harmonic series can, because they are integers, generate the 12 musical ratios that define the notes of the chromatic scale. SI Fig. 7 shows, however, that, when all possible comparisons of harmonic pairs up to the 26th harmonic are considered (the highest harmonic in the second formant; see Fig. 1), the occurrence of musical intervals falls to 36% and the prevalence of nonmusical intervals increases to 64% (compare Fig. 3). Thus, the biases in Fig. 3 are specific to speech.
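The control comparison described above can be sketched as an enumeration over harmonic pairs. This is only an illustration under stated assumptions: the just intonation table below is a commonly cited 5-limit set (the paper's own table is not reproduced in this section), the octave collapse is done by repeated halving, and matching is exact. It is therefore not expected to reproduce the 36%/64% split reported in SI Fig. 7. The names `JUST_RATIOS`, `collapse_to_octave`, and `musical_fraction` are hypothetical.

```python
from fractions import Fraction

# Assumption: a common 5-limit just intonation table, with the octave folded
# onto 1:1. The paper's exact ratio table may differ.
JUST_RATIOS = {
    Fraction(1, 1), Fraction(16, 15), Fraction(9, 8), Fraction(6, 5),
    Fraction(5, 4), Fraction(4, 3), Fraction(45, 32), Fraction(3, 2),
    Fraction(8, 5), Fraction(5, 3), Fraction(9, 5), Fraction(15, 8),
}


def collapse_to_octave(ratio):
    """Halve or double a ratio until it lies in the interval [1, 2)."""
    while ratio >= 2:
        ratio /= 2
    while ratio < 1:
        ratio *= 2
    return ratio


def musical_fraction(max_harmonic):
    """Share of harmonic pairs (a < b <= max_harmonic) matching JUST_RATIOS."""
    pairs = [
        (a, b)
        for a in range(1, max_harmonic + 1)
        for b in range(a + 1, max_harmonic + 1)
    ]
    hits = sum(
        1 for a, b in pairs
        if collapse_to_octave(Fraction(b, a)) in JUST_RATIOS
    )
    return hits / len(pairs)


if __name__ == "__main__":
    print(f"musical share up to harmonic 26: {musical_fraction(26):.2f}")
```

Using exact fractions rather than floating-point values avoids tolerance choices when deciding whether a pair lands on a scale ratio.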
More significantly, Fig. 4A shows that, if the available range of harmonics were less than the range found in speech, the number of chromatic intervals represented would be diminished; in the example shown, 2 of the 12 chromatic intervals are missing altogether and 3 others are only weakly represented. Conversely, if the ranges of the harmonic peaks of the first two formants were greater than the range found in human speech (see Fig. 1), the intervals of the chromatic scale would be diluted by additional nonmusical intervals (black bars, Fig. 4B). Thus, the full range of chromatic intervals with minimal dilution by other intervals is specifically determined by the neural control of the vocal articulators in speech production.
50-word monologues (see Methods). The results derived from an analysis of all voiced speech segments in the monologues were similar to the results for the single word utterances (Table 3 and Fig. 5).
| Discussion |
|---|
This conclusion is relevant to a number of unanswered questions in music, musicology, linguistics, and cognitive neuroscience. For example, if the source of musical intervals is indeed the formant ratios in speech, then the present results are pertinent to the longstanding argument in music about which of several tuning systems is "natural" (23). Insofar as the observations here inform this argument, the observed ratios in speech spectra accord most closely with a just intonation tuning system. Ten of the 12 intervals generated by the analysis of either English or Mandarin vowel spectra are those used in just intonation tuning, whereas 4 of the 12 match the Pythagorean tuning and only 1 of the 12 intervals matches those used in equal temperament.
The two anomalies in our data with respect to just intonation concern the minor second and the tritone. The interval ratio of the minor second defined by F2/F1 in speech is 1.0625 whereas, in just intonation (which is based on maintaining perfect fifths and major thirds in each octave), this interval is 1.0667. This difference occurs because 1.0667 is the ratio of 16:15, which does not occur in speech because the range of maximum intensity in the first formant peak extends only up to the 10th harmonic. Our value of 1.0625 for the minor second arises from formant ratios of 17:8, 17:4, and 17:2 (see Fig. 3 and Methods). Similarly, our value for the tritone is 1.400 whereas the just intonation value is 1.406. This difference arises because 1.406 is the ratio of 45:32, which again does not occur in speech, in this case because the range of maximum intensity in the second formant peak extends only up to the 26th harmonic. The value of 1.400 in speech arises from the F2/F1 ratios in the database of 7:5, 14:5, 21:5, 14:10, and 21:10 (see again Fig. 3 and Methods). In summary, just intonation tuning closely fits the chromatic scale defined by speech data.
The fact that instruments in just intonation tuning are widely agreed to sound "brighter" than in the equal temperament tuning used for the last three centuries (9) (a compromise that allows multiple instruments to play pieces that include notes in more than one key) is in keeping with our conclusion that the chromatic scale arises from formant ratios in speech.
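The arithmetic behind the minor second and tritone values discussed above can be checked with exact fractions. The sketch below assumes that intervals are brought into a single octave by repeated halving, and it uses the harmonic limits stated in the text (first formant peak up to the 10th harmonic, second up to the 26th). The helper names are hypothetical.

```python
from fractions import Fraction

MAX_F1_HARMONIC = 10  # upper limit for the first formant peak, from the text
MAX_F2_HARMONIC = 26  # upper limit for the second formant peak, from the text


def collapse_to_octave(ratio):
    """Halve a ratio repeatedly until it lies in the interval [1, 2)."""
    while ratio >= 2:
        ratio /= 2
    return ratio


def reachable_pairs(target):
    """List (F1 index, F2 index) pairs whose collapsed ratio equals `target`."""
    return [
        (a, b)
        for a in range(1, MAX_F1_HARMONIC + 1)
        for b in range(a, MAX_F2_HARMONIC + 1)
        if collapse_to_octave(Fraction(b, a)) == target
    ]


for name, ratio in [
    ("just minor second 16:15", Fraction(16, 15)),
    ("speech minor second 17:16", Fraction(17, 16)),
    ("just tritone 45:32", Fraction(45, 32)),
    ("speech tritone 7:5", Fraction(7, 5)),
]:
    print(f"{name} = {float(ratio):.4f}: {reachable_pairs(ratio)}")
```

Within these limits the search finds no harmonic pair for 16:15 or 45:32, whereas 17:16 is reachable through the 17th harmonic (for example 17:8, 17:4, and 17:2) and 7:5 is reachable, for example, as 7:5 or 14:5.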
A second fascinating question is whether the tonal preferences in the music of a culture can be rationalized in terms of the formant relationships of the voiced speech sounds prevalent in the relevant language. If the chromatic scale derives from experience with the formant relationships used to elicit different phonemes, then the speech sounds of a particular language might be expected to influence the subsets of the chromatic scale used in the music of that culture (24–27). Analyses of cultural scale preferences in relation to the spectral characteristics of the language or languages of a given culture should be possible using the approach described here.
A third question of interest concerns the widespread preference across cultures for diatonic (seven-note) and pentatonic (five-note) subsets of the chromatic scale in creating music (18–22, 27). The pentatonic scale in particular is the basis for much ethnic ("folk") music worldwide. It is noteworthy in this respect that, of the chromatic intervals in our data, 70% are components of the pentatonic scale and 80% of the diatonic scale (see SI Table 4). This prevalence suggests that the general preference for diatonic and pentatonic scales arises from the greater familiarity with these formant ratios in the speech of any language.
Further questions that can be explored in these terms arise from other aspects of the phenomenology of musical scales and their impact on listeners. For example, could the different emotional impact of major and minor musical scales be based on variations in the predominant intervals among vowel formants uttered in different physiological states (e.g., excitement versus the subdued physiology that characterizes sadness)? And what, in these terms, is the significance of the tonic anchor in musical composition and performance?
Finally, it will be of interest to examine in this framework how formant relationships in the vocalizations of nonhuman primates and other animals compare with those in humans, and what such evidence could indicate about the origins of both speech and music.
| Methods |
|---|
, æ, a,
, U, u/) and consonants (/b, d/) were chosen based on the rationale of Hillenbrand and Clark (28) (in particular, vowel phone intelligibility is maximized by this consonant framing). The words were spoken at a conversational level of intensity (70 dB) and speed (mean duration, 523 ms; SD = 159 ms) in an emotionally neutral manner. Each word was repeated seven times; by analyzing only the central five of these utterances, we could avoid onset and offset effects. Participants paused for 30 s between saying each of four differently ordered lists of the words. After a break of at least 30 min, this entire procedure was repeated four more times; thus, we obtained 100 samples of each of the eight words for each participant. In the Mandarin control, only six words representing the major vowels in this language (ba, ge, bo, bi, du, and jü) (29) were used; the words were spoken by three male and three female native speakers ranging in age from 22–31 years. The procedure was the same as for English except that each word was uttered in each of the four major tones used in Mandarin [the fifth neutral tone form was not included because it is rarely used, comprising only 6% of vowel utterances in Mandarin speech (30)]. Both the English and Mandarin speaking participants also read aloud five monologues that contained 50 words each (Table 3), recording each monologue twice in an emotionally neutral manner. All utterances were recorded in a closed, sound-attenuating chamber by using an Audio-Technica AT4049a omnidirectional capacitor microphone fed into a Marantz (Martel Electronics, Yorba Linda, CA) PMD670 solid-state recorder. The participants followed a series of simple instructions presented graphically, and the quality of their performance was monitored remotely. Sound files were saved to a Scandisk 1 flash memory card in uncompressed digital .wav format at a sampling rate of 22.05 kHz, and transferred from the flash memory card to a Dell Dimension 9150 computer for analysis.
Analysis.
The recorded samples were analyzed by using Praat software (v.4.5) (32). A Praat script was used to generate a text grid and to automatically mark pauses at the onset/offset of each word; vowel identifier and positional information were then inserted manually for each utterance. The text grid was stored with the associated .wav file, and a second script was implemented to extract values (in hertz) for the fundamental frequency, as well as for the first and second formants from a 50-ms segment at the midpoint of each vowel utterance (thus yielding one value for each word uttered; 50 ms is the standard integration window in Praat). The frequency range analyzed was individually adjusted for male and female speakers (5 formants below 5,000 Hz for males, but up to 5,500 Hz for females). To extract the formant values, Praat uses a Gaussian-like window to compute the linear predictive coding coefficients using the algorithm in ref. 33.
For the monologue data, Praat's pitch- and formant-listing utilities were used to extract and time-stamp the F0 (if present), F1, and F2 values at 10-ms intervals. Tracking the formants in this way is necessary in natural speech because of the greater degree of coarticulation compared with the somewhat artificial utterance of single words. The frequencies that define the formants vary less over the mid-region of the vowel nucleus, where the effects of coarticulation are minimal (34). Standard pitch settings were used and the frequency range was set at 75–600 Hz. The formant settings were adjusted in the same manner as was used for the single word condition. Any 10-ms time interval that contained no F0 was removed from the data.
For both the word and monologue data, the nearest harmonic peak to the underlying formant maximum given by Praat was used as an index of the formants: the formant value assigned by linear predictive coding was divided by the fundamental frequency, and the result was rounded to the nearest integer. The ratios of the indices of the first two formants were then calculated as B/A where B = the formant 2 harmonic index and A = formant 1 harmonic index [the data were plotted as log2(B/A), as is conventional]. Ratios were counted as chromatic if they corresponded to just intonation values for the chromatic scale (see Discussion).
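The indexing step described above can be sketched in a few lines. This is a minimal illustration, not the authors' script: the function names are hypothetical, the input frequencies in the usage example are invented for demonstration, and ties are rounded half up because Python's built-in `round` rounds halves to the nearest even integer.

```python
import math


def harmonic_index(formant_hz, f0_hz):
    """Index of the harmonic nearest the formant value: formant / F0, rounded."""
    if f0_hz <= 0:
        raise ValueError("F0 must be positive (frames without F0 are excluded)")
    # Round half up; the text only says "rounded to the nearest integer".
    return math.floor(formant_hz / f0_hz + 0.5)


def formant_interval(f0_hz, f1_hz, f2_hz):
    """Return (A, B, log2(B / A)) for the first two formants."""
    a = harmonic_index(f1_hz, f0_hz)  # A = formant 1 harmonic index
    b = harmonic_index(f2_hz, f0_hz)  # B = formant 2 harmonic index
    if a < 1:
        raise ValueError("first formant lies below the fundamental")
    return a, b, math.log2(b / a)


if __name__ == "__main__":
    # Hypothetical values in hertz, chosen only to illustrate the arithmetic.
    a, b, log_ratio = formant_interval(f0_hz=120.0, f1_hz=610.0, f2_hz=1210.0)
    print(f"A = {a}, B = {b}, log2(B/A) = {log_ratio}")
```

With the hypothetical inputs shown, the first formant maps to the 5th harmonic and the second to the 10th, giving B/A = 2 (an octave) and log2(B/A) = 1.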
Octave Collapse. The perceived similarity of tones an octave apart is so pronounced that it is termed octave equivalence (31). On this basis, we collapsed the results in Tables 1 and 2 into a single octave to allow a more direct comparison of the distribution of intervals found in speech in the two languages being compared.
| Acknowledgements |
|---|
| Footnotes |
|---|
Freely available online through the PNAS open access option.
Author contributions: D.R., J.C., and D.P. designed research; D.R. and J.C. performed research; D.R. and J.C. analyzed data; and D.R., J.C., and D.P. wrote the paper.
The authors declare no conflict of interest.
Schouten, J. F., Fourth International Congress on Acoustics, August 21–28, 1962, Copenhagen, Denmark, 196:201–203.
This article contains supporting information online at www.pnas.org/cgi/content/full/0703140104/DC1.
McGilloway, S., Cowie, R., Douglas-Cowie, E., Gielen, S., Westerdijk, M., Stroeve, S., International Speech Communication Association Tutorial and Research Workshop (ITRW) on Speech and Emotion, September 5–7, 2000, Newcastle, Northern Ireland, U.K., pp. 207–212.
© 2007 by The National Academy of Sciences of the USA
| References |
|---|