Speech identification based on temporal fine structure cues

J Acoust Soc Am. 2008 Jul;124(1):562-75.
doi: 10.1121/1.2918540.

Stanley Sheft et al. J Acoust Soc Am. 2008 Jul.

Abstract

The contribution of temporal fine structure (TFS) cues to consonant identification was assessed in normal-hearing listeners with two speech-processing schemes designed to remove temporal envelope (E) cues. Stimuli were processed vowel-consonant-vowel speech tokens. Derived from the analytic signal, carrier signals were extracted from the output of a bank of analysis filters. The "PM" and "FM" processing schemes estimated a phase- and frequency-modulation function, respectively, of each carrier signal and applied them to a sinusoidal carrier at the analysis-filter center frequency. In the FM scheme, processed signals were further restricted to the analysis-filter bandwidth. A third scheme retaining only E cues from each band was used for comparison. Stimuli processed with the PM and FM schemes were found to be highly intelligible (50-80% correct identification) over a variety of experimental conditions designed to affect the putative reconstruction of E cues subsequent to peripheral auditory filtering. Analysis of confusions between consonants showed that the contribution of TFS cues was greater for place than manner of articulation, whereas the converse was observed for E cues. Taken together, these results indicate that TFS cues convey important phonetic information that is not solely a consequence of E reconstruction.
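The PM and FM processing described in the abstract can be illustrated for a single analysis band. The following is a minimal sketch, not the authors' implementation: it assumes the band has already been filtered, builds the analytic signal with an FFT, and omits the further restriction of FM-processed signals to the analysis-filter bandwidth. The function names `analytic_signal` and `tfs_only`, the sampling rate, and the synthetic test band are all hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Return the analytic signal of a real sequence via the FFT."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0                       # keep DC once
    if n % 2 == 0:
        weights[n // 2] = 1.0              # keep the Nyquist bin once
        weights[1:n // 2] = 2.0            # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def tfs_only(band, fs, fc, scheme="PM"):
    """Discard the envelope of one band; keep its phase or frequency modulation.

    band: real band-limited signal, fs: sampling rate in Hz,
    fc: analysis-filter center frequency in Hz (all assumed inputs).
    """
    phase = np.unwrap(np.angle(analytic_signal(band)))
    carrier_phase = 2.0 * np.pi * fc * np.arange(len(band)) / fs
    pm = phase - carrier_phase             # phase modulation relative to fc
    if scheme == "PM":
        modulation = pm
    elif scheme == "FM":
        # Frequency deviation from fc in Hz, integrated back to phase.
        # The starting phase of the band is lost in this step.
        fm = np.diff(pm, prepend=pm[0]) * fs / (2.0 * np.pi)
        modulation = 2.0 * np.pi * np.cumsum(fm) / fs
    else:
        raise ValueError("scheme must be 'PM' or 'FM'")
    # Apply the modulation to a unit-amplitude sinusoid at fc.
    return np.cos(carrier_phase + modulation)

# Hypothetical band: a 900-Hz carrier with amplitude and frequency modulation.
fs, fc = 16000, 900.0
t = np.arange(4000) / fs
band = (1.0 + 0.8 * np.sin(2 * np.pi * 8 * t)) * np.cos(
    2 * np.pi * fc * t + 2.0 * np.sin(2 * np.pi * 4 * t))
pm_band = tfs_only(band, fs, fc, "PM")
fm_band = tfs_only(band, fs, fc, "FM")
```

In this sketch the PM output equals the original band divided by its Hilbert envelope, so the envelope within the band is flattened while the zero crossings are preserved. Whether E cues are reconstructed after peripheral auditory filtering is a separate question, and it is the one the experimental conditions in the abstract were designed to address.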

Figures

Figure 1
For a narrowband /asa/ speech signal, instantaneous-frequency functions (processing algorithm indicated in the upper right-hand corner of each panel) and long-term magnitude spectra (bottom-right panel). Analysis used a 0.4-octave-wide, third-order zero-phase Butterworth filter centered at 900 Hz. Magnitude spectra: unprocessed speech (dashed line), PMz speech (continuous thin line), PMr speech (continuous line), and FM speech (continuous heavy line). Magnitude spectra were smoothed for clarity of presentation. The two PM magnitude spectra largely overlap.
Figure 2
Assessment of envelope reconstruction in condition 16B. Mean correlation estimates between the original speech envelopes and the envelopes of the stimuli in the PMz (circles), PMr (triangles), and FM (squares) conditions are shown as a function of gammatone-filter CF. Left panel: correlation coefficient; middle panel: depth-dependent correlation estimate; right panel: level-dependent correlation estimate. The number in parentheses below each panel title is the mean correlation across gammatone-filter channels estimating E fidelity in the E condition. In each figure legend, numbers in parentheses correspond to the mean correlation estimate across filter channels in the respective TFS condition.
Figure 3
Mean identification performance (left panel) and percent of information received for each phonetic feature (middle and right panels) as a function of session number for the 16-band PMz (open circles), PMr (filled circles), and FM (filled triangles) TFS-speech conditions. Mean performance averaged across four repeated sessions in the 16-band E-speech condition is indicated with stars on the right side of each panel. Error bars represent one standard error of the mean. A score of 6.25% corresponds to chance identification performance.
Figure 4
Mean identification performance (left panel) and percent of information received for each phonetic feature (middle and right panels) calculated in each speech-processing condition: PMz (black bars), PMr (light gray bars), FM (dark gray bars), E (open bars). In each panel, data are from the 16B [16-band analysis filterbank, 80 dB(A)], HF [16-band analysis filterbank with the five lowest bands removed, 80 dB(A)], 32B [32-band analysis filterbank, 80 dB(A)], and LL [16-band analysis filterbank, 45 dB(A)] conditions. Error bars represent one standard error of the mean.
Figure 5
For experiment 3, assessment of envelope reconstruction in condition 32B. Stimuli were processed with a 32-band analysis filterbank. Otherwise as in Fig. 2, except that the mean correlation estimates for the E condition are omitted because that condition was not run in experiment 3.
Figure 6
For experiment 4, assessment of envelope reconstruction in condition LL. Stimulus level was 45 dB SPL; otherwise as in Fig. 2.
Figure 7
Assessment of fidelity of TFS transmission in condition 16B. Mean correlation estimates computed between the TFS of the original and processed speech stimuli in the PMz (open circles), PMr (open triangles), and FM (open squares) conditions are shown as a function of gammatone-filter CF. Left panel: correlation coefficient; right panel: level-dependent correlation estimate. In each figure legend, numbers in parentheses correspond to the mean correlation estimate computed across gammatone-filter channels.
Figure 8
Assessment of fidelity of TFS transmission in condition 32B. Stimuli were processed with a 32-band analysis filterbank; otherwise as in Fig. 7.
Figure 9
Mean identification scores across listeners as a function of fidelity of E reconstruction (top panels) and TFS transmission (bottom panels). Each panel corresponds to a given correlation estimate. In each panel, individual symbols correspond to a given experimental condition (16B, HF, 32B, LL) and TFS-speech processing scheme (PMz, PMr, FM). Variance of identification scores accounted for by each correlation estimate (R²) is shown in each panel.
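The "percent of information received" plotted for each phonetic feature in Figs. 3 and 4 can be illustrated with a small calculation. The sketch below assumes the conventional relative-transmitted-information measure computed from a confusion matrix (transmitted information divided by stimulus entropy); the page does not state the exact formula used, and the function name and the two-category count matrix are hypothetical. It also shows why 6.25% is chance: that figure corresponds to 16 equally likely response alternatives.

```python
import math

def relative_information(confusions):
    """Transmitted information divided by stimulus entropy.

    confusions: count matrix, rows = presented category,
    columns = responded category (hypothetical layout).
    """
    total = sum(sum(row) for row in confusions)
    row_p = [sum(row) / total for row in confusions]
    col_p = [sum(col) / total for col in zip(*confusions)]
    transmitted = 0.0
    for i, row in enumerate(confusions):
        for j, count in enumerate(row):
            if count:                      # empty cells contribute nothing
                p = count / total
                transmitted += p * math.log2(p / (row_p[i] * col_p[j]))
    stimulus_entropy = -sum(p * math.log2(p) for p in row_p if p > 0)
    return transmitted / stimulus_entropy

# Hypothetical counts for a binary feature (for example, voiced vs. unvoiced).
feature_counts = [[40, 10],
                  [10, 40]]
percent_received = 100.0 * relative_information(feature_counts)

# Chance level for 16 equally likely alternatives, in percent.
chance_percent = 100.0 / 16
```

With these made-up counts, 80% of responses carry the correct feature value, yet only about 28% of the feature information is received. This is why feature-level information scores and percent-correct scores are reported separately.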
