The interrelationship between the face and vocal tract configuration during audiovisual speech
- PMID: 33293422
- PMCID: PMC7768679
- DOI: 10.1073/pnas.2006192117
Abstract
It is well established that speech perception is improved when we are able to see the speaker talking along with hearing their voice, especially when the speech is noisy. While we have a good understanding of where speech integration occurs in the brain, it is unclear how visual and auditory cues are combined to improve speech perception. One suggestion is that integration can occur as both visual and auditory cues arise from a common generator: the vocal tract. Here, we investigate whether facial and vocal tract movements are linked during speech production by comparing videos of the face and fast magnetic resonance (MR) image sequences of the vocal tract. The joint variation in the face and vocal tract was extracted using an application of principal components analysis (PCA), and we demonstrate that MR image sequences can be reconstructed with high fidelity using only the facial video and PCA. Reconstruction fidelity was significantly higher when images from the two sequences corresponded in time, and including implicit temporal information by combining contiguous frames also led to a significant increase in fidelity. A "Bubbles" technique was used to identify which areas of the face were important for recovering information about the vocal tract, and vice versa, on a frame-by-frame basis. Our data reveal that there is sufficient information in the face to recover vocal tract shape during speech. In addition, the facial and vocal tract regions that are important for reconstruction are those that are used to generate the acoustic speech signal.
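The joint-PCA reconstruction described above can be sketched in code. This is an illustrative toy example only, not the authors' pipeline: the data are synthetic, the dimensions and variable names are assumptions, and component scores are estimated from the face features by least squares against the face portion of the joint loadings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for real data: one row per paired video/MR frame,
# columns are flattened pixel features (dimensions chosen arbitrarily).
# A shared low-dimensional "latent" drives both modalities, mimicking the
# common generator (the vocal tract) behind face and MR images.
n_frames, face_dim, vt_dim, n_latent = 200, 300, 150, 5
latent = rng.standard_normal((n_frames, n_latent))
face = latent @ rng.standard_normal((n_latent, face_dim))
vt = latent @ rng.standard_normal((n_latent, vt_dim))

# Joint PCA: stack face and vocal-tract features frame by frame and
# take principal components of the centred joint data matrix via SVD.
X = np.hstack([face, vt])
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
W = Vt[:n_latent]                       # joint components, (k, face_dim + vt_dim)
W_face, W_vt = W[:, :face_dim], W[:, face_dim:]

# Reconstruction from the face alone: estimate each frame's component
# scores by least squares against the face part of the loadings, then
# project those scores through the vocal-tract part of the loadings.
face_c = face - mu[:face_dim]
scores, *_ = np.linalg.lstsq(W_face.T, face_c.T, rcond=None)
vt_hat = scores.T @ W_vt + mu[face_dim:]

# Reconstruction fidelity: correlation between true and reconstructed
# vocal-tract frames (here near 1, since the toy data are exactly low rank).
r = np.corrcoef(vt.ravel(), vt_hat.ravel())[0, 1]
print(f"reconstruction fidelity r = {r:.3f}")
```

With real image sequences the joint variation is only approximately low rank, so fidelity is below 1 and, as the abstract notes, depends on temporal correspondence between the two image streams.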
Keywords: PCA; audiovisual; speech.
Copyright © 2020 the Author(s). Published by PNAS.
Conflict of interest statement
The authors declare no competing interest.
Similar articles
- Processing communicative facial and vocal cues in the superior temporal sulcus. Neuroimage. 2020 Nov 1;221:117191. doi: 10.1016/j.neuroimage.2020.117191. PMID: 32711066
- Mouth and Voice: A Relationship between Visual and Auditory Preference in the Human Superior Temporal Sulcus. J Neurosci. 2017 Mar 8;37(10):2697-2708. doi: 10.1523/JNEUROSCI.2914-16.2017. PMID: 28179553. Free PMC article.
- Can you McGurk yourself? Self-face and self-voice in audiovisual speech. Psychon Bull Rev. 2012 Feb;19(1):66-72. doi: 10.3758/s13423-011-0176-8. PMID: 22033983
- Some behavioral and neurobiological constraints on theories of audiovisual speech integration: a review and suggestions for new directions. Seeing Perceiving. 2011;24(6):513-539. doi: 10.1163/187847611X595864. PMID: 21968081. Free PMC article. Review.
- The importance of vocal affect to bimodal processing of emotion: implications for individuals with traumatic brain injury. J Commun Disord. 2009 Jan-Feb;42(1):1-17. doi: 10.1016/j.jcomdis.2008.06.001. PMID: 18692197. Review.
Cited by
- Neural indicators of articulator-specific sensorimotor influences on infant speech perception. Proc Natl Acad Sci U S A. 2021 May 18;118(20):e2025043118. doi: 10.1073/pnas.2025043118. PMID: 33980713. Free PMC article.
- Faces and Voices Processing in Human and Primate Brains: Rhythmic and Multimodal Mechanisms Underlying the Evolution and Development of Speech. Front Psychol. 2022 Mar 30;13:829083. doi: 10.3389/fpsyg.2022.829083. PMID: 35432052. Free PMC article. Review.
- Modulation transfer functions for audiovisual speech. PLoS Comput Biol. 2022 Jul 19;18(7):e1010273. doi: 10.1371/journal.pcbi.1010273. PMID: 35852989. Free PMC article.
- A PCA-Based Active Appearance Model for Characterising Modes of Spatiotemporal Variation in Dynamic Facial Behaviours. Front Psychol. 2022 May 26;13:880548. doi: 10.3389/fpsyg.2022.880548. PMID: 35719501. Free PMC article.