Abstract
This paper proposes a decision level fusion strategy for audio-visual speech recognition in noisy situations. The method aims to enhance recognition performance under different noise conditions by fusing the scores obtained with classifiers trained on different feature sets. In particular, the method is evaluated considering three modalities, audio, visual and audio-visual, but it could be employed with as many modalities as needed. The combination of the scores is performed by taking into account the reliability of each modality under different noise conditions. The performance of the proposed recognition system is evaluated over two isolated word audio-visual databases, a public one and a database compiled by the authors of this paper. The proposed decision level fusion strategy is evaluated by considering different kinds of classifiers. Experimental results show that a good performance is achieved with the proposed system, leading to improvements in the recognition rates over a wide range of signal-to-noise ratios.
1 Introduction
Audio-Visual Speech Recognition (AVSR) is a fundamental task in multimodal human-computer interfaces, where the acoustic and visual information (mouth movements, facial gestures, etc.) produced during speech are taken into account. Several strategies have been proposed in the literature for AVSR [11,12,13], where improvements of the recognition rates are achieved by fusing audio and visual features related to speech. As expected, these improvements are more noticeable when the audio channel is corrupted by noise, which is a usual situation in speech recognition applications. These strategies usually differ in the way the audio and visual information is extracted and combined, and in the model employed to represent the audio-visual information. These approaches are usually classified according to the method employed to combine (or fuse) the audio and visual information, viz., feature level fusion, classifier level fusion and decision level fusion [5].
In feature level fusion (a.k.a. early integration), audio and visual features are combined to form a single audio-visual feature vector, which is then employed for the classification task [2, 7]. This strategy is effective when the combined modalities are correlated, since it can exploit the covariations between the audio and video features. It requires the audio and visual features to be exactly at the same rate and in synchrony, and usually includes a dimensionality reduction stage, in order to avoid excessively large feature vectors. In the case of classifier level fusion (a.k.a. intermediate integration), the information is combined within the classifier using separate audio and visual streams, in order to generate a composite classifier that processes the individual data streams [3, 10, 11]. This strategy has the advantage of being able to handle possible asynchrony between the audio and visual features. In decision level fusion (a.k.a. late integration), independent classifiers are used for each modality and the final decision is computed from the combination of the likelihood scores associated with each classifier [6, 9]. This strategy does not require strictly synchronized streams. Different techniques to perform decision level fusion have been proposed. The most commonly used is to combine the matching scores of the individual classifiers with simple rules, such as the max, min, product, simple sum, or weighted sum rules, as sketched in the example below.
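For concreteness, the following minimal Python sketch illustrates these standard score-combination rules; it is purely illustrative and not part of any of the cited systems. Each modality contributes one score vector with one entry per word class.

```python
import numpy as np

def fuse_scores(score_vectors, rule="sum", weights=None):
    """Combine per-modality score vectors (rows: modalities,
    columns: word classes) with a simple decision-level rule."""
    scores = np.asarray(score_vectors, dtype=float)
    if rule == "sum":
        fused = scores.sum(axis=0)
    elif rule == "product":
        fused = scores.prod(axis=0)
    elif rule == "max":
        fused = scores.max(axis=0)
    elif rule == "min":
        fused = scores.min(axis=0)
    elif rule == "weighted":
        fused = np.asarray(weights, dtype=float) @ scores  # weighted sum of the score vectors
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(np.argmax(fused)), fused  # recognized class index and fused scores
```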
In this paper, a decision level fusion strategy for audio-visual speech recognition in noisy situations is proposed. The combination of the scores is performed by taking into account the reliability of each modality at different noisy conditions. The performance of the proposed recognition system is evaluated over two audio-visual databases, considering two types of acoustic noise and different classification methods.
The rest of this paper is organized as follows. The description of the proposed system is given in Sect. 2, and the databases used for the experiments are described in Sect. 3. In Sect. 4 experimental results are presented and the performance of the proposed strategy is analyzed. Finally, some concluding remarks are included in Sect. 5.
2 Proposed Approach
Figure 1 shows a schematic representation of the proposed audio-visual speech classification system. The recognition is performed by taking into account three classifiers trained on audio, visual and audio-visual information, hereafter referred to as \(\lambda ^a\), \(\lambda ^v\) and \(\lambda ^{av}\), respectively. Given an audio-visual observation \(O_{av}\) associated with the input word to be recognized, which can be partitioned into acoustic and visual parts, denoted as \(O_a\) and \(O_v\), respectively, the probability (or score) vectors \(\mathbf {P}\left( O_a|\lambda ^a\right) \), \(\mathbf {P}\left( O_v|\lambda ^v\right) \) and \(\mathbf {P}\left( O_{av}|\lambda ^{av}\right) \) are computed from the audio, visual and audio-visual classifiers, respectively. These vectors are formed by the concatenation of the probabilities associated with each class in the dictionary. Then, the fused probability vector \(\mathbf {P}_F\left( O_{av}\right) \) is defined as

$$\mathbf {P}_F\left( O_{av}\right) = c_a\,\mathbf {P}\left( O_a|\lambda ^a\right) + c_v\,\mathbf {P}\left( O_v|\lambda ^v\right) + c_{av}\,\mathbf {P}\left( O_{av}|\lambda ^{av}\right) ,$$
where \(c_a\), \(c_v\) and \(c_{av}\) are the normalized \((c_a+c_v+c_{av}=1)\) reliability coefficients associated with the confidence of the \(\lambda ^a\), \(\lambda ^v\) and \(\lambda ^{av}\) classifiers, respectively. These reliability coefficients are computed by taking into account the relative efficiency of each modality, and they are given by the following equations,

$$c_a = \frac{s_a}{s_a+s_v+s_{av}}, \qquad c_v = \frac{s_v}{s_a+s_v+s_{av}}, \qquad c_{av} = \frac{s_{av}}{s_a+s_v+s_{av}},$$
where \(s_a\), \(s_v\) and \(s_{av}\) are the confidence factors associated with the audio, visual and audio-visual classifiers, which are computed during the training stage as the recognition rates of the corresponding classifiers over a training dataset. Finally, the input data is recognized as the class with the maximum fused probability. In order to employ this recognition scheme under different noise conditions, the reliability coefficients \(c_a\), \(c_v\) and \(c_{av}\) are computed for different SNRs.
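A minimal sketch of this confidence-based fusion rule is shown below, assuming the confidence factors have already been measured as recognition rates on a training set; all function and variable names are illustrative.

```python
import numpy as np

def reliability_coefficients(s_a, s_v, s_av):
    """Normalize the per-modality confidence factors so that the
    reliability coefficients sum to one."""
    total = s_a + s_v + s_av
    return s_a / total, s_v / total, s_av / total

def cfr_decision(p_a, p_v, p_av, s_a, s_v, s_av):
    """Fuse the audio, visual and audio-visual probability vectors and
    return the class with maximum fused probability."""
    c_a, c_v, c_av = reliability_coefficients(s_a, s_v, s_av)
    p_fused = (c_a * np.asarray(p_a)
               + c_v * np.asarray(p_v)
               + c_av * np.asarray(p_av))
    return int(np.argmax(p_fused)), p_fused
```

In practice one triplet \((s_a, s_v, s_{av})\) is stored per SNR level, and the coefficients corresponding to the SNR of the input utterance are used at recognition time.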
3 Audio-Visual Databases
The performance of the proposed classification scheme is evaluated over two isolated word audio-visual databases, viz., a database compiled by the authors, hereafter referred to as the AV-UNR database, and the Carnegie Mellon University (AV-CMU) database (now hosted at Cornell University) [1].
(I) AV-UNR Database: The AV-UNR database consists of videos of 16 speakers, pronouncing a set of ten words (up, down, right, left, forward, back, stop, save, open and close) 20 times. The audio features are represented by the first eleven non-DC Mel-Cepstral coefficients and their associated first and second derivative coefficients. Visual features are represented by three parameters, viz., mouth height, mouth width and the area between the lips.
(II) AV-CMU Database: The AV-CMU database [1] consists of ten speakers, each saying the digits from 0 to 9 ten times. The audio features are represented by the same parameters as in the AV-UNR database. To represent the visual information, the weighted least-squares parabolic fitting method proposed in [2] is employed in this paper. Visual features are represented by five parameters, viz., the focal parameters of the upper and lower parabolas, the mouth's width and height, and the main angle of the bounding rectangle of the mouth.
4 Experimental Results
The proposed decision level fusion strategy for audio-visual speech recognition is tested separately on the databases described in Sect. 3. To evaluate the performance of the proposed system over each database, audio-visual features are extracted from videos where the acoustic and visual streams are synchronized. The audio signal is partitioned into frames at the same rate as the video frame rate. The audio parameters at a given frame t are represented by the first eleven non-DC Mel-Cepstral coefficients, and their associated first and second derivative coefficients, computed from this frame. When audio-visual information is considered, the audio-visual feature vector at frame t is formed by concatenating the acoustic parameters with the visual ones.
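As a rough illustration of this front end (the exact feature extraction pipeline is not detailed in the paper), the sketch below computes MFCCs at the video frame rate with librosa, discards the 0th (DC-related) coefficient, appends first and second derivatives, and concatenates frame-synchronous visual parameters; the `visual_params` array is assumed to be supplied by the lip-tracking stage.

```python
import numpy as np
import librosa

def audio_visual_features(audio, sr, visual_params, video_fps=30):
    """Build per-frame audio-visual feature vectors.

    visual_params: array of shape (n_frames, n_visual) from the lip tracker.
    video_fps: assumed video frame rate; adjust to the actual recordings.
    """
    hop = int(round(sr / video_fps))                   # one audio frame per video frame
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=12, hop_length=hop)
    mfcc = mfcc[1:, :]                                 # keep the first eleven non-DC coefficients
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)]).T   # (n_frames, 33)
    n = min(len(feats), len(visual_params))            # align the two stream lengths
    return np.hstack([feats[:n], visual_params[:n]])
```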
To evaluate the recognition rates under noisy acoustic conditions, experiments with additive Gaussian and additive Babble noise, with SNRs ranging from \(-10\) dB to 40 dB, were performed. A multispeaker or Babble noise environment is one of the most challenging noise conditions, since the interference is speech from other speakers. This noise is uniquely challenging because of its highly time-evolving structure and its similarity to the desired target speech [8]. In this paper, Babble noise samples were extracted from the NOISEX-92 database, compiled by the Digital Signal Processing (DSP) group of Rice University [4]. In addition, the evaluation is performed by considering three different classification methods: Hidden Markov Models (HMMs), Random Forests (RFs) and Support Vector Machines (SVMs). Thus, the proposed confidence-based fusion rule (CFR) is evaluated in twelve different experiments (two databases, two noise conditions and three classification methods).
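The corruption of the acoustic signal at a prescribed SNR can be sketched as follows; the exact mixing procedure used in the experiments is not specified in the paper, so this is a common, generic formulation.

```python
import numpy as np

def add_noise(clean, snr_db, noise=None, rng=None):
    """Return clean + noise scaled to the requested SNR (in dB).
    If no noise signal (e.g. a Babble excerpt) is given, white
    Gaussian noise is used instead."""
    rng = np.random.default_rng() if rng is None else rng
    if noise is None:
        noise = rng.standard_normal(len(clean))
    else:
        noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```

The same helper covers both conditions: passing a Babble excerpt as `noise` scales it to the target SNR, while omitting it falls back to white Gaussian noise.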
In order to obtain statistically significant results, a 5-fold cross-validation (CV) is performed for each experiment. In the training stage, the classifiers' meta-parameters for each modality are computed with the training set without noise, while the noisy training set is used to compute the reliability coefficients associated with these classifiers at the different noise levels. Then, the noisy testing set is used to evaluate the performance of the recognition system at each SNR.
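A compact sketch of this reliability estimation step is given below; the classifier objects and their `predict` interface are hypothetical placeholders for the trained HMM, RF or SVM models of each modality.

```python
import numpy as np

def estimate_reliability_coefficients(classifiers, noisy_train_sets, labels):
    """For each SNR, use each modality's recognition rate on the noisy
    training set as its confidence factor, then normalize to obtain the
    reliability coefficients used by the fusion rule."""
    coefficients = {}
    for snr, modality_features in noisy_train_sets.items():
        rates = [float(np.mean(clf.predict(feats) == labels))   # recognition rate per modality
                 for clf, feats in zip(classifiers, modality_features)]
        total = sum(rates)
        coefficients[snr] = [r / total for r in rates]
    return coefficients
```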
The recognition rates obtained for the experiments over the AV-CMU database at different SNRs are depicted in Fig. 2. As expected, for the six cases, the efficiency of the classifier based on audio-only information deteriorates as the SNR decreases, while the efficiency of the visual-only classifier remains constant, since it does not depend on the SNR in the acoustic channel. The efficiency of the audio-visual classifier is better than that of the audio classifier, but it also deteriorates at low SNRs. This figure also depicts the performances obtained with the proposed decision level fusion strategy based on the confidence of each modality at different noise conditions (CFR), and those obtained with the sum and product fusion rules. It can be seen that the proposed fusion rule leads to improvements in the recognition rates, which are more noticeable at low SNR levels. The recognition rates obtained for the experiments over the AV-UNR database at different SNRs are depicted in Fig. 3. In this case, it can also be seen that a good performance is achieved with the proposed recognition scheme. In addition, the recognition rates obtained with the proposed method over the AV-CMU database, for the case of Gaussian additive noise and the different classifiers, are compared with the performance reported in [2], computed over the same database and noise condition. This comparison is depicted in Fig. 4. It shows that, for the RF and SVM classifiers, the proposed method outperforms the one reported in [2], whereas for the HMM classifiers the improvement is most noticeable from 10 dB SNR up to the clean condition. All these experiments show that the proposed decision level fusion rule performs well for different noise conditions, classification methods and databases, in most cases outperforming the typical sum and product fusion rules.
5 Conclusions
In this paper, a decision level fusion strategy for audio-visual speech recognition in noisy situations was proposed. In order to enhance the recognition over different noise conditions, the scores obtained with classifiers trained on different feature sets are fused. The combination of the scores is performed by taking into account the reliability of each modality under different noise conditions. The method was evaluated considering three modalities, audio, visual and audio-visual, but it could be employed with as many modalities as needed. This evaluation was carried out over two isolated word audio-visual databases, a public one and a database compiled by the authors of this paper. The proposed decision level fusion strategy was also evaluated by considering HMM, RF and SVM classifiers. Experimental results showed that a good performance is achieved with the proposed system, leading to improvements in the recognition rates over a wide range of signal-to-noise ratios, in comparison with another method in the literature.
References
Advanced Multimedia Processing Laboratory. Carnegie Mellon University, Pittsburgh. http://chenlab.ece.cornell.edu/projects/AudioVisualSpeechProcessing/
Borgström, B., Alwan, A.: A low-complexity parabolic lip contour model with speaker normalization for high-level feature extraction in noise-robust audiovisual speech recognition. IEEE Trans. Syst. Man Cybern. 38(6), 1273–1280 (2008)
Chu, S.M., Huang, T.S.: Audio-visual speech fusion using coupled hidden Markov models. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2007)
Digital Signal Processing (DSP) Group. Rice University: NOISEX-92 Database, Houston
Dupont, S., Luettin, J.: Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimedia 2(3), 141–151 (2000)
Estellers, V., Gurban, M., Thiran, J.: On dynamic stream weighting for audio-visual speech recognition. IEEE Trans. Audio Speech Lang. Process. 20(4), 1145–1157 (2012)
Kashiwagi, Y., Suzuki, M., Minematsu, N., Hirose, K.: Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition. In: 2012 IEEE Spoken Language Technology Workshop (SLT), Miami, FL, USA, 2–5 December 2012, pp. 149–152 (2012)
Krishnamurthy, N., Hansen, J.: Babble noise: modeling, analysis, and applications. IEEE Trans. Audio Speech Lang. Process. 17(7), 1394–1407 (2009)
Lee, J.S., Park, C.H.: Robust audio-visual speech recognition based on late integration. IEEE Trans. Multimedia 10(5), 767–779 (2008)
Nefian, A.V., Liang, L., Pi, X., Xiaoxiang, L., Mao, C., Murphy, K.: A coupled HMM for audio-visual speech recognition. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP 2002), pp. 2013–2016 (2002)
Papandreou, G., Katsamanis, A., Pitsikalis, V., Maragos, P.: Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Trans. Audio Speech Lang. Process. 17(3), 423–435 (2009)
Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W.: Recent advances in the automatic recognition of audio-visual speech. Proc. IEEE 91, 1306–1326 (2003)
Shivappa, S., Trivedi, M., Rao, B.: Audiovisual information fusion in human computer interfaces and intelligent environments: a survey. Proc. IEEE 98(10), 1692–1715 (2010)