1 Introduction

Audio-Visual Speech Recognition (AVSR) is a fundamental task in multimodal human-computer interfaces, where both the acoustic and the visual information (mouth movements, facial gestures, etc.) produced during speech are taken into account. Several strategies have been proposed in the literature for AVSR [11,12,13], where improvements in the recognition rates are achieved by fusing audio and visual features related to speech. As expected, these improvements are more noticeable when the audio channel is corrupted by noise, a common situation in speech recognition applications. These strategies usually differ in the way the audio and visual information is extracted and combined, and in the model employed to represent the audio-visual information. They are usually classified according to the method employed to combine (or fuse) the audio and visual information, viz., feature level fusion, classifier level fusion and decision level fusion [5].

In feature level fusion (a.k.a. early integration), audio and visual features are combined into a single audio-visual feature vector, which is then employed for the classification task [2, 7]. This strategy is effective when the combined modalities are correlated, since it can exploit the covariations between the audio and visual features. It requires the audio and visual features to be at exactly the same rate and in synchrony, and a dimensionality reduction stage is usually included to keep the dimensionality of the resulting feature vectors manageable. In classifier level fusion (a.k.a. intermediate integration), the information is combined within the classifier using separate audio and visual streams, generating a composite classifier that processes the individual data streams [3, 10, 11]. This strategy has the advantage of being able to handle possible asynchrony between the audio and visual features. In decision level fusion (a.k.a. late integration), independent classifiers are used for each modality and the final decision is computed by combining the likelihood scores associated with each classifier [6, 9]. This strategy does not require strictly synchronized streams. Different techniques to perform decision level fusion have been proposed; the most common is to combine the matching scores of the individual classifiers with simple rules such as max, min, product, sum, or weighted sum, as illustrated in the sketch below.
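As a purely illustrative sketch (not taken from any of the cited works), the following Python snippet shows how per-class score vectors from independent audio and visual classifiers could be combined with these simple rules; the score values, the dictionary size and the weights are hypothetical.

```python
import numpy as np

# Hypothetical per-class score vectors from independent audio and visual
# classifiers (one entry per word in the dictionary).
p_audio  = np.array([0.60, 0.25, 0.10, 0.05])
p_visual = np.array([0.30, 0.45, 0.15, 0.10])

# Simple decision level fusion rules.
rules = {
    "sum":      p_audio + p_visual,
    "product":  p_audio * p_visual,
    "max":      np.maximum(p_audio, p_visual),
    "min":      np.minimum(p_audio, p_visual),
    "weighted": 0.7 * p_audio + 0.3 * p_visual,   # weights chosen arbitrarily here
}

# The input is recognized as the class with the highest fused score.
for name, fused in rules.items():
    print(name, int(np.argmax(fused)))
```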

In this paper, a decision level fusion strategy for audio-visual speech recognition in noisy situations is proposed. The combination of the scores is performed by taking into account the reliability of each modality under different noise conditions. The performance of the proposed recognition system is evaluated over two audio-visual databases, considering two types of acoustic noise and different classification methods.

The rest of this paper is organized as follows. The description of the proposed system is given in Sect. 2, and the databases used for the experiments are described in Sect. 3. In Sect. 4 experimental results are presented and the performance of the proposed strategy is analyzed. Finally, some concluding remarks are included in Sect. 5.

2 Proposed Approach

Figure 1 shows a schematic representation of the proposed audio-visual speech classification system. The recognition is performed by taking into account three classifiers trained on audio, visual and audio-visual information, hereafter referred to as \(\lambda ^a\), \(\lambda ^v\) and \(\lambda ^{av}\), respectively. Given an audio-visual observation \(O_{av}\) associated with the input word to be recognized, which can be partitioned into acoustic and visual parts, denoted as \(O_a\) and \(O_v\), respectively, the probability (or score) vectors \(\mathbf {P}\left( O_a|\lambda ^a\right) \), \(\mathbf {P}\left( O_v|\lambda ^v\right) \) and \(\mathbf {P}\left( O_{av}|\lambda ^{av}\right) \) are computed from the audio, visual and audio-visual classifiers, respectively. Each of these vectors is formed by concatenating the probabilities associated with each class in the dictionary. Then, the fused probability vector \(\mathbf {P}_F\left( O_{av}\right) \) is defined as

$$\begin{aligned} \mathbf {P}_F \left( O_{av}\right) = c_a \, \mathbf {P}\left( O_a|\lambda ^a \right) + c_v \, \mathbf {P}\left( O_v|\lambda ^v\right) + c_{av} \, \mathbf {P}\left( O_{av}|\lambda ^{av}\right) , \end{aligned}$$
(1)

where \(c_a\), \(c_v\) and \(c_{av}\) are the normalized \((c_a+c_v +c_{av}=1)\) reliability coefficients associated with the confidence of the \(\lambda ^a\), \(\lambda ^v\) and \(\lambda ^{av}\) classifiers, respectively. These reliability coefficients take into account the relative efficiency of each modality and are given by the following equations,

$$\begin{aligned} c_a = \frac{\alpha }{\alpha +\beta +\gamma }&,\qquad&\alpha =\mathrm {exp}\left( \frac{(s_a-s_v) + (s_a-s_{av})}{s_a} \right) , \\ c_v = \frac{\beta }{\alpha +\beta +\gamma }&,\qquad&\beta = \mathrm {exp}\left( \frac{(s_v-s_a) + (s_v-s_{av})}{s_v} \right) , \\ c_{av} = \frac{\gamma }{\alpha +\beta +\gamma }&,\qquad&\gamma =\mathrm {exp}\left( \frac{(s_{av}-s_a) + (s_{av}-s_v)}{s_{av}} \right) , \end{aligned}$$

where \(s_a\), \(s_v\) and \(s_{av}\) are the confidence factors associated with the audio, visual and audio-visual classifiers, respectively, which are computed during the training stage as the recognition rates of the classifiers over a training dataset. Finally, the input data is recognized as the class with the maximum fused probability. In order to employ this recognition scheme under different noise conditions, the reliability coefficients \(c_a\), \(c_v\) and \(c_{av}\) are computed for different SNRs.
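As a minimal sketch of this fusion rule (illustrative code, not the authors' implementation), the reliability coefficients and the fused vector of Eq. (1) can be computed as follows, assuming the confidence factors \(s_a\), \(s_v\) and \(s_{av}\) have already been estimated on a training set for the SNR at hand; all variable names are hypothetical.

```python
import numpy as np

def reliability_coefficients(s_a, s_v, s_av):
    """Reliability coefficients (c_a, c_v, c_av) from the training recognition
    rates of the audio, visual and audio-visual classifiers at a given SNR."""
    alpha = np.exp(((s_a - s_v) + (s_a - s_av)) / s_a)
    beta  = np.exp(((s_v - s_a) + (s_v - s_av)) / s_v)
    gamma = np.exp(((s_av - s_a) + (s_av - s_v)) / s_av)
    total = alpha + beta + gamma
    return alpha / total, beta / total, gamma / total

def fuse_and_classify(p_a, p_v, p_av, s_a, s_v, s_av):
    """Fused probability vector of Eq. (1) and the index of the recognized class."""
    c_a, c_v, c_av = reliability_coefficients(s_a, s_v, s_av)
    p_fused = c_a * p_a + c_v * p_v + c_av * p_av
    return p_fused, int(np.argmax(p_fused))

# Example with made-up score vectors and confidence factors for one SNR.
p_fused, predicted = fuse_and_classify(
    p_a=np.array([0.2, 0.5, 0.3]),
    p_v=np.array([0.4, 0.3, 0.3]),
    p_av=np.array([0.3, 0.4, 0.3]),
    s_a=0.55, s_v=0.70, s_av=0.80)
```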

Fig. 1. Schematic representation of the proposed audio-visual speech classification system.

3 Audio-Visual Databases

The performance of the proposed classification scheme is evaluated over two isolated-word audio-visual databases, viz., a database compiled by the authors, hereafter referred to as the AV-UNR database, and the Carnegie Mellon University (AV-CMU) database (now hosted at Cornell University) [1].

(I) AV-UNR Database: The AV-UNR database consists of videos of 16 speakers pronouncing a set of ten words (up, down, right, left, forward, back, stop, save, open and close) 20 times. The audio features are represented by the first eleven non-DC Mel-Cepstral coefficients and their associated first and second derivative coefficients. Visual features are represented by three parameters, viz., mouth height, mouth width and the area between the lips.

(II) AV-CMU Database: The AV-CMU database [1] consists of ten speakers, each saying the digits from 0 to 9 ten times. The audio features are represented by the same parameters as in the AV-UNR database. To represent the visual information, the weighted least-squares parabolic fitting method proposed in [2] is employed in this paper. Visual features are represented by five parameters, viz., the focal parameters of the upper and lower parabolas, the mouth's width and height, and the main angle of the bounding rectangle of the mouth.

4 Experimental Results

The proposed decision level fusion strategy for audio-visual speech recognition is tested separately on the databases described in Sect. 3. To evaluate the performance of the proposed system over each database, audio-visual features are extracted from videos where the acoustic and visual streams are synchronized. The audio signal is partitioned into frames at the same rate as the video frame rate. The audio parameters at a given frame t are represented by the first eleven non-DC Mel-Cepstral coefficients and their associated first and second derivative coefficients, computed from this frame. When audio-visual information is considered, the audio-visual feature vector at frame t is formed by concatenating the acoustic parameters with the visual ones, as sketched below.
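One possible realization of this feature extraction is sketched below, under the assumption that librosa is used for the cepstral analysis; the per-frame visual parameters are taken as given, and the function name and frame rate are illustrative rather than those of the original implementation.

```python
import numpy as np
import librosa

def audio_visual_features(wav_path, visual_params, fps=30):
    """Per-frame audio features (11 non-DC MFCCs + first and second deltas)
    computed at the video frame rate, concatenated with the visual parameters
    of each frame (e.g. mouth height, mouth width and lip area)."""
    y, sr = librosa.load(wav_path, sr=None)
    hop = int(round(sr / fps))                      # one audio frame per video frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=hop)[1:]  # drop C0
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    audio = np.vstack([mfcc, d1, d2]).T             # shape: (n_frames, 33)
    n = min(len(audio), len(visual_params))         # align the two stream lengths
    return np.hstack([audio[:n], visual_params[:n]])
```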

Fig. 2. Recognition rates for different SNRs over the AV-CMU database for the cases of considering additive Gaussian noise (left column) and Babble noise (right column), and HMM (first row), SVM (second row) and RF (third row) classifiers. For each case, the recognition rates obtained with audio, visual and audio-visual classifiers are depicted in red, grey and blue solid lines, respectively. The performance obtained with the proposed fusion rule (CFR) is depicted in green solid lines, while the corresponding performances obtained with sum and product fusion rules are depicted in cyan and magenta dashed lines, respectively. (Color figure online)

To evaluate the recognition rates under noisy acoustic conditions, experiments with additive Gaussian noise and additive Babble noise, with SNRs ranging from \(-10\) dB to 40 dB, were performed. A multispeaker or Babble noise environment is one of the most challenging noise conditions, since the interference is speech from other speakers. This noise is particularly challenging because of its highly time-evolving structure and its similarity to the desired target speech [8]. In this paper, Babble noise samples were extracted from the NOISEX-92 database, compiled by the Digital Signal Processing (DSP) group of Rice University [4]. In addition, the evaluation is performed by considering three different classification methods: Hidden Markov Models (HMM), Random Forests (RF) and Support Vector Machines (SVM). Thus, the proposed confidence-based fusion rule (CFR) is evaluated in twelve different experiments (two databases, two noise types and three classification methods).
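For reference, noise can be added to the clean speech at a prescribed SNR as in the following sketch (an illustration, not the code used in the experiments); a Babble noise segment loaded from NOISEX-92 would take the place of the Gaussian sample.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`
    (in dB) and add it to the clean speech signal. Assumes the noise sample
    is at least as long as the speech."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Gaussian noise example with a placeholder clean signal.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noisy_0db = add_noise_at_snr(speech, rng.standard_normal(len(speech)), snr_db=0)
```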

Fig. 3. Recognition rates for different SNRs over the AV-UNR database for the cases of considering additive Gaussian noise (left column) and Babble noise (right column), and HMM (first row), SVM (second row) and RF (third row) classifiers. For each case, the recognition rates obtained with audio, visual and audio-visual classifiers are depicted in red, grey and blue solid lines, respectively. The performance obtained with the proposed fusion rule (CFR) is depicted in green solid lines, while the corresponding performances obtained with sum and product fusion rules are depicted in cyan and magenta dashed lines, respectively. (Color figure online)

In order to obtain statistically significant results, a 5-fold cross-validation (CV) is performed for each experiment. In the training stage, the classifiers' meta-parameters for each modality are computed with the noise-free training set, while the noisy training set is used to compute the reliability coefficients associated with these classifiers at the different noise levels. Then, the noisy testing set is used to evaluate the performance of the recognition system at each SNR.
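A schematic of this protocol is sketched below, assuming scikit-learn style classifiers that expose predict_proba and reusing the reliability_coefficients helper from the Sect. 2 sketch; all names and data structures are illustrative, not those of the original experiments.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validated_accuracy(clean_feats, noisy_feats_by_snr, labels, make_classifiers):
    """5-fold CV: fit the audio/visual/audio-visual classifiers on clean
    training data, estimate per-SNR reliability coefficients on noisy training
    data, and score the fused system (Eq. (1)) on noisy test data."""
    modalities = ("audio", "visual", "av")
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    acc = {snr: [] for snr in noisy_feats_by_snr}
    for tr, te in kf.split(labels):
        models = {m: make_classifiers()[m].fit(clean_feats[m][tr], labels[tr])
                  for m in modalities}
        classes = models["audio"].classes_
        for snr, feats in noisy_feats_by_snr.items():
            # Confidence factors = recognition rates on the noisy training split.
            s = [models[m].score(feats[m][tr], labels[tr]) for m in modalities]
            c = reliability_coefficients(*s)          # from the Sect. 2 sketch
            # Reliability-weighted sum of per-class score vectors on the test split.
            p = sum(w * models[m].predict_proba(feats[m][te])
                    for w, m in zip(c, modalities))
            acc[snr].append(np.mean(classes[np.argmax(p, axis=1)] == labels[te]))
    return {snr: float(np.mean(v)) for snr, v in acc.items()}
```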

The recognition rates obtained for the experiments over the AV-CMU database at different SNRs are depicted in Fig. 2. As expected, in all six cases the efficiency of the classifier based on audio-only information deteriorates as the SNR decreases, while the efficiency of the visual-only classifier remains constant, since it does not depend on the SNR of the acoustic channel. The efficiency of the audio-visual classifier is better than that of the audio-only classifier, but it also deteriorates at low SNRs. This figure also depicts the performances obtained with the proposed decision level fusion strategy based on the confidence of each modality at different noise conditions (CFR), and those obtained with the sum and product fusion rules. It can be seen that the proposed fusion rule leads to improvements in the recognition rates, which are more noticeable at low SNR levels. The recognition rates obtained for the experiments over the AV-UNR database at different SNRs are depicted in Fig. 3. In this case, it can also be seen that a good performance is achieved with the proposed recognition scheme. In addition, the recognition rates obtained with the proposed method over the AV-CMU database for the case of additive Gaussian noise and different classifiers are compared with the performance reported in [2], computed over the same database and noise condition. This comparison is depicted in Fig. 4. It shows that, for the cases of the RF and SVM classifiers, the proposed method outperforms the one reported in [2], whereas for the case of the HMM classifiers the improvement is more noticeable from 10 dB SNR up to the clean condition. All these experiments show that the proposed decision level fusion rule performs well for different noise conditions, classification methods and databases, in most cases outperforming the typical sum and product fusion rules.

Fig. 4. Comparison between the recognition rates obtained, over the AV-CMU database, with the proposed method using HMM, RF, and SVM classifiers, and the recognition rates reported in [2].

5 Conclusions

In this paper, a decision level fusion strategy for audio-visual speech recognition in noisy situations was proposed. In order to enhance the recognition under different noise conditions, the scores obtained with classifiers trained on different feature sets are fused. The combination of the scores is performed by taking into account the reliability of each modality under different noise conditions. The method was evaluated considering three modalities, viz., audio, visual and audio-visual, but it could be employed with as many modalities as needed. This evaluation was carried out over two isolated-word audio-visual databases, a public one and a database compiled by the authors of this paper. The proposed decision level fusion strategy was also evaluated considering HMM, RF and SVM classifiers. Experimental results showed that a good performance is achieved with the proposed system, leading to improvements in the recognition rates over a wide range of signal-to-noise ratios, in comparison with other methods in the literature.