Abstract
Output-based instrumental speech quality assessment relies only on the received (processed) signal to predict quality. Such methods are called non-intrusive and are crucial in speech applications where clean reference signals are not accessible. In this paper, we propose a new non-intrusive instrumental quality measure based on the similarity between two i-vectors. As the clean reference signal is not available, the reference i-vector representation cannot be extracted directly from it. Therefore, we propose the use of a clean speech Gaussian mixture model to estimate the clean speech spectrum from its degraded counterpart. Next, the two respective i-vector representations are extracted, and either the cosine or the Euclidean similarity metric is computed as a correlate of speech quality. Here, the clean speech model is trained using RASTA-filtered mel-frequency cepstral coefficients extracted from a pool of clean speech files, thus allowing us to attain a model of clean spectrum characteristics. The proposed method is evaluated on noisy, reverberant, and enhanced speech conditions. Experimental results show that the proposed system provides higher correlations with perceptual speech quality than several benchmark non-intrusive measures, especially for noisy and enhanced speech.
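The comparison step described above can be illustrated with a minimal sketch. It assumes the two i-vectors (one from the estimated clean speech, one from the degraded speech) have already been extracted and are available as plain lists of floats. The function names, the toy 4-dimensional vectors, and the use of Euclidean distance as the Euclidean-based measure are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy vectors standing in for real i-vectors, which
# typically have a few hundred dimensions.
w_ref = [0.5, -1.0, 0.25, 2.0]   # from the estimated clean speech
w_deg = [0.4, -0.8, 0.50, 1.5]   # from the degraded speech

print("cosine similarity: ", cosine_similarity(w_ref, w_deg))
print("Euclidean distance:", euclidean_distance(w_ref, w_deg))
```

Under these assumptions, a cosine similarity close to 1 (or a small Euclidean distance) indicates that the degraded signal's i-vector lies close to that of the estimated clean speech, which is the property used here as a correlate of quality.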
Notes
In the case of P.563, speech samples were resampled to 8 kHz.
This is an out-of-scope use of POLQA, as the input and reference signals are wideband, whereas POLQA expects super-wideband reference signals [36].
References
Liotou E et al (2015) Quality of experience management in mobile cellular networks: key issues and design challenges. IEEE Commun Mag 53(7):145–153
Cauchi B et al (2016) Perceptual and instrumental evaluation of the perceived level of reverberation. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 629–633. IEEE
Gastaldo P, Zunino R, Redi J (2013) Supporting visual quality assessment with machine learning. EURASIP J Image Video Process 2013(1):54
Issa O et al (2012) Quality-of-experience perception for video streaming services: Preliminary subjective and objective results. In Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, pp 1–9. IEEE
Jin C, Kubichek R (1996) Vector quantization techniques for output-based objective speech quality. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 1, pp 491–494. IEEE
Möller S et al (2011) Speech quality estimation: models and trends. IEEE Signal Process Mag 28(6):18–28
ITU-T Recommendation P.563. (2004) Single-ended method for objective speech quality assessment in narrow-band telephony applications
Falk TH et al (2015) Objective quality and intelligibility prediction for users of assistive listening devices: advantages and limitations of existing tools. IEEE Signal Process Mag 32(2):114–124
Avila AR et al (2019) Non-intrusive speech quality assessment using neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 631–635. IEEE
Gamper H et al (2019) Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp 85–89. IEEE
Soni MH, Patil HA (2016) Novel subband autoencoder features for non-intrusive quality assessment of noise suppressed speech. In INTERSPEECH, pp 3708–3712
Cauchi B et al (2019) Non-intrusive speech quality prediction using modulation energies and LSTM-network. IEEE/ACM Trans Audio Speech Lang Process 27(7):1151–1163
Avila AR et al (2019) Blind channel response estimation for replay attack detection. Proc. Interspeech 2893–2897
Avila AR et al (2019) Intrusive quality measurement of noisy and enhanced speech based on i-vector similarity. In Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp 1–5. IEEE
Hermansky H, Morgan N (1994) RASTA processing of speech. IEEE Trans Speech Audio Process 2(4):578–589
Falk TH, Chan W-Y (2006) Single-ended speech quality measurement using machine learning methods. IEEE Trans Audio Speech Lang Process 14(6):1935–1947
Gaubitch ND, Brookes M, Naylor PA (2013) Blind channel magnitude response estimation in speech using spectrum classification. IEEE Trans Audio Speech Lang Process 21(10):2162–2171
Kenny P et al (2007) Joint factor analysis versus eigenchannels in speaker recognition. IEEE Trans Audio Speech Lang Process 15(4):1435–1447
Dehak N et al (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798
Hansen JHL, Hasan T (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process Mag 32(6):74–99
Kenny P, Boulianne G, Dumouchel P (2005) Eigenvoice modeling with sparse training data. IEEE Trans Speech Audio Process 13(3):345–354
Garcia-Romero D, Espy-Wilson CY (2011) Analysis of i-vector length normalization in speaker recognition systems. In Twelfth Annual Conference of the International Speech Communication Association
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(Nov)
ITU-R Recommendation BS.1534-3. (2014) Method for the subjective assessment of intermediate quality level of audio systems. International Telecommunication Union Radiocommunication Assembly
Santos JF, Falk TH (2019) Towards the development of a non-intrusive objective quality measure for DNN-enhanced speech. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp 1–6. IEEE
Schoeffler M et al (2018) webMUSHRA - a comprehensive framework for web-based listening tests. J Open Res Softw 6(1)
Valentini-Botinhao C et al (2017) Noisy speech database for training speech enhancement algorithms and TTS models. University of Edinburgh. School of Informatics, Centre for Speech Technology Research (CSTR)
Pascual S, Bonafonte A, Serrà J (2017) SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452
Veaux C, Yamagishi J, King S (2013) The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), pp 1–4. IEEE
Varga A, Steeneken HJM (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 12(3):247–251
Santos JF, Falk TH (2018) Speech dereverberation with context-aware recurrent neural networks. IEEE/ACM Trans Audio Speech Lang Process 26(7):1236–1246
Williamson DS, Wang D (2017) Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans Audio Speech Lang Process 25(7):1492–1501
Wu B et al (2016) A reverberation-time-aware approach to speech dereverberation based on deep neural networks. IEEE/ACM Trans Audio Speech Lang Process 25(1):102–111
ITU-T Recommendation P.862. (2001) Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs
Rix A, Beerends J, Hollier M, Hekstra A (2001) Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. pp 749–752
ITU-T Recommendation P.863. (2008) Perceptual objective listening quality assessment: An advanced objective perceptual method for end-to-end listening speech quality evaluation of fixed, mobile, and IP-based networks and speech codecs covering narrowband, wideband, and super-wideband signals. Technical report
Beerends J et al (2013) Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part II: Perceptual model. Audio Eng Soc 61(6)
Ma J, Hu Y, Loizou PC (2009) Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. J Acoust Soc Am 125(5):3387–3405
Janssen JH (1957) A method for the calculation of the speech intelligibility under conditions of reverberation and noise. Acta Acustica united with Acustica 7(5):305–310
Taal CH et al (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. pp 4214–4217
Malfait L, Berger J, Kastner M (2006) P.563 - the ITU-T standard for single-ended speech quality assessment. IEEE Trans Audio Speech Lang Process 14(6):1924–1934
ITU-T Recommendation P.830. (1996) Subjective performance assessment of telephone-band and wideband digital codecs
Falk TH, Zheng C, Chan W-Y (2010) A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech. IEEE Trans Audio Speech Lang Process 18(7):1766–1774
Santos JF, Senoussaoui M, Falk TH (2014) An improved non-intrusive intelligibility metric for noisy and reverberant speech. pp 55–59
Fu SW et al (2018) Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM. arXiv preprint arXiv:1808.05344
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436
ITU-T Recommendation P.800. (1998) Methods for subjective determination of transmission quality
Rix AW (2003) Comparison between subjective listening quality and P.862 PESQ score. Proc. Measurement of Speech and Audio Quality in Networks (MESAQIN’03), Prague, Czech Republic
Shcherbakov MV et al (2013) A survey of forecast error measures. World Appl Sci J 24(24):171–176
ITU-T Recommendation P.862.1. (2003) Mapping function for transforming P.862 raw result scores to MOS-LQO
Falk TH, Chan W-Y (2010) Temporal dynamics for blind measurement of room acoustical parameters. IEEE Trans Instrum Meas 59(4):978–989
Kenny P et al (2008) A study of interspeaker variability in speaker verification. IEEE Trans Audio Speech Lang Process 16(5):980–988
ITU-T Recommendation P.835. (2003) Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm. International Telecommunication Union, Geneva
Cauchi B et al (2015) Combination of MVDR beamforming and single-channel spectral processing for enhancing noisy and reverberant speech. EURASIP J Adv Signal Process 2015:1–12
Thiemann J et al (2016) Speech enhancement for multimicrophone binaural hearing aids aiming to preserve the spatial auditory scene. EURASIP J Adv Signal Process 2016(1):12
Acknowledgements
The authors would like to thank the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), the Fonds de recherche du Québec - Nature et Technologies (FRQNT), and the Natural Sciences and Engineering Research Council of Canada (NSERC) for their financial support.
Cite this article
Avila, A.R., O’Shaughnessy, D. & Falk, T.H. Non-intrusive speech quality prediction based on the blind estimation of clean speech and the i-vector framework. Qual User Exp 5, 11 (2020). https://doi.org/10.1007/s41233-020-00040-3