Non-intrusive speech quality prediction based on the blind estimation of clean speech and the i-vector framework

  • Research Article
Quality and User Experience

Abstract

Output-based instrumental speech quality assessment relies only on the received (processed) signal to predict quality. Such methods are called non-intrusive and are crucial in speech applications where clean reference signals are not accessible. In this paper, we propose a new non-intrusive instrumental quality measure based on the similarity between two i-vectors. As the clean reference signal is not available, the reference i-vector representation cannot be extracted directly from it. Therefore, we propose the use of a clean speech Gaussian mixture model to estimate the clean speech spectrum from its degraded counterpart. Next, the two respective i-vector representations are extracted, and either the cosine or the Euclidean similarity metric is computed as a correlate of speech quality. Here, the clean speech model is trained using RASTA-filtered mel-frequency cepstral coefficients extracted from a pool of clean speech files, thus allowing us to attain a model of clean spectrum characteristics. The proposed method is evaluated on noisy, reverberant, and enhanced speech conditions. Experimental results show that the proposed system provides higher correlations with perceptual speech quality than several benchmark non-intrusive measures, especially for noisy and enhanced speech.
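
The final comparison step described above can be illustrated with a minimal sketch. It assumes the two i-vectors (one from the degraded signal, one from the blindly estimated clean spectrum) have already been extracted; the extraction itself, the GMM-based clean speech estimation, and any mapping from similarity to a quality score are not shown. The function names and the 4-dimensional example vectors are hypothetical placeholders, not the authors' implementation or data.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical stand-ins for i-vectors (assumption: real i-vectors
# typically have hundreds of dimensions and come from a trained extractor).
w_degraded = [0.9, -0.2, 0.4, 0.1]        # from the degraded signal
w_clean_estimate = [1.0, -0.1, 0.5, 0.0]  # from the estimated clean spectrum

cos_score = cosine_similarity(w_degraded, w_clean_estimate)
euc_score = euclidean_distance(w_degraded, w_clean_estimate)
print(f"cosine similarity:  {cos_score:.4f}")
print(f"Euclidean distance: {euc_score:.4f}")
```

Under this sketch, a degraded signal whose i-vector stays close to that of the estimated clean speech yields a cosine similarity near 1 and a small Euclidean distance; how such values relate to perceived quality is established experimentally in the article, not by this snippet.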

Notes

  1. In the case of P.563, speech samples were resampled to 8 kHz.

  2. This is an out-of-scope usage of POLQA: the input and reference signals are wideband, whereas POLQA expects super-wideband reference signals [36].

References

  1. Liotou E et al (2015) Quality of experience management in mobile cellular networks: key issues and design challenges. IEEE Commun Mag 53(7):145–153

  2. Cauchi B et al (2016) Perceptual and instrumental evaluation of the perceived level of reverberation. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 629–633. IEEE

  3. Gastaldo P, Zunino R, Redi J (2013) Supporting visual quality assessment with machine learning. EURASIP J Image Video Process 2013(1):54

  4. Issa O et al (2012) Quality-of-experience perception for video streaming services: Preliminary subjective and objective results. In Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, pp 1–9. IEEE

  5. Jin C, Kubichek R (1996) Vector quantization techniques for output-based objective speech quality. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 1, pp 491–494. IEEE

  6. Möller S et al (2011) Speech quality estimation: models and trends. IEEE Signal Process Mag 28(6):18–28

  7. ITU-T Recommendation P.563. (2004) Single-ended method for objective speech quality assessment in narrow-band telephony applications

  8. Falk TH et al (2015) Objective quality and intelligibility prediction for users of assistive listening devices: advantages and limitations of existing tools. IEEE Signal Process Mag 32(2):114–124

  9. Avila AR et al (2019) Non-intrusive speech quality assessment using neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 631–635. IEEE

  10. Gamper H et al (2019) Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp 85–89. IEEE

  11. Soni MH, Patil HA (2016) Novel subband autoencoder features for non-intrusive quality assessment of noise suppressed speech. In INTERSPEECH, pp 3708–3712

  12. Cauchi B et al (2019) Non-intrusive speech quality prediction using modulation energies and LSTM-network. IEEE/ACM Trans Audio Speech Lang Process 27(7):1151–1163

  13. Avila AR et al (2019) Blind channel response estimation for replay attack detection. Proc. Interspeech 2893–2897

  14. Avila AR et al (2019) Intrusive quality measurement of noisy and enhanced speech based on i-vector similarity. In Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp 1–5. IEEE

  15. Hermansky H, Morgan N (1994) Rasta processing of speech. IEEE Trans Speech Audio Process 2(4):578–589

  16. Falk TH, Chan W-Y (2006) Single-ended speech quality measurement using machine learning methods. IEEE Trans Audio Speech Lang Process 14(6):1935–1947

  17. Gaubitch ND, Brookes M, Naylor PA (2013) Blind channel magnitude response estimation in speech using spectrum classification. IEEE Trans Audio Speech Lang Process 21(10):2162–2171

  18. Kenny P et al (2007) Joint factor analysis versus eigenchannels in speaker recognition. IEEE Trans Audio Speech Lang Process 15(4):1435–1447

  19. Dehak N et al (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798

  20. Hansen JHL, Hasan T (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process Mag 32(6):74–99

  21. Kenny P, Boulianne G, Dumouchel P (2005) Eigenvoice modeling with sparse training data. IEEE Trans Speech Audio Process 13(3):345–354

  22. Garcia-Romero D, Espy-Wilson CY (2011) Analysis of i-vector length normalization in speaker recognition systems. In Twelfth Annual Conference of the International Speech Communication Association

  23. Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov)

  24. ITU-R Recommendation BS.1534-3. (2014) Method for the subjective assessment of intermediate quality level of audio systems. International Telecommunication Union Radiocommunication Assembly

  25. Santos JF, Falk TH (2019) Towards the development of a non-intrusive objective quality measure for DNN-enhanced speech. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp 1–6. IEEE

  26. Schoeffler M et al (2018) webmushra-a comprehensive framework for web-based listening tests. J Open Res Softw 6(1)

  27. Valentini-Botinhao C et al (2017) Noisy speech database for training speech enhancement algorithms and TTS models. University of Edinburgh. School of Informatics, Centre for Speech Technology Research (CSTR)

  28. Pascual S, Bonafonte A, Serrà J (2017) SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452

  29. Veaux C, Yamagishi J, King S (2013) The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), pp 1–4. IEEE

  30. Varga A, Steeneken HJM (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 12(3):247–251

  31. Santos JF, Falk TH (2018) Speech dereverberation with context-aware recurrent neural networks. IEEE/ACM Trans Audio Speech Lang Process 26(7):1236–1246

  32. Williamson DS, Wang D (2017) Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans Audio Speech Lang Process 25(7):1492–1501

  33. Wu B et al (2016) A reverberation-time-aware approach to speech dereverberation based on deep neural networks. IEEE/ACM Trans Audio Speech Lang Process 25(1):102–111

  34. ITU-T Recommendation P.862. Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, February 2001

  35. Rix A, Beerends J, Hollier M, Hekstra A (2001) Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. pp 749–752

  36. ITU-T Recommendation P.863. (2008) Perceptual objective listening quality assessment: An advanced objective perceptual method for end-to-end listening speech quality evaluation of fixed, mobile, and IP-based networks and speech codecs covering narrowband, wideband, and super-wideband signals. Technical report

  37. Beerends J et al (2013) Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part II: Perceptual model. J Audio Eng Soc 61(6)

  38. Ma J, Hu Y, Loizou PC (2009) Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. J Acoust Soc Am 125(5):3387–3405

  39. Janssen JH (1957) A method for the calculation of the speech intelligibility under conditions of reverberation and noise. Acta Acustica united with Acustica 7(5):305–310

  40. Taal CH et al (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. pp 4214–4217

  41. Malfait L, Berger J, Kastner M (2006) P.563 - the ITU-T standard for single-ended speech quality assessment. IEEE Trans Audio Speech Lang Process 14(6):1924–1934

  42. ITU-T Recommendation P.830. (1996) Subjective performance assessment of telephone-band and wideband digital codecs

  43. Falk TH, Zheng C, Chan W-Y (2010) A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech. IEEE Trans Audio Speech Lang Process 18(7):1766–1774

  44. Santos JF, Senoussaoui M, Falk TH (2014) An improved non-intrusive intelligibility metric for noisy and reverberant speech. pp 55–59

  45. Fu SW et al (2018) Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM. arXiv preprint arXiv:1808.05344

  46. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436

  47. ITU-T Recommendation P.800. (1998) Methods for subjective determination of transmission quality

  48. Rix AW (2003) Comparison between subjective listening quality and P.862 PESQ score. Proc. Measurement of Speech and Audio Quality in Networks (MESAQIN’03), Prague, Czech Republic

  49. Shcherbakov MV et al (2013) A survey of forecast error measures. World Appl Sci J 24(24):171–176

  50. ITU-T Recommendation P.862.1. (2003) Mapping function for transforming P.862 raw result scores to MOS-LQO

  51. Falk TH, Chan W-Y (2010) Temporal dynamics for blind measurement of room acoustical parameters. IEEE Trans Instrum Meas 59(4):978–989

  52. Kenny P et al (2008) A study of interspeaker variability in speaker verification. IEEE Trans Audio Speech Lang Process 16(5):980–988

  53. ITU-T Recommendation P.835. (2003) Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm. International Telecommunication Union, Geneva

  54. Cauchi B et al (2015) Combination of MVDR beamforming and single-channel spectral processing for enhancing noisy and reverberant speech. EURASIP J Adv Signal Process 2015:1–12

  55. Thiemann J et al (2016) Speech enhancement for multimicrophone binaural hearing aids aiming to preserve the spatial auditory scene. EURASIP J Adv Signal Process 2016(1):12


Acknowledgements

The authors would like to thank the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), the Fonds de recherche du Québec - Nature et Technologies (FRQNT), and the Natural Sciences and Engineering Research Council of Canada (NSERC) for their financial support.

Author information

Corresponding author

Correspondence to Anderson R. Avila.

Cite this article

Avila, A.R., O’Shaughnessy, D. & Falk, T.H. Non-intrusive speech quality prediction based on the blind estimation of clean speech and the i-vector framework. Qual User Exp 5, 11 (2020). https://doi.org/10.1007/s41233-020-00040-3

  • DOI: https://doi.org/10.1007/s41233-020-00040-3
