Abstract
Conventionally, Automatic Speech Recognition (ASR) systems are evaluated on their ability to correctly recognize each word contained in a speech signal. In this context, the word error rate (WER) is the reference metric for evaluating speech transcripts. Several studies have shown that this measure is too limited to properly evaluate an ASR system, which has led to the proposal of alternative metrics (weighted WER, BERTScore, semantic distance, etc.). However, these remain system-oriented, even when transcripts are intended for humans. In this paper, we first present Human Assessed Transcription Side-by-side (HATS), an original French data set manually annotated with human perception of transcription errors produced by various ASR systems: 143 humans were asked to choose the better of two automatic transcription hypotheses. We then investigate the relationship between human preferences and various ASR evaluation metrics, including lexical and embedding-based ones, the latter being those that supposedly correlate best with human perception.
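To make the limitation of purely lexical scoring concrete, the following minimal sketch (not the authors' code; the reference and the two hypotheses are invented for illustration) computes WER as a word-level Levenshtein distance for two competing hypotheses of the same utterance, mirroring the side-by-side setting used in HATS. Both hypotheses receive the same WER even though human annotators might clearly prefer one of them.

# Minimal illustrative sketch, assuming invented example sentences.
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

reference = "le chat dort sur le canapé"
hyp_a = "le chat dors sur le canapé"   # one substitution (grammatical error)
hyp_b = "le chat dort sur canapé"      # one deletion (missing word)
# Both errors cost exactly one edit, so WER scores them identically (~0.17),
# whereas human raters asked to pick the better transcript may not be indifferent.
print(wer(reference, hyp_a), wer(reference, hyp_b))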
Ethics declarations
Ethics Statement
The aim of this paper is to propose a new method for evaluating speech-to-text systems that better aligns with human perception. However, transcription quality is inherently subjective: optimizing systems to correlate only with the perceptions of the studied population could be inequitable if those perceptions do not generalize to the rest of the population.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bañeras-Roux, T., Wottawa, J., Rouvier, M., Merlin, T., Dufour, R. (2023). HATS: An Open Data Set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics. In: Ekštein, K., Pártl, F., Konopík, M. (eds.) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol. 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_15
DOI: https://doi.org/10.1007/978-3-031-40498-6_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40497-9
Online ISBN: 978-3-031-40498-6
eBook Packages: Computer Science, Computer Science (R0)