Abstract
Nowadays, public institutions usually provide videos containing important information on their webpages. However, people with hearing impairments have difficulty accessing content delivered through that medium, and manually transcribing those videos is a time-consuming task. This problem can be addressed by means of Automatic Speech Recognition (ASR) systems. In this work, we have evaluated the performance of several ASR systems on videos from the Government of La Rioja, Spain. Our study shows that the Whisper medium model provides the best trade-off between accuracy and speed. Using this model, we have generated transcriptions of all the videos from the YouTube channel of the Government of La Rioja. In addition, we have created a tool that facilitates this task for other Spanish-language YouTube channels. Hence, this work can be seen as a step towards improving the accessibility of the information and content produced by Spanish public administrations.
This work was partially supported by Ministerio de Ciencia e Innovación [PID2020-115225RB-I00 / AEI / 10.13039/501100011033], by OTRI OTCA221110, and by a regional project of the Government of La Rioja.
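The transcription step described in the abstract can be reproduced with the openai-whisper Python package. The following is a minimal sketch, not the authors' exact pipeline; the audio file name is illustrative:

    import whisper  # pip install openai-whisper

    # Load the medium checkpoint, which the study identified as the best
    # trade-off between accuracy and speed for these videos.
    model = whisper.load_model("medium")

    # Transcribe a (hypothetical) audio file, forcing Spanish decoding.
    result = model.transcribe("video_audio.mp3", language="es")
    print(result["text"])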
Notes
- 1.
- 2. On July 28th, 2022.
- 3. The transcriptions of the videos are available at https://github.com/mirenmirari/subtitulos_canalgobierno.
- 4. Available at https://huggingface.co/spaces/mirari/Whisper-Youtube.
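The tool mentioned in note 4 wraps this workflow for arbitrary YouTube videos. A rough sketch of such a pipeline, assuming yt-dlp for the download step (the function and file names below are illustrative, not the space's actual code):

    import yt_dlp
    import whisper

    def format_ts(seconds):
        # Convert seconds to the SRT timestamp format HH:MM:SS,mmm.
        ms = int(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    def subtitle_youtube_video(url, out_path="subtitles.srt"):
        # Download only the audio track of the video.
        opts = {"format": "bestaudio", "outtmpl": "audio.%(ext)s"}
        with yt_dlp.YoutubeDL(opts) as ydl:
            info = ydl.extract_info(url)
            audio_file = ydl.prepare_filename(info)
        # Transcribe with the Whisper medium model, forcing Spanish.
        model = whisper.load_model("medium")
        result = model.transcribe(audio_file, language="es")
        # Write the timestamped segments as an SRT subtitle file.
        with open(out_path, "w", encoding="utf-8") as f:
            for i, seg in enumerate(result["segments"], start=1):
                f.write(f"{i}\n")
                f.write(f"{format_ts(seg['start'])} --> {format_ts(seg['end'])}\n")
                f.write(f"{seg['text'].strip()}\n\n")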
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Martín, M.S., Heras, J., Mata, G. (2023). Automatic Generation of Subtitles for Videos of the Government of La Rioja. In: Dorronsoro, B., Chicano, F., Danoy, G., Talbi, E.-G. (eds) Optimization and Learning. OLA 2023. Communications in Computer and Information Science, vol 1824. Springer, Cham. https://doi.org/10.1007/978-3-031-34020-8_30
DOI: https://doi.org/10.1007/978-3-031-34020-8_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34019-2
Online ISBN: 978-3-031-34020-8
eBook Packages: Computer Science, Computer Science (R0)