{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T00:20:58Z","timestamp":1721694058599},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,6,1]],"date-time":"2024-06-01T00:00:00Z","timestamp":1717200000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,6,5]],"date-time":"2024-06-05T00:00:00Z","timestamp":1717545600000},"content-version":"vor","delay-in-days":4,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Speech Technol"],"published-print":{"date-parts":[[2024,6]]},"abstract":"Abstract<\/jats:title>On-device virtual assistants (VAs) powered by automatic speech recognition (ASR) require effective knowledge integration for the challenging entity-rich query recognition. In this paper, we conduct an empirical study of modeling strategies for server-side rescoring of spoken information domain queries using various categories of language models (LMs) (N<\/jats:italic><\/jats:bold>-gram word LMs, sub-word neural LMs). We investigate the combination of on-device and server-side signals, and demonstrate significant word error rate improvements of 23%-35% relative on various entity-centric query subpopulations by integrating various server-side LMs compared to performing ASR on-device only. We also perform a comparison between LMs trained on domain data and a generative pre-trained (GPT) (a variant GPT-3) offered by OpenAI as a baseline. Furthermore, we also show that model fusion of multiple server-side LMs trained from scratch most effectively combines complementary strengths of each model and integrates knowledge learned from domain-specific data to a VA ASR system.<\/jats:p>","DOI":"10.1007\/s10772-024-10102-y","type":"journal-article","created":{"date-parts":[[2024,6,5]],"date-time":"2024-06-05T18:04:33Z","timestamp":1717610673000},"page":"367-375","update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Server-side rescoring of spoken entity-centric knowledge queries for virtual assistants"],"prefix":"10.1007","volume":"27","author":[{"given":"Youyuan","family":"Zhang","sequence":"first","affiliation":[]},{"given":"Sashank","family":"Gondala","sequence":"additional","affiliation":[]},{"given":"Thiago","family":"Fraga-Silva","sequence":"additional","affiliation":[]},{"ORCID":"http:\/\/orcid.org\/0000-0003-3433-7317","authenticated-orcid":false,"given":"Christophe Van","family":"Gysel","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,6,5]]},"reference":[{"key":"10102_CR1","doi-asserted-by":"crossref","unstructured":"Achanta, S., Antony, A., Golipour, L., Li, J., Raitio, T., Rasipuram, R., Rossi, F., Shi, J., Upadhyay, J., Winarsky, D., & Zhang, H. (2021). On-device neural speech synthesis. In ASRU (pp 1155\u20131161). IEEE.","DOI":"10.1109\/ASRU51503.2021.9688154"},{"key":"10102_CR2","volume-title":"A neural probabilistic language model","author":"Y Bengio","year":"2000","unstructured":"Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. 
NeurIPS."},{"key":"10102_CR3","first-page":"1877","volume":"33","author":"T Brown","year":"2020","unstructured":"Brown, T., Mann, B., Ryder, N., et al., ... Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877\u20131901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"10102_CR4","unstructured":"Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30"},{"key":"10102_CR5","first-page":"4171","volume-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","author":"J Devlin","year":"2019","unstructured":"Devlin, J., Chang, M. W., Lee, K., et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding (pp. 4171\u20134186). NAACL."},{"key":"10102_CR6","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735\u20131780.","journal-title":"Neural Computation"},{"key":"10102_CR7","unstructured":"Huang, H., & Peng, F. (2019). An empirical study of efficient ASR rescoring with transformers. arXiv preprintarXiv:1910.11450"},{"key":"10102_CR8","doi-asserted-by":"publisher","first-page":"689","DOI":"10.21437\/Interspeech.2022-10820","volume-title":"Sentence-select: Large-scale language model data selection for rare-word speech recognition","author":"WR Huang","year":"2022","unstructured":"Huang, W. R., Peyser, C., Sainath, T., Pang, R., Strohman, T. D., & Kumar, S. (2022). Sentence-select: Large-scale language model data selection for rare-word speech recognition (pp. 689\u2013693). Interspeech. https:\/\/doi.org\/10.21437\/Interspeech.2022-10820"},{"key":"10102_CR9","first-page":"6854","volume-title":"SNDCNN: Self-normalizing deep CNNs with scaled exponential linear units for speech recognition","author":"Z Huang","year":"2020","unstructured":"Huang, Z., Ng, T., Liu, L., Mason, H., Zhuang, X., & Liu, D. (2020). SNDCNN: Self-normalizing deep CNNs with scaled exponential linear units for speech recognition (pp. 6854\u20136858). ICASSP."},{"key":"10102_CR10","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-2225","volume-title":"Language modeling with deep transformers","author":"K Irie","year":"2019","unstructured":"Irie, K., Zeyer, A., Schl\u00fcter, R., & Ney, H. (2019). Language modeling with deep transformers. Interspeech."},{"key":"10102_CR11","unstructured":"Juniper Research (2019). Digital voice assistants in use to triple to 8 billion by 2023, driven by smart home devices. Retrieved from https:\/\/www.juniperresearch.com\/press\/digital-voice-assistants-in-use-to-8-million-2023."},{"key":"10102_CR12","volume-title":"Speech and language processing","author":"D Jurafsky","year":"2008","unstructured":"Jurafsky, D., & Martin, J. H. (2008). Speech and language processing (2nd ed.). Prentice Hall.","edition":"2"},{"key":"10102_CR13","volume-title":"Speech and language processing","author":"D Jurafsky","year":"2023","unstructured":"Jurafsky, D., & Martin, J. H. (2023). Speech and language processing (3rd ed.). Prentice Hall.","edition":"3"},{"key":"10102_CR14","doi-asserted-by":"publisher","first-page":"400","DOI":"10.1109\/TASSP.1987.1165125","volume":"35","author":"SM Katz","year":"1987","unstructured":"Katz, S. 
M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. ASSP, 35, 400\u2013401.","journal-title":"ASSP"},{"key":"10102_CR15","volume-title":"Adam: A method for stochastic optimization","author":"DP Kingma","year":"2015","unstructured":"Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR."},{"key":"10102_CR16","first-page":"66","volume-title":"SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing","author":"T Kudo","year":"2018","unstructured":"Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing (p. 66). EMNLP."},{"key":"10102_CR17","first-page":"1369","volume-title":"Alexa, can you help me shop?","author":"Y Maarek","year":"2019","unstructured":"Maarek, Y. (2019). Alexa, can you help me shop? (pp. 1369\u20131370). SIGIR."},{"key":"10102_CR18","unstructured":"OpenAI. (2023a). OpenAI \u2013 model index for researchers. Retrieved 13 Feb 2023, from https:\/\/platform.openai.com\/docs\/model-index-for-researchers."},{"key":"10102_CR19","unstructured":"OpenAI. (2023b). OpenAI \u2013 models. Retrieved 13 Feb 2023, from https:\/\/platform.openai.com\/docs\/models\/gpt-3."},{"key":"10102_CR20","unstructured":"Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J. Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems."},{"key":"10102_CR21","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-352","volume-title":"ASR-generated text for language model pre-training applied to speech tasks","author":"V Pelloin","year":"2022","unstructured":"Pelloin, V., Dary, F., Herv\u00e9, N., Favre, B., Camelin, N., Laurent, A., & Besacier, L. (2022). ASR-generated text for language model pre-training applied to speech tasks. Interspeech."},{"key":"10102_CR22","doi-asserted-by":"publisher","first-page":"4921","DOI":"10.21437\/Interspeech.2020-1465","volume-title":"Improving tail performance of a deliberation E2E ASR model using a large text corpus","author":"C Peyser","year":"2020","unstructured":"Peyser, C., Mavandadi, S., Sainath, T. N., Apfel, J., Pang, R., & Kumar, S. (2020). Improving tail performance of a deliberation E2E ASR model using a\nlarge text corpus (pp. 4921\u20134925). Interspeech. https:\/\/doi.org\/10.21437\/Interspeech.2020-1465"},{"issue":"2","key":"10102_CR23","doi-asserted-by":"publisher","first-page":"155","DOI":"10.1093\/comjnl\/7.2.155","volume":"7","author":"MJ Powell","year":"1964","unstructured":"Powell, M. J. (1964). An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7(2), 155\u2013162.","journal-title":"The Computer Journal"},{"issue":"140","key":"10102_CR24","first-page":"1","volume":"21","author":"C Raffel","year":"2020","unstructured":"Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). 
Exploring the limits of transfer learning with a unified text-to-text transformer.\nJournal of Machine Learning Research, 21(140), 1\u201367.","journal-title":"Journal of Machine Learning Research"},{"key":"10102_CR25","first-page":"2699","volume-title":"Masked language model scoring","author":"J Salazar","year":"2020","unstructured":"Salazar, J., Liang, D., Nguyen, T. Q., & Kirchhoff, K. (2020). Masked language model scoring (pp. 2699\u20132712). ACL."},{"key":"10102_CR26","first-page":"464","volume-title":"Self-attention with relative position representations","author":"P Shaw","year":"2018","unstructured":"Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations (pp. 464\u2013468). NAACL."},{"key":"10102_CR27","first-page":"1081","volume-title":"Asian conference on machine learning","author":"J Shin","year":"2019","unstructured":"Shin, J., Lee, Y., & Jung, K. (2019). Effective sentence scoring method using BERT for speech recognition. Asian conference on machine learning (pp. 1081\u20131093). PMLR."},{"key":"10102_CR28","unstructured":"Statista (2022). Smart home device household penetration in the United States in 2019 and 2021. Retrieved from https:\/\/www.statista.com\/statistics\/1247351\/smart-home-device-us-household-penetration"},{"key":"10102_CR29","doi-asserted-by":"publisher","DOI":"10.21437\/ICSLP.2002-303","volume-title":"SRILM-an extensible language modeling toolkit","author":"A Stolcke","year":"2002","unstructured":"Stolcke, A. (2002). SRILM-an extensible language modeling toolkit. ICSLP."},{"key":"10102_CR30","first-page":"1613","volume-title":"Predicting entity popularity to improve spoken entity recognition by virtual assistants","author":"C Van Gysel","year":"2020","unstructured":"Van Gysel, C., Tsagkias, M., Pusateri, E., & Oparin, I. (2020). Predicting entity popularity to improve spoken entity recognition by virtual assistants (pp. 1613\u20131616). SIGIR."},{"key":"10102_CR31","volume-title":"Space-efficient representation of entity-centric query language models","author":"C Van Gysel","year":"2022","unstructured":"Van Gysel, C., Hannemann, M., Pusateri, E., Oualil, Y., & Oparin, I. (2022). Space-efficient representation of entity-centric query language models. Interspeech."},{"key":"10102_CR32","volume-title":"Attention is all you need","author":"A Vaswani","year":"2017","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. NeurIPS."},{"key":"10102_CR33","first-page":"261","volume-title":"Leveraging ASR n-best in deep entity retrieval","author":"H Wang","year":"2021","unstructured":"Wang, H., Chen, J., Laali, M., Durda, K., King, J., Campbell, W., & Liu, Y. (2021). Leveraging ASR n-best in deep entity retrieval (pp. 261\u2013265). Interspeech."},{"key":"10102_CR34","first-page":"4725","volume-title":"Dual fixed-size ordinally forgetting encoding (FOFE) for competitive neural language models","author":"S Watcharawittayakul","year":"2018","unstructured":"Watcharawittayakul, S., Xu, M., & Jiang, H. (2018). Dual fixed-size ordinally forgetting encoding (FOFE) for competitive neural language models (pp. 4725\u20134730). EMNLP."},{"key":"10102_CR35","doi-asserted-by":"publisher","first-page":"1031","DOI":"10.21437\/Interspeech.2022-10660","volume-title":"Improving Rare Word Recognition with LM-aware MWER Training","author":"W Weiran","year":"2022","unstructured":"Weiran, W., Chen, T., Sainath, T. 
N., Variani, E., Prabhavalkar, R., Huang, R., Ramabhadran, B., Gaur, N., Mavandadi, S., Peyser, C., Strohman, T., He, Y., & Rybach, D. (2022). Improving Rare Word Recognition with LM-aware MWER Training (pp. 1031\u2013 1035). Interspeech. https:\/\/doi.org\/10.21437\/Interspeech.2022-10660"},{"issue":"4","key":"10102_CR36","doi-asserted-by":"publisher","first-page":"1085","DOI":"10.1109\/18.87000","volume":"37","author":"IH Witten","year":"1991","unstructured":"Witten, I. H., & Bell, T. C. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), 1085\u20131094.","journal-title":"IEEE Transactions on Information Theory"},{"key":"10102_CR37","first-page":"6117","volume-title":"RescoreBERT: Discriminative speech recognition rescoring with Bert","author":"L Xu","year":"2022","unstructured":"Xu, L., Gu, Y., Kolehmainen, J., Khan, H., Gandhe, A., Rastrow, A., Stolcke, A., & Bulyko, I. (2022). RescoreBERT: Discriminative speech recognition rescoring with Bert (pp. 6117\u2013\n6121). ICASSP."},{"key":"10102_CR38","first-page":"495","volume-title":"The fixed-size ordinally-forgetting encoding method for neural network language models","author":"S Zhang","year":"2015","unstructured":"Zhang, S., Jiang, H., Xu, M., Hou, J., & Dai, L. (2015). The fixed-size ordinally-forgetting encoding method for neural network language models (pp. 495\u2013500). ACL-IJCNLP."}],"container-title":["International Journal of Speech Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10772-024-10102-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10772-024-10102-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10772-024-10102-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,22]],"date-time":"2024-07-22T16:07:33Z","timestamp":1721664453000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10772-024-10102-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6]]},"references-count":38,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,6]]}},"alternative-id":["10102"],"URL":"https:\/\/doi.org\/10.1007\/s10772-024-10102-y","relation":{},"ISSN":["1381-2416","1572-8110"],"issn-type":[{"value":"1381-2416","type":"print"},{"value":"1572-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6]]},"assertion":[{"value":"2 January 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 May 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 June 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors were employed by (a subsidiary of) Apple while performing the research and writing for this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}
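
Editorial aside (not part of the Crossref record or of the article itself): the abstract above describes re-ranking on-device ASR hypotheses with a combination of the first-pass score and several server-side LM scores. A minimal, hypothetical Python sketch of such log-linear score fusion for n-best rescoring follows; every name, weight, and toy scorer here is invented for illustration, and the paper's actual models and weight-tuning procedure are not reproduced.

# Illustrative sketch only: log-linear fusion of an on-device first-pass ASR
# score with multiple server-side LM scores for n-best rescoring. All names,
# weights, and scorers are hypothetical, not taken from the paper.
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Hypothesis:
    text: str
    asr_log_prob: float  # first-pass (on-device) score, log domain

def rescore(
    nbest: Sequence[Hypothesis],
    lm_scorers: Sequence[Callable[[str], float]],  # each maps text -> log-probability
    lm_weights: Sequence[float],                   # one weight per LM, tuned on held-out data
    asr_weight: float = 1.0,
) -> List[Hypothesis]:
    """Sort hypotheses by a weighted (log-linear) combination of all scores."""
    def fused_score(h: Hypothesis) -> float:
        server_side = sum(w * lm(h.text) for w, lm in zip(lm_weights, lm_scorers))
        return asr_weight * h.asr_log_prob + server_side
    return sorted(nbest, key=fused_score, reverse=True)

# Toy usage: two stand-in scorers play the roles of an N-gram LM and a neural LM.
if __name__ == "__main__":
    ngram_lm = lambda text: -0.5 * len(text.split())  # placeholder, not a real LM
    neural_lm = lambda text: -0.1 * len(text)         # placeholder, not a real LM
    nbest = [
        Hypothesis("play thriller by michael jackson", asr_log_prob=-7.2),
        Hypothesis("play thrill or by michael jackson", asr_log_prob=-6.9),
    ]
    best = rescore(nbest, [ngram_lm, neural_lm], lm_weights=[0.6, 0.4])[0]
    print(best.text)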