Abstract
Today, many services provide information over the phone using a prerecorded or synthesized voice. These voices are invariant in speed. Humans giving information over the telephone, however, tend to adapt the speed of their presentation to suit the needs of the listener. This paper presents a preliminary model of this adaptation. In a corpus of simulated directory assistance dialogs, the operator's speed in number-giving correlates with the speed of the user's initial response and with the user's speaking rate. Multiple regression gives a formula that predicts appropriate speaking rates, and these predictions correlate (0.46) with the speeds observed in good dialogs in the corpus. It is therefore easy, at least in principle, to make systems that adapt their speed to users' needs.
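As an illustration of the kind of model the abstract describes, the following is a minimal sketch of fitting a two-predictor linear regression and checking how well its predictions correlate with observed speeds. It is not the authors' formula: the function names, the feature definitions (`response_speed`, `user_rate`), the units, and all data values are assumptions made up for this example, and the coefficients it prints have no connection to the corpus results.

```python
import random


def solve_linear(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_rate_model(response_speeds, user_rates, operator_rates):
    """Least-squares fit of operator_rate ~ b0 + b1*response_speed + b2*user_rate.

    Solves the normal equations (X^T X) b = X^T y. Feature names are hypothetical.
    """
    rows = [[1.0, s, u] for s, u in zip(response_speeds, user_rates)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, operator_rates)) for i in range(3)]
    return solve_linear(xtx, xty)


def predict_rate(coeffs, response_speed, user_rate):
    """Predicted speaking rate for one user from the fitted coefficients."""
    return coeffs[0] + coeffs[1] * response_speed + coeffs[2] * user_rate


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Synthetic data in arbitrary units, invented purely for illustration.
    rng = random.Random(0)
    speeds = [rng.uniform(0.5, 2.0) for _ in range(50)]
    rates = [rng.uniform(5.0, 10.0) for _ in range(50)]
    observed = [4.0 + 1.0 * s + 0.3 * u + rng.gauss(0.0, 1.0)
                for s, u in zip(speeds, rates)]

    coeffs = fit_rate_model(speeds, rates, observed)
    predicted = [predict_rate(coeffs, s, u) for s, u in zip(speeds, rates)]
    print("coefficients:", coeffs)
    print("correlation of predictions with observed:", pearson(predicted, observed))
```

In a deployed system, the fitted coefficients would be applied once the user's first response has been measured, and the resulting value passed to the synthesizer as its rate setting; how those measurements are obtained is outside this sketch.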
Ward, N., Nakagawa, S. Automatic User-Adaptive Speaking Rate Selection. International Journal of Speech Technology 7, 259–268 (2004). https://doi.org/10.1023/B:IJST.0000037070.31146.f9