Abstract
The partially observable Markov decision process (POMDP) framework has been applied to dialogue systems as a formal way to represent uncertainty explicitly while remaining robust to noise. In this context, estimating the dialogue POMDP model components (states, observations, and reward) is a significant challenge, since these components directly affect the optimized dialogue POMDP policy. Learning the states and observations of a dialogue POMDP was covered in the first part of this work (Part I); this part (Part II) covers learning the reward function that the POMDP requires. To this end, we propose two algorithms based on inverse reinforcement learning (IRL). The first, called POMDP-IRL-BT (BT for belief transition), approximates a belief transition model analogous to the transition models of Markov decision processes. The second, a point-based algorithm denoted PB-POMDP-IRL (PB for point-based), approximates the values of new beliefs, which arise during the computation of policy values, through a linear approximation over the expert beliefs. Finally, we apply the two algorithms to healthcare dialogue management in order to learn a dialogue POMDP from dialogues collected by SmartWheeler (an intelligent wheelchair).
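To make these two ideas concrete, the Python sketch below illustrates (a) the standard POMDP belief update, (b) a count-based belief-transition estimate in the spirit of POMDP-IRL-BT, and (c) a least-squares linear approximation of a new belief's value from expert beliefs in the spirit of PB-POMDP-IRL. All function names, and the particular nearest-point and least-squares estimators, are our illustrative assumptions, not the exact formulations of the paper.

import numpy as np

def belief_update(b, a, o, T, O):
    # Standard Bayesian belief update: b'(s') is proportional to
    # O[a, s', o] * sum_s T[a, s, s'] * b[s], renormalized.
    # b: (S,) belief; T: (A, S, S) transitions; O: (A, S, Obs) observations.
    b_next = O[a, :, o] * (T[a].T @ b)
    return b_next / b_next.sum()

def estimate_belief_transitions(trajectories, belief_points, n_actions):
    # POMDP-IRL-BT-style idea (hypothetical estimator): treat a finite set
    # of expert belief points as "states" and estimate P(b' | b, a) by
    # mapping each observed belief to its nearest point and counting.
    K = len(belief_points)
    counts = np.zeros((n_actions, K, K))

    def nearest(b):
        return int(np.argmin([np.linalg.norm(b - bp) for bp in belief_points]))

    for traj in trajectories:              # traj: list of (b, a, b') triples
        for b, a, b2 in traj:
            counts[a, nearest(b), nearest(b2)] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    return np.divide(counts, totals,
                     out=np.full_like(counts, 1.0 / K),  # uniform if unseen
                     where=totals > 0)

def approximate_new_belief_value(b_new, expert_beliefs, expert_values):
    # PB-POMDP-IRL-style idea (hypothetical estimator): write the new belief
    # as a least-squares linear combination of expert beliefs, then reuse
    # the same weights to combine their values.
    B = np.stack(expert_beliefs, axis=1)            # columns: expert beliefs
    w, *_ = np.linalg.lstsq(B, b_new, rcond=None)   # b_new ~ B @ w
    return float(np.dot(w, expert_values))

In both sketches, the point-based approximation keeps computation over the continuous belief simplex tractable by anchoring it to the finite set of beliefs actually visited by the expert.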





References
Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML’04). Banff, AB, Canada.
Boularias, A., Chinaei, H. R., & Chaib-draa, B. (2010). Learning the reward model of dialogue POMDPs from data. In NIPS 2010 Workshop on Machine Learning for Assistive Technologies. Vancouver, BC, Canada.
Boularias, A., Kober, J., & Peters, J. (2011). Relative entropy inverse reinforcement learning. Journal of Machine Learning Research: Workshop and Conference Proceedings, 15, 182–189.
Chandramohan, S., Geist, M., Lefèvre, F., & Pietquin, O. (2012). Behavior specific user simulation in spoken dialogue systems. In Proceedings of the IEEE ITG Conference on Speech Communication. Braunschweig, Germany.
Chinaei, H. R., & Chaib-draa, B. (2011). Learning dialogue POMDP models from data. In Proceedings of the 24th Canadian Conference on Advances in Artificial Intelligence (Canadian AI’11). St. John’s, NL, Canada.
Chinaei, H. R., & Chaib-draa, B. (2014). Dialogue POMDP components (Part I): Learning states and observations. International Journal of Speech Technology (this issue).
Chinaei, H. R., Chaib-draa, B., & Lamontagne, L. (2012). Learning observation models for dialogue POMDPs. In Proceedings of the 25th Canadian Conference on Advances in Artificial Intelligence (Canadian AI’12). Toronto, ON, Canada.
Choi, J., & Kim, K.-E. (2011). Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12, 691–730.
Gašić, M. (2011). Statistical Dialogue Modelling. PhD thesis, Department of Engineering, University of Cambridge.
Ji, S., Parr, R., Li, H., Liao, X., & Carin, L. (2007). Point-based policy iteration. In Proceedings of the 22nd National Conference on Artificial Intelligence (vol. 2) (AAAI’07). Vancouver, BC, Canada.
Kim, D., Kim, J., & Kim, K. (2011). Robust performance evaluation of POMDP-based dialogue systems. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 1029–1040.
Neu, G., & Szepesvári, C. (2007). Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI’07). Vancouver, BC, Canada.
Ng, A. Y., & Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML’00). Stanford, CA, USA.
Paek, T., & Pieraccini, R. (2008). Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication, 50(8), 716–729.
Pinault, F., & Lefèvre, F. (2011). Semantic graph clustering for POMDP-based spoken dialog systems. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH’11). Florence, Italy.
Pineau, J., Gordon, G., & Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI’03). Acapulco, Mexico.
Pineau, J., West, R., Atrash, A., Villemure, J., & Routhier, F. (2011). On the feasibility of using a standardized test for evaluating a speech-controlled smart wheelchair. International Journal of Intelligent Control and Systems, 16(2), 124–131.
Ramachandran, D., & Amir, E. (2007). Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07). Hyderabad, India.
Roy, N., Pineau, J., & Thrun, S. (2000). Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL’00). Hong Kong.
Spaan, M., & Vlassis, N. (2005). Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24(1), 195–220.
Syed, U., & Schapire, R. (2008). A game-theoretic approach to apprenticeship learning. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada.
Thomson, B. (2009). Statistical Methods for Spoken Dialogue Management. PhD thesis, Department of Engineering, University of Cambridge.
Williams, J. D. (2006). Partially Observable Markov Decision Processes for Spoken Dialogue Management. PhD thesis, Department of Engineering, University of Cambridge.
Williams, J. D., & Young, S. (2005). The SACTI-1 corpus: Guide for research users. Technical Report. Department of Engineering, University of Cambridge.
Williams, J. D., & Young, S. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.
Zhang, B., Cai, Q., Mao, J., Chang, E., & Guo, B. (2001a). Spoken dialogue management as planning and acting under uncertainty. In Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech’01). Aalborg, Denmark.
Zhang, B., Cai, Q., Mao, J., & Guo, B. (2001b). Planning and acting under uncertainty: A new model for spoken dialogue system. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence (UAI’01), Seattle, WA, USA.
Ziebart, B., Maas, A., Bagnell, J., & Dey, A. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI’08). Chicago, IL, USA.