Abstract
In the Bayesian approach to sequential decision making, exact calculation of the (subjective) utility is intractable. This is also true for most special cases of interest, such as reinforcement learning problems. While utility bounds are known to exist for this problem, so far none of them has been particularly tight. In this paper, we show how to efficiently calculate a lower bound that corresponds to the utility of a near-optimal memoryless policy for the decision problem; this policy is generally different from both the Bayes-optimal policy and the policy that is optimal for the expected MDP under the current belief. We then show how these bounds can be applied to obtain robust exploration policies in a Bayesian reinforcement learning setting.
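The lower bound described above is the expected utility, under the current belief, of a particular memoryless policy. As a rough, minimal sketch of that idea (not the paper's algorithm), the Python code below assumes a finite MDP with known rewards and an independent Dirichlet posterior over transitions, and estimates the expected utility of a fixed memoryless policy by Monte Carlo over MDPs sampled from the posterior; since the Bayes-optimal policy is at least as good as any fixed policy, the estimate lower-bounds the Bayes-optimal utility up to sampling error. All function names and parameters here are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def sample_mdp(dirichlet_counts, rng):
    """Draw one transition model P[s, a, s'] from an independent Dirichlet posterior."""
    S, A, _ = dirichlet_counts.shape
    P = np.empty_like(dirichlet_counts, dtype=float)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(dirichlet_counts[s, a])
    return P

def policy_value(P, R, pi, gamma):
    """Exact evaluation of a memoryless policy pi (array mapping state -> action)
    on a single MDP: solve (I - gamma * P_pi) V = R_pi."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]   # S x S transition matrix induced by pi
    R_pi = R[np.arange(S), pi]   # expected immediate reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def utility_lower_bound(dirichlet_counts, R, pi, gamma, start_dist,
                        n_samples=100, seed=0):
    """Monte Carlo estimate of E[ V_pi(MDP) ] over the posterior on MDPs.
    Because pi is one fixed memoryless policy, this expected utility is a
    lower bound on the Bayes-optimal utility (up to Monte Carlo error)."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_samples):
        P = sample_mdp(dirichlet_counts, rng)
        values.append(start_dist @ policy_value(P, R, pi, gamma))
    return float(np.mean(values))
```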