Abstract
This paper addresses the problem of batch Reinforcement Learning with Expert Demonstrations (RLED). In RLED, the goal is to find an optimal policy of a Markov Decision Process (MDP) using a fixed data set of sampled transitions of the MDP together with a fixed data set of expert demonstrations. This differs slightly from the batch Reinforcement Learning (RL) framework, where only the fixed sampled transitions of the MDP are available. The aim of this article is therefore to propose algorithms that leverage the expert data. The idea proposed here differs from Approximate Dynamic Programming methods in that we minimize the Optimal Bellman Residual (OBR), with the minimization guided by constraints defined by the expert demonstrations. This choice is motivated by the fact that controlling the OBR implies controlling the distance between the estimated and optimal quality functions. However, the method presents some difficulties, as the criterion to minimize is non-convex, non-differentiable and biased. These difficulties are overcome via the embedding of distributions in a Reproducing Kernel Hilbert Space (RKHS) and a boosting technique, which yields non-parametric algorithms. Finally, our algorithms are compared to the only state-of-the-art algorithm, Approximate Policy Iteration with Demonstrations (APID), in different experimental settings.
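To make the approach concrete, a minimal sketch of the constrained criterion follows, assuming (as in APID-style formulations) that the expert demonstrations enter through large-margin penalties; the exact norms, margin function and weighting used in the paper may differ, and the symbols \(\lambda\), \(\ell\) and \(D_E\) below are illustrative:

\[
J(Q) \;=\; \big\| T^{*}Q - Q \big\|_{\mu}
\;+\; \lambda \sum_{(s,\,a_E) \in D_E} \Big( \max_{a \in A}\big[\, Q(s,a) + \ell(s, a_E, a) \,\big] - Q(s, a_E) \Big),
\]

where \(T^{*}\) is the optimal Bellman operator, \(\mu\) the sampling distribution of the batch transitions, \(D_E\) the set of expert state-action pairs, \(\ell\) a margin function and \(\lambda\) a trade-off parameter. In this reading, the RKHS embedding of the transition distributions serves to estimate \(\| T^{*}Q - Q \|_{\mu}\) from single sampled transitions without the usual bias, and boosting (functional gradient descent with weak regressors) handles the non-convexity and non-differentiability while keeping the representation of \(Q\) non-parametric.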
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Piot, B., Geist, M., Pietquin, O. (2014). Boosted Bellman Residual Minimization Handling Expert Demonstrations. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science, vol. 8725. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44851-9_35
DOI: https://doi.org/10.1007/978-3-662-44851-9_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44850-2
Online ISBN: 978-3-662-44851-9