Abstract
This paper addresses the problem of batch Reinforcement Learning with Expert Demonstrations (RLED). In RLED, the goal is to find an optimal policy of a Markov Decision Process (MDP) using a fixed data set of sampled transitions of the MDP together with a fixed data set of expert demonstrations. This differs slightly from the batch Reinforcement Learning (RL) framework, where only the fixed sampled transitions of the MDP are available. The aim of this article is therefore to propose algorithms that leverage the expert data. The idea proposed here differs from Approximate Dynamic Programming methods in that we minimize the Optimal Bellman Residual (OBR), with the minimization guided by constraints defined by the expert demonstrations. This choice is motivated by the fact that controlling the OBR implies controlling the distance between the estimated and optimal quality functions. However, the method presents some difficulties, as the criterion to minimize is non-convex, non-differentiable and biased. These difficulties are overcome via the embedding of distributions in a Reproducing Kernel Hilbert Space (RKHS) and a boosting technique, which yields non-parametric algorithms. Finally, our algorithms are compared to the only state-of-the-art algorithm, Approximate Policy Iteration with Demonstrations (APID), in different experimental settings.
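To make the approach concrete, a minimal sketch of the constrained criterion follows, assuming (as in APID-style formulations) that the expert demonstrations enter through large-margin penalties; the exact norms, margin function and weighting used in the paper may differ, and the symbols \(\lambda\), \(\ell\) and \(D_E\) below are illustrative:

\[
J(Q) \;=\; \big\| T^{*}Q - Q \big\|_{\mu}
\;+\; \lambda \sum_{(s,\,a_E) \in D_E} \Big( \max_{a \in A}\big[\, Q(s,a) + \ell(s, a_E, a) \,\big] - Q(s, a_E) \Big),
\]

where \(T^{*}\) is the optimal Bellman operator, \(\mu\) the sampling distribution of the batch transitions, \(D_E\) the set of expert state-action pairs, \(\ell\) a margin function and \(\lambda\) a trade-off parameter. In this reading, the RKHS embedding of the transition distributions serves to estimate \(\| T^{*}Q - Q \|_{\mu}\) from single sampled transitions without the usual bias, and boosting (functional gradient descent with weak regressors) handles the non-convexity and non-differentiability while keeping the representation of \(Q\) non-parametric.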
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Piot, B., Geist, M., Pietquin, O. (2014). Boosted Bellman Residual Minimization Handling Expert Demonstrations. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science, vol. 8725. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44851-9_35
DOI: https://doi.org/10.1007/978-3-662-44851-9_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44850-2
Online ISBN: 978-3-662-44851-9