Abstract
Many motor skills in humanoid robotics can be learned using parametrized motor primitives. While successful applications to date have relied on imitation learning, most of the interesting motor learning problems are high-dimensional reinforcement learning problems that are often beyond the reach of current reinforcement learning methods. In this paper, we study parametrized policy search methods and apply them to benchmark problems of motor primitive learning in robotics. We show that many well-known parametrized policy search methods can be derived from a general, common framework that yields both policy gradient methods and expectation-maximization (EM)-inspired algorithms. We introduce a novel EM-inspired algorithm for policy learning that is particularly well-suited for dynamical system motor primitives. We compare this algorithm to several well-known parametrized policy search methods, such as episodic REINFORCE, ‘Vanilla’ Policy Gradients with optimal baselines, the episodic Natural Actor Critic, and episodic Reward-Weighted Regression, and show that it outperforms them on an empirical benchmark of learning dynamical system motor primitives, both in simulation and on a real robot. Finally, we apply it to motor learning and show that it can learn a complex Ball-in-a-Cup task on a real Barrett WAM™ robot arm.
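To make the class of methods discussed in the abstract concrete, the following is a minimal, illustrative sketch of episodic, reward-weighted policy search over motor-primitive parameters: the parameter vector is perturbed in parameter space, each perturbation is evaluated in one episode, and the update is a reward-weighted average of the explorations. This is only a sketch of the general EM-inspired, reward-weighted idea, not the algorithm proposed in the paper; the toy return function, the exponential reward transformation, and all names (rollout_return, reward_weighted_update, sigma, beta) are assumptions made for this example.

# Sketch of episodic, reward-weighted parameter search (illustrative only;
# NOT the paper's algorithm). Toy task and all names are assumptions.
import numpy as np

def rollout_return(theta, target=np.array([0.8, -0.3, 0.5])):
    """Toy episodic 'return': higher when the parameters are close to a
    hidden target vector (stand-in for executing a motor primitive)."""
    return -np.sum((theta - target) ** 2)

def reward_weighted_update(theta, n_rollouts=50, sigma=0.1, beta=5.0, rng=None):
    """One update: perturb the parameters, weight each sample by its
    transformed return, and take the reward-weighted mean perturbation."""
    rng = np.random.default_rng() if rng is None else rng
    eps = sigma * rng.standard_normal((n_rollouts, theta.size))   # exploration in parameter space
    returns = np.array([rollout_return(theta + e) for e in eps])  # one episode per sample
    w = np.exp(beta * (returns - returns.max()))                  # assumed exponential reward transformation
    w /= w.sum()
    return theta + w @ eps                                        # reward-weighted average of explorations

theta = np.zeros(3)
for _ in range(30):
    theta = reward_weighted_update(theta)
print("final parameters:", theta, "return:", rollout_return(theta))

A design trait shared by this family of updates is that exploration happens directly in the space of motor-primitive parameters rather than by perturbing actions at every time step, which keeps the executed trajectories smooth and is one reason such updates are attractive for dynamical system motor primitives.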
Editors: S. Whiteson and M. Littman.
Kober, J., Peters, J. Policy search for motor primitives in robotics. Mach Learn 84, 171–203 (2011). https://doi.org/10.1007/s10994-010-5223-6