Policy Feedback for the Refinement of Learned Motion Control on a Mobile Robot | International Journal of Social Robotics

Abstract

Motion control is fundamental to mobile robots, and the associated challenge in development can be assisted by the incorporation of execution experience to increase policy robustness. In this work, we present an approach that updates a policy learned from demonstration with human teacher feedback. We contribute advice-operators as a feedback form that provides corrections on state-action pairs produced during a learner execution, and Focused Feedback for Mobile Robot Policies (F3MRP) as a framework for providing feedback to rapidly-sampled policies. Both are appropriate for mobile robot motion control domains. We present a general feedback algorithm in which multiple types of feedback, including advice-operators, are provided through the F3MRP framework, and shown to improve policies initially derived from a set of behavior examples. A comparison to providing more behavior examples instead of more feedback finds data to be generated in different areas of the state and action spaces, and feedback to be more effective at improving policy performance while producing smaller datasets.

Notes

  1. The F3MRP framework was developed within the GNU Octave scientific language [14].

  2. The empirical validations of Sect. 4.2 employ lazy learning regression techniques [6]; specifically, a form of locally weighted averaging. Incremental policy updating is particularly straightforward under lazy learning regression, since explicit rederivation is not required; policy derivation happens at execution time and so a complete policy update is accomplished by simply adding new data to the set.

  3. The positive credit flag adds the execution point, unmodified, to the dataset; and thus may equivalently be viewed as an identity function advice-operator, i.e. f(z,a)=(z,a).

  4. This scale becomes finer, and association with the underlying data trickier, if a single value is intended to be somehow distributed across only a portion of the execution states; akin to the RL issue of reward back-propagation.

  5. A Poisson formulation was chosen since the distance calculations never fall below, and often cluster near, zero. To estimate λ, frequency counts were computed for k bins (uniformly sized) of distance data (k=50).

  6. The traces \(\xi_{d}\) and \(\xi_{p}\) correspond respectively to the “Prediction Data” and “Position Data” in Fig. 1. Similarly, the trace subsets \(\hat{\xi}_{d}=\{x,y,\theta\}_{\varPhi}\) and \(\hat{\xi}_{p}=\{\mathbf{z},\mathbf{a}\}_{\varPhi}\).

  7. Here an earlier version of F3MRP was employed, which did not provide visual dataset support or interactive tagging.

  8. The same teacher (one of the authors) was used to provide both demonstration and feedback.

  9. Full domain and algorithm details may be found in [4].

  10. The exceptions being when the entire learner execution receives a correction, or when the teacher provides a demonstration for only the beginning portion of an execution.

  11. In Table 2, operators 0–5 are the baseline operators and operators 6–8 were built through operator-scaffolding.

  12. Note that operator composition is not transitive.

  13. The limit being the number of unique combinations of the parameters of the child operators.

  14. If a constant value for the rate of change in action dimension j is not defined for the robot system, reasonable options for this value include, for example, the average rate of change seen during the demonstrations.

  15. The value \(\gamma_{j,\max}\) is defined either by the physical constraints of the robot, or artificially by the control system.
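
Notes 2 and 3 can be illustrated with a minimal Python sketch. It is not the F3MRP implementation (which Note 1 says was developed in GNU Octave). The Gaussian kernel, the `bandwidth` parameter, the class and function names, and the `scale_action` operator are all hypothetical choices made for illustration. The sketch shows how, under lazy learning, adding a datapoint is the whole policy update, and how positive credit acts as the identity operator f(z,a)=(z,a).

```python
import math


class LazyPolicy:
    """Lazy-learning policy: stores (observation, action) pairs and
    predicts by locally weighted averaging at query time.

    The Gaussian kernel and `bandwidth` are illustrative assumptions;
    the notes only specify "a form of locally weighted averaging".
    """

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth
        self.data = []  # list of (z, a) tuples

    def add(self, z, a):
        """A complete policy update: append one datapoint, no refitting."""
        self.data.append((tuple(z), tuple(a)))

    def predict(self, z):
        """Return the kernel-weighted average of stored actions at z."""
        if not self.data:
            raise ValueError("policy has no data")
        sq_dists = [
            sum((zi - qi) ** 2 for zi, qi in zip(z_stored, z))
            for z_stored, _ in self.data
        ]
        # Shift by the smallest distance so at least one weight equals 1.
        # The normalized weights are unchanged, and the weight sum cannot
        # underflow to zero for queries far from the data.
        nearest = min(sq_dists)
        weights = [
            math.exp(-(d - nearest) / (2.0 * self.bandwidth ** 2))
            for d in sq_dists
        ]
        total = sum(weights)
        n_actions = len(self.data[0][1])
        return tuple(
            sum(w * a[j] for w, (_, a) in zip(weights, self.data)) / total
            for j in range(n_actions)
        )


def identity_operator(z, a):
    """Positive credit: f(z, a) = (z, a), the point is kept unmodified."""
    return tuple(z), tuple(a)


def scale_action(dim, factor):
    """Hypothetical corrective operator: scale one action dimension."""
    def operator(z, a):
        corrected = list(a)
        corrected[dim] *= factor
        return tuple(z), tuple(corrected)
    return operator


def apply_feedback(policy, execution, operator):
    """Apply an operator to each executed (z, a) pair and add the result."""
    for z, a in execution:
        policy.add(*operator(z, a))


# Two demonstrated points: observation (distance,) -> action (speed, turn).
policy = LazyPolicy(bandwidth=1.0)
policy.add((0.0,), (1.0, 0.0))
policy.add((2.0,), (3.0, 0.0))

# One learner execution point, then a correction that halves the speed.
execution = [((1.0,), policy.predict((1.0,)))]
before = policy.predict((1.0,))
apply_feedback(policy, execution, scale_action(dim=0, factor=0.5))
after = policy.predict((1.0,))
```

Because the corrected point is synthesized from the learner's own execution, the new data lands exactly where the policy was queried. Passing `identity_operator` instead would reinforce the executed behavior rather than change it.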

References

  1. Abbeel P, Coates A, Quigley M, Ng AY (2007) An application of reinforcement learning to aerobatic helicopter flight. In: Proceedings of advances in neural information processing systems

  2. Argall B, Browning B, Veloso M (2008) Learning robot motion control with demonstration and advice-operators. In: Proceedings of the IEEE/RSJ international conference on intelligent robots and systems

  3. Argall B, Browning B, Veloso M (2009) Automatic weight learning for multiple data sources when learning from demonstration. In: Proceedings of the IEEE international conference on robotics and automation

  4. Argall B, Browning B, Veloso M (2011) Teacher feedback to scaffold and refine demonstrated motion primitives on a mobile robot. Robot Auton Syst 59(3–4):243–255

  5. Argall B, Chernova S, Veloso M, Browning B (2009) A survey of robot learning from demonstration. Robot Auton Syst 57(5):469–483

  6. Atkeson CG, Moore AW, Schaal S (1997) Locally weighted learning. Artif Intell Rev 11:11–73

  7. Atkeson CG, Schaal S (1997) Robot learning from demonstration. In: Proceedings of the fourteenth international conference on machine learning (ICML’97)

  8. Bagnell JA, Schneider JG (2001) Autonomous helicopter control using reinforcement learning policy search methods. In: Proceedings of the IEEE international conference on robotics and automation

  9. Bentivegna DC (2004) Learning from observation using primitives. Ph.D. thesis, College of Computing, Georgia Institute of Technology, Atlanta, GA

  10. Billard A, Calinon S, Dillmann R, Schaal S (2008) Robot programming by demonstration. In: Siciliano B, Khatib O (eds) Handbook of robotics. Springer, New York, Chap. 59

  11. Breazeal C, Scassellati B (2002) Robots that imitate humans. Trends Cogn Sci 6(11):481–487

  12. Calinon S, Billard A (2007) Incremental learning of gestures by imitation in a humanoid robot. In: Proceedings of the 2nd ACM/IEEE international conference on human-robot interactions

  13. Chernova S, Veloso M (2008) Learning equivalent action choices from demonstration. In: Proceedings of the IEEE/RSJ international conference on intelligent robots and systems

  14. Eaton JW (2002) GNU Octave Manual. Network Theory Limited

  15. Grollman DH, Jenkins OC (2007) Dogged learning for robots. In: Proceedings of the IEEE international conference on robotics and automation

  16. Ijspeert AJ, Nakanishi J, Schaal S (2002) Learning rhythmic movements by demonstration using nonlinear oscillators. In: Proceedings of the IEEE/RSJ international conference on intelligent robots and systems

  17. Kober J, Peters J (2009) Learning motor primitives for robotics. In: Proceedings of the IEEE international conference on robotics and automation

  18. Kolter JZ, Abbeel P, Ng AY (2008) Hierarchical apprenticeship learning with application to quadruped locomotion. In: Proceedings of advances in neural information processing systems

  19. Matarić MJ (2002) Sensory-motor primitives as a basis for learning by imitation: Linking perception to action and biology to robotics. In: Dautenhahn K, Nehaniv CL (eds) Imitation in animals and artifacts. MIT Press, Cambridge, Chap. 15

  20. Nehaniv CL, Dautenhahn K (2002) The correspondence problem. In: Dautenhahn K, Nehaniv CL (eds) Imitation in animals and artifacts. MIT Press, Cambridge, Chap. 2

  21. Nicolescu M, Matarić M (2003) Methods for robot task learning: Demonstrations, generalization and practice. In: Proceedings of the second international joint conference on autonomous agents and multi-agent systems

  22. Pastor P, Kalakrishnan M, Chitta S, Theodorou E, Schaal S (2011) Skill learning and task outcome prediction for manipulation. In: Proceedings of IEEE international conference on robotics and automation

  23. Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7–9):1180–1190

  24. Ratliff N, Bradley D, Bagnell JA, Chestnutt J (2007) Boosting structured prediction for imitation learning. In: Proceedings of advances in neural information processing systems

  25. Smart WD (2002) Making reinforcement learning work on real robots. Ph.D. thesis, Department of Computer Science, Brown University, Providence, RI

Acknowledgements

The research is partly sponsored by the Boeing Corporation under Grant No. CMU-BA-GTA-1, BBNT Solutions under subcontract No. 950008572, via prime Air Force contract No. SA-8650-06-C-7606, the United States Department of the Interior under Grant No. NBCH-1040007 and the Qatar Foundation for Education, Science and Community Development. The views and conclusions contained in this document are solely those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.

Corresponding author

Correspondence to Brenna D. Argall.

About this article

Cite this article

Argall, B.D., Browning, B. & Veloso, M.M. Policy Feedback for the Refinement of Learned Motion Control on a Mobile Robot. Int J of Soc Robotics 4, 383–395 (2012). https://doi.org/10.1007/s12369-012-0156-9

  • DOI: https://doi.org/10.1007/s12369-012-0156-9
