Abstract
Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction. In this work, we present a robot system that follows unconstrained language instructions to pick and place arbitrary objects and effectively resolves ambiguities through dialogues. Our approach infers objects and their relationships from input images and language expressions and can place objects in accordance with the spatial relations expressed by the user. Unlike previous approaches, we consider grounding not only for the picking but also for the placement of everyday objects from language. Specifically, by grounding objects and their spatial relations, we allow specification of complex placement instructions, e.g. “place it behind the middle red bowl”. Our results obtained using a real-world PR2 robot demonstrate the effectiveness of our method in understanding pick-and-place language instructions and sequentially composing them to solve tabletop manipulation tasks. Videos are available at http://speechrobot.cs.uni-freiburg.de.
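To make the composition the abstract describes concrete, below is a minimal, hypothetical sketch of one pick-and-place step: ground the referring expression to an object, resolve ambiguity through dialogue if needed, ground the spatial relation to a placement location, then execute. Every name here (Candidate, ground_pick_target, ground_placement, ask_user_to_disambiguate) is an illustrative assumption rather than the paper's actual API, and the learned grounding models are left as stubs.

```python
# Hypothetical sketch of the pick-and-place grounding loop described in the
# abstract. All names and signatures are illustrative assumptions; the paper's
# actual grounding networks are not reproduced here.
from dataclasses import dataclass


@dataclass
class Candidate:
    box: tuple    # (x, y, w, h) region of a detected object in the image
    score: float  # grounding confidence for the referring expression


def ground_pick_target(image, expression):
    """Ground a referring expression such as 'the middle red bowl' to
    candidate objects, sorted by descending confidence.
    Stub: stands in for the referring-expression grounding model."""
    raise NotImplementedError


def ground_placement(image, relation, anchor_box):
    """Ground a spatial relation such as 'behind' relative to an anchor
    object to a target (x, y) placement location in the image.
    Stub: stands in for the spatial-relation placement model."""
    raise NotImplementedError


def is_ambiguous(candidates, margin=0.1):
    """Treat the grounding as ambiguous when the top two scores are
    close; the paper resolves such cases through dialogue."""
    return len(candidates) > 1 and candidates[0].score - candidates[1].score < margin


def execute_instruction(robot, image, pick_expr, relation, anchor_expr):
    """Compose one pick-and-place step: ground the object to pick,
    ground the placement location, then execute both motions."""
    picks = ground_pick_target(image, pick_expr)
    pick = robot.ask_user_to_disambiguate(picks) if is_ambiguous(picks) else picks[0]
    anchor = ground_pick_target(image, anchor_expr)[0]  # e.g. 'the middle red bowl'
    target_xy = ground_placement(image, relation, anchor.box)
    robot.pick(pick.box)
    robot.place(target_xy)
```

The dialogue branch mirrors the ambiguity resolution mentioned in the abstract; chaining calls to execute_instruction over successive instructions corresponds to the sequential composition the paper demonstrates on the PR2.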
Notes
1. Further quantitative experiments were infeasible at the time of submission due to COVID-19.
Acknowledgments
This work was supported in part by the Freiburg Graduate School of Robotics and the German Federal Ministry of Education and Research under contract number 01IS18040B-OML. We thank Henrich Kolkhorst for his contributions to the speech-to-text pipeline and Andreas Eitel for valuable discussions.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mees, O., Burgard, W. (2021). Composing Pick-and-Place Tasks by Grounding Language. In: Siciliano, B., Laschi, C., Khatib, O. (eds) Experimental Robotics. ISER 2020. Springer Proceedings in Advanced Robotics, vol 19. Springer, Cham. https://doi.org/10.1007/978-3-030-71151-1_43
DOI: https://doi.org/10.1007/978-3-030-71151-1_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71150-4
Online ISBN: 978-3-030-71151-1
eBook Packages: Intelligent Technologies and Robotics (R0)