Abstract
Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction. In this work, we present a robot system that follows unconstrained language instructions to pick and place arbitrary objects and effectively resolves ambiguities through dialogues. Our approach infers objects and their relationships from input images and language expressions and can place objects in accordance with the spatial relations expressed by the user. Unlike previous approaches, we consider grounding not only for the picking but also for the placement of everyday objects from language. Specifically, by grounding objects and their spatial relations, we allow specification of complex placement instructions, e.g. “place it behind the middle red bowl”. Our results obtained using a real-world PR2 robot demonstrate the effectiveness of our method in understanding pick-and-place language instructions and sequentially composing them to solve tabletop manipulation tasks. Videos are available at http://speechrobot.cs.uni-freiburg.de.
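To make the composition the abstract describes concrete, below is a minimal, hypothetical sketch of one pick-and-place step: ground the referring expression to an object, resolve ambiguity through dialogue if needed, ground the spatial relation to a placement location, then execute. Every name here (Candidate, ground_pick_target, ground_placement, ask_user_to_disambiguate) is an illustrative assumption rather than the paper's actual API, and the learned grounding models are left as stubs.

```python
# Hypothetical sketch of the pick-and-place grounding loop described in the
# abstract. All names and signatures are illustrative assumptions; the paper's
# actual grounding networks are not reproduced here.
from dataclasses import dataclass


@dataclass
class Candidate:
    box: tuple    # (x, y, w, h) region of a detected object in the image
    score: float  # grounding confidence for the referring expression


def ground_pick_target(image, expression):
    """Ground a referring expression such as 'the middle red bowl' to
    candidate objects, sorted by descending confidence.
    Stub: stands in for the referring-expression grounding model."""
    raise NotImplementedError


def ground_placement(image, relation, anchor_box):
    """Ground a spatial relation such as 'behind' relative to an anchor
    object to a target (x, y) placement location in the image.
    Stub: stands in for the spatial-relation placement model."""
    raise NotImplementedError


def is_ambiguous(candidates, margin=0.1):
    """Treat the grounding as ambiguous when the top two scores are
    close; the paper resolves such cases through dialogue."""
    return len(candidates) > 1 and candidates[0].score - candidates[1].score < margin


def execute_instruction(robot, image, pick_expr, relation, anchor_expr):
    """Compose one pick-and-place step: ground the object to pick,
    ground the placement location, then execute both motions."""
    picks = ground_pick_target(image, pick_expr)
    pick = robot.ask_user_to_disambiguate(picks) if is_ambiguous(picks) else picks[0]
    anchor = ground_pick_target(image, anchor_expr)[0]  # e.g. 'the middle red bowl'
    target_xy = ground_placement(image, relation, anchor.box)
    robot.pick(pick.box)
    robot.place(target_xy)
```

The dialogue branch mirrors the ambiguity resolution mentioned in the abstract; chaining calls to execute_instruction over successive instructions corresponds to the sequential composition the paper demonstrates on the PR2.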
Notes
1. Further quantitative experiments were infeasible at the time of submission due to COVID-19.
Acknowledgments
This work was supported in part by the Freiburg Graduate School of Robotics and the German Federal Ministry of Education and Research under contract number 01IS18040B-OML. We thank Henrich Kolkhorst for his contributions to the speech-to-text pipeline and Andreas Eitel for valuable discussions.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mees, O., Burgard, W. (2021). Composing Pick-and-Place Tasks by Grounding Language. In: Siciliano, B., Laschi, C., Khatib, O. (eds) Experimental Robotics. ISER 2020. Springer Proceedings in Advanced Robotics, vol 19. Springer, Cham. https://doi.org/10.1007/978-3-030-71151-1_43
DOI: https://doi.org/10.1007/978-3-030-71151-1_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71150-4
Online ISBN: 978-3-030-71151-1
eBook Packages: Intelligent Technologies and Robotics (R0)