Abstract
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works, and then use this learned model to plan coordinated sequences of actions to bring about desired outcomes. However, learning a model that captures the dynamics of complex skills represents a major challenge: if the agent needs a good model to perform these skills, it might never be able to collect the experience on its own that is required to learn these delicate and complex behaviors. Instead, we can imagine augmenting the training set with observational data of other agents, such as humans. Such data is likely more plentiful, but cannot always be combined with data from the original agent. For example, videos of humans might show a robot how to use a tool, but (i) are not annotated with suitable robot actions, and (ii) contain a systematic distributional shift due to the embodiment differences between humans and robots. We address the first challenge by formulating the corresponding graphical model and treating the action as an observed variable for the interaction data and an unobserved variable for the observation data, and the second challenge by using a domain-dependent prior. In addition to interaction data, our method is able to leverage videos of passive observations in a driving dataset and a dataset of robotic manipulation videos to improve video prediction performance. In a real-world tabletop robotic manipulation setting, our method is able to significantly improve control performance by learning a model from both robot data and observations of humans.
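To make the modeling idea concrete, the following is a minimal sketch, not the paper's video-prediction architecture, of how actions can be treated as observed variables for robot interaction data and as latent variables, regularized toward a domain-dependent prior, for action-free observation data such as human videos. The MLP dynamics, the Gaussian per-domain prior, and all names (e.g. `LatentActionPredictor`, `action_encoder`) are illustrative assumptions rather than the authors' code.

```python
# Sketch only: actions observed for interaction data, latent for observation data,
# with a per-domain Gaussian prior standing in for the paper's domain-dependent prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionPredictor(nn.Module):
    def __init__(self, obs_dim=64, action_dim=4, num_domains=2):
        super().__init__()
        # Inference network q(a_t | o_t, o_{t+1}): used only when actions are unobserved.
        self.action_encoder = nn.Sequential(
            nn.Linear(2 * obs_dim, 128), nn.ReLU(), nn.Linear(128, 2 * action_dim))
        # Dynamics model p(o_{t+1} | o_t, a_t), shared across both data sources.
        self.dynamics = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))
        # Domain-dependent action prior: one Gaussian per domain (e.g. robot vs. human).
        self.prior_mu = nn.Parameter(torch.zeros(num_domains, action_dim))
        self.prior_logvar = nn.Parameter(torch.zeros(num_domains, action_dim))

    def forward(self, obs, next_obs, action=None, domain=0):
        mu_p, logvar_p = self.prior_mu[domain], self.prior_logvar[domain]
        if action is None:
            # Observation-only data: infer the action and penalize its divergence
            # from the domain prior (KL between two diagonal Gaussians).
            mu_q, logvar_q = self.action_encoder(
                torch.cat([obs, next_obs], dim=-1)).chunk(2, dim=-1)
            action = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
            kl = 0.5 * (logvar_p - logvar_q - 1
                        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                        ).sum(-1).mean()
        else:
            # Interaction data: the action is observed, so no inference or KL term.
            kl = torch.tensor(0.0)
        pred = self.dynamics(torch.cat([obs, action], dim=-1))
        return F.mse_loss(pred, next_obs) + kl

# Usage: robot batches supply actions; human (observation-only) batches do not.
model = LatentActionPredictor()
obs, next_obs = torch.randn(8, 64), torch.randn(8, 64)
robot_loss = model(obs, next_obs, action=torch.randn(8, 4), domain=0)
human_loss = model(obs, next_obs, action=None, domain=1)
(robot_loss + human_loss).backward()
```

The point of the sketch is that both data sources train the same dynamics model: interaction batches supply ground-truth actions directly, while observation-only batches infer them through the encoder and pay a KL penalty toward that domain's prior, which is how the missing-action and domain-shift issues are handled jointly.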
Notes
1. Data will be made available at https://sites.google.com/view/lpmfoai.
Acknowledgements
We thank Karl Pertsch, Drew Jaegle, Marvin Zhang, and Kenneth Chaney. This work was supported by the NSF GRFP, ARL RCTA W911NF-10-2-0016, ARL DCIST CRA W911NF-17-2-0181, and by Honda Research Institute.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Schmeckpeper, K., et al. (2020). Learning Predictive Models from Observation and Interaction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12365. Springer, Cham. https://doi.org/10.1007/978-3-030-58565-5_42
DOI: https://doi.org/10.1007/978-3-030-58565-5_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58564-8
Online ISBN: 978-3-030-58565-5
eBook Packages: Computer Science (R0)