Abstract
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works, and then use this learned model to plan coordinated sequences of actions to bring about desired outcomes. However, learning a model that captures the dynamics of complex skills represents a major challenge: if the agent needs a good model to perform these skills, it might never be able to collect the experience on its own that is required to learn these delicate and complex behaviors. Instead, we can imagine augmenting the training set with observational data of other agents, such as humans. Such data is likely more plentiful, but cannot always be combined with data from the original agent. For example, videos of humans might show a robot how to use a tool, but (i) are not annotated with suitable robot actions, and (ii) contain a systematic distributional shift due to the embodiment differences between humans and robots. We address the first challenge by formulating the corresponding graphical model and treating the action as an observed variable for the interaction data and an unobserved variable for the observation data, and the second challenge by using a domain-dependent prior. In addition to interaction data, our method is able to leverage videos of passive observations in a driving dataset and a dataset of robotic manipulation videos to improve video prediction performance. In a real-world tabletop robotic manipulation setting, our method is able to significantly improve control performance by learning a model from both robot data and observations of humans.
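To make the modeling idea concrete, the following is a minimal sketch, not the paper's video-prediction architecture, of how actions can be treated as observed variables for robot interaction data and as latent variables, regularized toward a domain-dependent prior, for action-free observation data such as human videos. The MLP dynamics, the Gaussian per-domain prior, and all names (e.g. `LatentActionPredictor`, `action_encoder`) are illustrative assumptions rather than the authors' code.

```python
# Sketch only: actions observed for interaction data, latent for observation data,
# with a per-domain Gaussian prior standing in for the paper's domain-dependent prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionPredictor(nn.Module):
    def __init__(self, obs_dim=64, action_dim=4, num_domains=2):
        super().__init__()
        # Inference network q(a_t | o_t, o_{t+1}): used only when actions are unobserved.
        self.action_encoder = nn.Sequential(
            nn.Linear(2 * obs_dim, 128), nn.ReLU(), nn.Linear(128, 2 * action_dim))
        # Dynamics model p(o_{t+1} | o_t, a_t), shared across both data sources.
        self.dynamics = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))
        # Domain-dependent action prior: one Gaussian per domain (e.g. robot vs. human).
        self.prior_mu = nn.Parameter(torch.zeros(num_domains, action_dim))
        self.prior_logvar = nn.Parameter(torch.zeros(num_domains, action_dim))

    def forward(self, obs, next_obs, action=None, domain=0):
        mu_p, logvar_p = self.prior_mu[domain], self.prior_logvar[domain]
        if action is None:
            # Observation-only data: infer the action and penalize its divergence
            # from the domain prior (KL between two diagonal Gaussians).
            mu_q, logvar_q = self.action_encoder(
                torch.cat([obs, next_obs], dim=-1)).chunk(2, dim=-1)
            action = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
            kl = 0.5 * (logvar_p - logvar_q - 1
                        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                        ).sum(-1).mean()
        else:
            # Interaction data: the action is observed, so no inference or KL term.
            kl = torch.tensor(0.0)
        pred = self.dynamics(torch.cat([obs, action], dim=-1))
        return F.mse_loss(pred, next_obs) + kl

# Usage: robot batches supply actions; human (observation-only) batches do not.
model = LatentActionPredictor()
obs, next_obs = torch.randn(8, 64), torch.randn(8, 64)
robot_loss = model(obs, next_obs, action=torch.randn(8, 4), domain=0)
human_loss = model(obs, next_obs, action=None, domain=1)
(robot_loss + human_loss).backward()
```

The point of the sketch is that both data sources train the same dynamics model: interaction batches supply ground-truth actions directly, while observation-only batches infer them through the encoder and pay a KL penalty toward that domain's prior, which is how the missing-action and domain-shift issues are handled jointly.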
Notes
1. Data will be made available at https://sites.google.com/view/lpmfoai.
Acknowledgements
We thank Karl Pertsch, Drew Jaegle, Marvin Zhang, and Kenneth Chaney. This work was supported by the NSF GRFP, ARL RCTA W911NF-10-2-0016, ARL DCIST CRA W911NF-17-2-0181, and by Honda Research Institute.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Schmeckpeper, K., et al. (2020). Learning Predictive Models from Observation and Interaction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12365. Springer, Cham. https://doi.org/10.1007/978-3-030-58565-5_42
DOI: https://doi.org/10.1007/978-3-030-58565-5_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58564-8
Online ISBN: 978-3-030-58565-5
eBook Packages: Computer Science (R0)