
Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video

  • Conference paper
  • Published in: Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12346)

Included in the conference series: European Conference on Computer Vision (ECCV)


Abstract

We address the challenging task of anticipating human-object interactions in first person videos. Most existing methods either ignore how the camera wearer interacts with objects, or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this observation, we adopt intentional hand movement as a feature representation, and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots, and future action. Specifically, we treat the future hand motion as motor attention, and model this attention using probabilistic variables in our deep model. The predicted motor attention is further used to select discriminative spatio-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Our project page is available at https://aptx4869lm.github.io/ForecastingHOI/.
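To make the mechanism described in the abstract concrete, the sketch below shows one way such a motor-attention head could be wired up in PyTorch: attention logits are predicted over a spatio-temporal feature grid, sampled as probabilistic variables (here via a Gumbel-Softmax relaxation), and used to pool features for action anticipation. This is a minimal illustration of the general idea, not the authors' implementation; the module name `MotorAttentionHead`, all layer shapes, the choice of relaxation, and the class count are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotorAttentionHead(nn.Module):
    """Toy head (hypothetical): probabilistic motor attention over
    space-time, followed by attention-weighted pooling for action
    anticipation."""

    def __init__(self, in_channels: int, num_actions: int):
        super().__init__()
        # 1x1x1 conv maps backbone features to per-location attention logits.
        self.attn_logits = nn.Conv3d(in_channels, 1, kernel_size=1)
        self.action_fc = nn.Linear(in_channels, num_actions)

    def forward(self, feats: torch.Tensor, tau: float = 1.0):
        # feats: (B, C, T, H, W) spatio-temporal features from a video backbone.
        b, c, t, h, w = feats.shape
        logits = self.attn_logits(feats).flatten(1)        # (B, T*H*W)
        if self.training:
            # Sample a stochastic attention map with the Gumbel-Softmax
            # relaxation, one standard way to backpropagate through
            # discrete/probabilistic attention variables.
            attn = F.gumbel_softmax(logits, tau=tau, dim=-1)
        else:
            attn = F.softmax(logits, dim=-1)
        attn_map = attn.view(b, 1, t, h, w)                # "motor attention"
        # Use the attention to softly select discriminative features.
        pooled = (feats * attn_map).flatten(2).sum(dim=2)  # (B, C)
        action_logits = self.action_fc(pooled)             # future action
        # The same map could be decoded into interaction hotspots; a full
        # hotspot branch is omitted here for brevity.
        return action_logits, attn_map


# Illustrative usage with made-up shapes:
head = MotorAttentionHead(in_channels=1024, num_actions=106)
feats = torch.randn(2, 1024, 8, 14, 14)  # two clips of backbone features
action_logits, attention = head(feats)
print(action_logits.shape, attention.shape)  # (2, 106), (2, 1, 8, 14, 14)
```

The design point this sketch captures is that the attention map is itself a predicted, uncertain quantity rather than a deterministic saliency output, so it can serve double duty: weighting features for the action classifier and acting as a spatial prior for where the hand-object interaction will occur.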



Notes

  1. See supplementary material for the derivation.


Acknowledgments

Portions of this research were supported in part by National Science Foundation Award 1936970 and a gift from Facebook. YL acknowledges the support from the Wisconsin Alumni Research Foundation.

Author information


Corresponding author

Correspondence to Miao Liu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 1022 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, M., Tang, S., Li, Y., Rehg, J.M. (2020). Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_41


  • DOI: https://doi.org/10.1007/978-3-030-58452-8_41


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58451-1

  • Online ISBN: 978-3-030-58452-8

  • eBook Packages: Computer Science (R0)
