Abstract
Prior works have addressed the problem of driver intention prediction (DIP) by identifying maneuvers after their onset. On the other hand, early anticipation is equally important in scenarios that demand a preemptive response before a maneuver begins. However, no prior work addresses driver action anticipation before the onset of the maneuver, limiting the ability of advanced driver assistance systems (ADAS) to anticipate maneuvers early. In this work, we introduce Anticipating Driving Maneuvers (ADM), a new task that enables driver action anticipation before the onset of the maneuver. To initiate research on the ADM task, we curate the Driving Action Anticipation Dataset (DAAD), which is multi-view (in- and out-cabin views in dense and heterogeneous scenarios) and multimodal (egocentric view and gaze information). The dataset captures sequences both before the initiation and during the execution of a maneuver. During dataset collection, we also ensure a wide diversity of traffic scenarios, weather and illumination, and driveway conditions. Next, we propose a strong baseline based on a transformer architecture to effectively model multiple views and modalities over longer video lengths. We benchmark the existing DIP methods on DAAD and related datasets. Finally, we perform an ablation study showing the effectiveness of multiple views and modalities in maneuver anticipation. Project Page: https://cvit.iiit.ac.in/research/projects/cvit-projects/daad.
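As a rough illustration of the kind of baseline the abstract describes, the sketch below shows one common way to fuse pre-extracted per-view/per-modality clip features with a standard transformer encoder and classify the upcoming maneuver. This is a minimal PyTorch sketch under assumed dimensions and class counts, not the authors' actual architecture; all names in it (MultiViewAnticipationBaseline, feat_dim, num_views, etc.) are placeholders.

```python
# Minimal illustrative sketch (not the paper's model): fuse per-view/per-modality
# clip features with a transformer encoder and predict the upcoming maneuver.
import torch
import torch.nn as nn

class MultiViewAnticipationBaseline(nn.Module):
    def __init__(self, num_views=3, feat_dim=768, d_model=512,
                 num_classes=5, num_layers=4, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)          # shared projection of backbone features
        self.view_emb = nn.Embedding(num_views, d_model)  # learnable per-view/modality embedding
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)       # maneuver classes (placeholder count)

    def forward(self, feats):
        # feats: (batch, num_views, num_frames, feat_dim) features per view/modality
        x = self.proj(feats)                               # (b, v, t, d_model)
        x = x + self.view_emb.weight[None, :, None, :]     # tag tokens with their view identity
        x = x.flatten(1, 2)                                # (b, v*t, d_model) joint token sequence
        x = self.encoder(x)                                # cross-view/cross-time attention
        return self.head(x.mean(dim=1))                    # pooled logits over maneuver classes

if __name__ == "__main__":
    model = MultiViewAnticipationBaseline()
    logits = model(torch.randn(2, 3, 16, 768))             # 2 clips, 3 views, 16 frames each
    print(logits.shape)                                     # torch.Size([2, 5])
```

A natural design choice here is to fuse all views in a single token sequence (early/joint fusion) rather than classifying each view separately and averaging, so attention can relate in-cabin cues (e.g., gaze, head pose) to the out-cabin scene before the maneuver begins.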
Notes
- 1.
By “anticipate”, we refer to the model’s ability to predict a maneuver a few seconds before its actual execution.
- 2.
We use “multi-view” for more than two views. None of the aforementioned datasets other than AIDE [50] is multi-view. However, it has only 3 maneuver classes with 3-second-long videos.
References
Aliakbarian, M.S., Saleh, F.S., Salzmann, M., Fernando, B., Petersson, L., Andersson, L.: VIENA2: a driving anticipation dataset (2018)
Amadori, P.V., Fischer, T., Wang, R., Demiris, Y.: Decision anticipation for driving assistance systems. In: ITSC, pp. 1–7. IEEE (2020)
Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: ICCV, pp. 609–617 (2017)
Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A.: A short note about Kinetics-600. CoRR (2018)
Damen, D., et al.: Scaling egocentric vision: the EPIC-KITCHENS dataset. In: ECCV, pp. 720–736 (2018)
Damen, D., et al.: Rescaling egocentric vision: collection, pipeline and challenges for EPIC-KITCHENS-100. IJCV, pp. 1–23 (2022)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: ICCV, pp. 2758–2766 (2015)
Dutta, A., Zisserman, A.: The VIA annotation software for images, audio and video. In: ACM Multimedia, pp. 2276–2279 (2019)
Fan, H., et al.: Multiscale vision transformers. In: ICCV, pp. 6824–6835 (2021)
Furnari, A., Farinella, G.M.: What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In: ICCV, pp. 6252–6261 (2019)
Furnari, A., Farinella, G.M.: Rolling-unrolling LSTMs for action anticipation from first-person video. IEEE TPAMI 43(11), 4021–4036 (2020)
Gao, J., Yang, Z., Nevatia, R.: RED: reinforced encoder-decoder networks for action anticipation. In: BMVC. BMVA Press (2017)
Gebert, P., Roitberg, A., Haurilet, M., Stiefelhagen, R.: End-to-end prediction of driver intention using 3D convolutional neural networks. In: IEEE Intelligent Vehicles Symposium (IV), pp. 969–974 (2019)
Girase, H., Agarwal, N., Choi, C., Mangalam, K.: Latency matters: real-time action forecasting transformer. In: CVPR, pp. 18759–18769 (2023)
Girdhar, R., Grauman, K.: Anticipative video transformer. In: ICCV, pp. 13505–13515 (2021)
Girdhar, R., Singh, M., Ravi, N., van der Maaten, L., Joulin, A., Misra, I.: Omnivore: a single model for many visual modalities. In: CVPR, pp. 16102–16112 (2022)
Gong, D., Lee, J., Kim, M., Ha, S.J., Cho, M.: Future transformer for long-term action anticipation. In: CVPR, pp. 3052–3061 (2022)
Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: CVPR, pp. 6546–6555 (2018)
Huang, D.-A., Kitani, K.M.: Action-reaction: forecasting the dynamics of human interaction. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 489–504. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_32
Jain, A., Koppula, H.S., Soh, S., Raghavan, B., Saxena, A.: Car that knows before you do: anticipating maneuvers via learning temporal driving models. In: ICCV (2015)
Jain, A., Singh, A., Koppula, H.S., Soh, S., Saxena, A.: Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In: ICRA, pp. 3118–3125. IEEE (2016)
Kasahara, I., Stent, S., Park, H.S.: Look both ways: Self-supervising driver gaze estimation and road scene saliency. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13673, pp. 126–142. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19778-9_8
Khairdoost, N., Shirpour, M., Bauer, M.A., Beauchemin, S.S.: Real-time driver maneuver prediction using LSTM. IEEE Trans. Intell. Veh. 5(4), 714–724 (2020)
Koppula, H.S., Saxena, A.: Anticipating human activities using object affordances for reactive robotic response. IEEE TPAMI 14–29 (2015)
Li, Y., et al.: MViTv2: improved multiscale vision transformers for classification and detection. In: CVPR, pp. 4804–4814 (2022)
Liu, C., Chen, Y., Tai, L., Ye, H., Liu, M., Shi, B.E.: A gaze model improves autonomous driving. In: ACM Symposium on Eye Tracking Research & Applications, pp. 1–5 (2019)
Liu, M., Tang, S., Li, Y., Rehg, J.M.: Forecasting human-object interaction: joint prediction of motor attention and actions in first person video. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 704–721. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_41
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (Poster) (2019)
Ma, Y., et al.: CEMFormer: learning to predict driver intentions from in-cabin and external cameras via spatial-temporal transformers. In: ITSC, pp. 4960–4966. IEEE (2023)
Nagarajan, T., Li, Y., Feichtenhofer, C., Grauman, K.: Ego-Topo: environment affordances from egocentric video. In: CVPR, pp. 163–172 (2020)
Nawhal, M., Jyothi, A.A., Mori, G.: Rethinking learning approaches for long-term action anticipation. In: ECCV, pp. 558–576 (2022)
Pal, A., Mondal, S., Christensen, H.I.: Looking at the right stuff - guided semantic-gaze for autonomous driving. In: CVPR, pp. 11883–11892 (2020)
Palazzi, A., Abati, D., Solera, F., Cucchiara, R., et al.: Predicting the driver’s focus of attention: the DR(eye)VE project. IEEE TPAMI 41(7), 1720–1733 (2018)
Pang, B., Zha, K., Cao, H., Shi, C., Lu, C.: Deep RNN framework for visual sequential applications. In: CVPR, pp. 423–432 (2019)
Ramanishka, V., Chen, Y.T., Misu, T., Saenko, K.: Toward driving scene understanding: a dataset for learning driver behavior and causal reasoning. In: CVPR (2018)
Rong, Y., Akata, Z., Kasneci, E.: Driver intention anticipation based on in-cabin and driving scene monitoring. In: IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1–8 (2020)
Sandler, M., Zhmoginov, A., Vladymyrov, M., Jackson, A.: Fine-tuning image transformers using learnable memory. In: CVPR, pp. 12155–12164 (2022)
Sener, F., Singhania, D., Yao, A.: Temporal aggregate representations for long-range video understanding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12361, pp. 154–171. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58517-4_10
Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: NeurIPS, vol. 28 (2015)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NeurIPS, vol. 27 (2014)
Somasundaram, K., et al.: Project Aria: a new tool for egocentric multi-modal AI research. CoRR (2023)
Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: ICCV, pp. 7464–7473 (2019)
Tziafas, G., Kasaei, H.: Early or late fusion matters: efficient RGB-D fusion in vision transformers for 3D object recognition. In: IROS, pp. 9558–9565. IEEE (2023)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS, vol. 30 (2017)
Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating visual representations from unlabeled video. In: CVPR, pp. 98–106 (2016)
Wu, C.Y., et al.: MeMViT: memory-augmented multiscale vision transformer for efficient long-term video recognition. In: CVPR, pp. 13587–13597 (2022)
Wu, M., et al.: Gaze-based intention anticipation over driving manoeuvres in semi-autonomous vehicles. In: IROS, pp. 6210–6216. IEEE (2019)
Xia, Y., Zhang, D., Kim, J., Nakayama, K., Zipser, K., Whitney, D.: Predicting driver attention in critical situations. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11365, pp. 658–674. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20873-8_42
Yang, D., et al.: AIDE: a vision-driven multi-view, multi-modal, multi-tasking dataset for assistive driving perception. In: ICCV, pp. 20459–20470 (2023)
Zhong, Z., Schneider, D., Voit, M., Stiefelhagen, R., Beyerer, J.: Anticipative feature fusion transformer for multi-modal action anticipation. In: CVPR, pp. 6068–6077 (2023)
Zhou, F., Yang, X.J., De Winter, J.C.: Using eye-tracking data to predict situation awareness in real time during takeover transitions in conditionally automated driving. IEEE TITS 23(3), 2284–2295 (2021)
Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: CVPR, pp. 2349–2358 (2017)
Acknowledgements
This work is supported by iHub-Data and Mobility at IIIT Hyderabad and Project Aria from Meta.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wasi, A., Gangisetty, S., Rai, S.N., Jawahar, C.V. (2025). Early Anticipation of Driving Maneuvers. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_9
DOI: https://doi.org/10.1007/978-3-031-72897-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72896-9
Online ISBN: 978-3-031-72897-6
eBook Packages: Computer Science, Computer Science (R0)