
Fusing Hand and Body Skeletons for Human Action Recognition in Assembly

  • Conference paper
  • In: Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Abstract

As collaborative robots (cobots) continue to gain popularity in industrial manufacturing, effective human-robot collaboration becomes crucial. Cobots should be able to recognize human actions to assist with assembly tasks and act autonomously. To achieve this, skeleton-based approaches are often used due to their ability to generalize across various people and environments. Although body skeleton approaches are widely used for action recognition, they may not be accurate enough for assembly actions where the worker’s fingers and hands play a significant role. To address this limitation, we propose a method in which less detailed body skeletons are combined with highly detailed hand skeletons. We investigate CNNs and transformers, the latter of which are particularly adept at extracting and combining important information from both skeleton types using attention. This paper demonstrates the effectiveness of our proposed approach in enhancing action recognition in assembly scenarios.
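
To make the fusion idea above concrete, the following is a minimal, hypothetical PyTorch sketch of attention-based fusion of body and hand keypoints. It is not the architecture from the paper: the joint counts, embedding size, depth, and class-token pooling are assumptions, and temporal modeling across frames is omitted for brevity.

```python
import torch
import torch.nn as nn


class SkeletonFusionTransformer(nn.Module):
    """Toy fusion of body and hand keypoints with self-attention.

    Hypothetical sketch: joint counts, embedding size, depth, and the
    class-token pooling are assumptions, not the paper's architecture,
    and temporal modeling across frames is omitted.
    """

    def __init__(self, body_joints=17, hand_joints=2 * 21,
                 coord_dim=3, embed_dim=128, num_classes=50):
        super().__init__()
        # Separate linear embeddings for body and hand joint coordinates
        self.body_embed = nn.Linear(coord_dim, embed_dim)
        self.hand_embed = nn.Linear(coord_dim, embed_dim)
        # Learnable class token plus a per-joint position encoding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos = nn.Parameter(
            torch.zeros(1, 1 + body_joints + hand_joints, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, body, hands):
        # body:  (batch, body_joints, coord_dim), a coarse body skeleton
        # hands: (batch, hand_joints, coord_dim), e.g. 21 keypoints per hand
        tokens = torch.cat([self.body_embed(body),
                            self.hand_embed(hands)], dim=1)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos  # add position encoding
        out = self.encoder(x)              # attention mixes body and hand cues
        return self.head(out[:, 0])        # classify from the fused class token
```

For a batch of 8 frames with 17 body joints and 2×21 hand keypoints, `SkeletonFusionTransformer()(torch.randn(8, 17, 3), torch.randn(8, 42, 3))` returns logits of shape (8, 50).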

This work has received funding from the Carl-Zeiss-Stiftung as part of the project Engineering for Smart Manufacturing (E4SM).



Notes

  1. Our preliminary experiments on the ATTACH dataset using hand skeletons alone showed far inferior results compared to using body skeletons alone; hand skeletons on their own are therefore not investigated further.

  2. For ResNet, the image is resized to \(224{\times }224\) with pixel values ranging from 0 to 255. For Swin, we use a resolution of \(256{\times }256\) with pixel values from 0 to 1. A minimal preprocessing sketch follows these notes.
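
As a rough illustration of the preprocessing in note 2, the sketch below resizes a skeleton pseudo-image and scales its pixel values per backbone. The use of OpenCV and the function name are assumptions for illustration; only the target resolutions and value ranges come from the note above.

```python
import cv2  # hypothetical choice of resizing library
import numpy as np


def preprocess_pseudo_image(img: np.ndarray, backbone: str = "resnet") -> np.ndarray:
    """Resize a pseudo-image and scale pixel values for the chosen backbone.

    Illustrative sketch only; the function name and resizing method are
    assumptions. `img` is expected as an HxWxC uint8 array in 0..255.
    """
    if backbone == "resnet":
        out = cv2.resize(img, (224, 224))      # ResNet input resolution
        return out.astype(np.float32)          # keep values in 0..255
    if backbone == "swin":
        out = cv2.resize(img, (256, 256))      # Swin input resolution
        return out.astype(np.float32) / 255.0  # scale values to 0..1
    raise ValueError(f"unknown backbone: {backbone}")
```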

References

  1. Aganian, D., Köhler, M., Baake, S., Eisenbach, M., Gross, H.M.: How object information improves skeleton-based human action recognition in assembly tasks. In: IEEE International Joint Conference on Neural Networks (IJCNN) (2023)

  2. Aganian, D., Stephan, B., Eisenbach, M., Stretz, C., Gross, H.M.: ATTACH dataset: annotated two-handed assembly actions for human action understanding. In: IEEE International Conference on Robotics and Automation (ICRA) (2023)

  3. Ben-Shabat, Y., et al.: The IKEA ASM dataset: understanding people assembling furniture through actions, objects and pose. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2021)

  4. Du, Y., Fu, Y., Wang, L.: Skeleton based action recognition with convolutional neural network. In: IEEE IAPR Asian Conference on Pattern Recognition (ACPR) (2015)

  5. Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  6. Eisenbach, M., Aganian, D., Köhler, M., Stephan, B., Schröter, C., Gross, H.M.: Visual scene understanding for enabling situation-aware cobots. In: IEEE International Conference on Automation Science and Engineering (CASE) (2021)

  7. Fischedick, S., Seichter, D., Schmidt, R., Rabes, L., Gross, H.M.: Efficient multi-task scene analysis with RGB-D transformers. In: IEEE International Joint Conference on Neural Networks (IJCNN) (2023)

  8. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: IEEE International Conference on Computer Vision (ICCV) (2017)

  9. Inkulu, A.K., Bahubalendruni, M.R., Dara, A., SankaranarayanaSamy, K.: Challenges and opportunities in human robot collaboration context of industry 4.0 - a state of the art review. Ind. Robot: Int. J. Robot. Res. Appl. 49(2) (2021)

  10. Liu, Z., et al.: Swin transformer v2: scaling up capacity and resolution. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  11. Mazzia, V., Angarano, S., Salvetti, F., Angelini, F., Chiaberge, M.: Action transformer: a self-attention model for short-time pose-based human action recognition. Pattern Recogn. 124 (2022)

  12. Ragusa, F., Furnari, A., Livatino, S., Farinella, G.M.: The MECCANO dataset: understanding human-object interactions from egocentric videos in an industrial-like domain. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2021)

  13. Seichter, D., Köhler, M., Lewandowski, B., Wengefeld, T., Gross, H.M.: Efficient RGB-D semantic segmentation for indoor scene analysis. In: International Conference on Robotics and Automation (ICRA) (2021)

  14. Sener, F., et al.: Assembly101: a large-scale multi-view video dataset for understanding procedural activities. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  15. Terreran, M., Lazzaretto, M., Ghidoni, S.: Skeleton-based action and gesture recognition for human-robot collaboration. In: International Conference on Intelligent Autonomous Systems (IAS). Springer (2022). https://doi.org/10.1007/978-3-031-22216-0_3

  16. Trivedi, N., Sarvadevabhatla, R.K.: PSUMNet: unified modality part streams are all you need for efficient pose-based action recognition. In: ECCV Workshop and Challenge on People Analysis (WCPA). Springer (2022). https://doi.org/10.1007/978-3-031-25072-9_14

  17. Trivedi, N., Thatipelli, A., Sarvadevabhatla, R.K.: NTU-X: an enhanced large-scale dataset for improving pose-based recognition of subtle human actions. In: Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP). ACM (2021)

  18. Wang, L., et al.: Symbiotic human-robot collaborative assembly. CIRP Annals 68(2) (2019)

  19. Zhang, F., et al.: MediaPipe hands: on-device real-time hand tracking. In: Workshop on Computer Vision for AR/VR (CV4ARVR) (2020)

  20. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2019)


Author information


Corresponding author

Correspondence to Dustin Aganian.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Aganian, D., Köhler, M., Stephan, B., Eisenbach, M., Gross, HM. (2023). Fusing Hand and Body Skeletons for Human Action Recognition in Assembly. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14254. Springer, Cham. https://doi.org/10.1007/978-3-031-44207-0_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-44207-0_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44206-3

  • Online ISBN: 978-3-031-44207-0

  • eBook Packages: Computer Science, Computer Science (R0)
