
Quantized depth image and skeleton-based multimodal dynamic hand gesture recognition

  • Original article
  • Published in The Visual Computer

Abstract

An existing approach to dynamic hand gesture recognition is to use multimodal-fusion CRNNs (convolutional recurrent neural networks) on depth images and the corresponding 2D hand skeleton coordinates. However, an underlying problem with this method is that raw depth images have very low contrast in the hand ROI (region of interest). They do not highlight the details that matter for fine-grained hand gesture recognition, such as finger orientation, the overlap between fingers and the palm, or the overlap between multiple fingers. To address this issue, we propose generating quantized depth images as an alternative input modality to raw depth images. Quantization creates sharp relative contrasts between key parts of the hand, which improves gesture recognition performance. In addition, we explore ways to tackle the high-variance problem in previously researched multimodal-fusion CRNN architectures. We obtain accuracies of 90.82% and 89.21% (14 and 28 gestures, respectively) on the DHG-14/28 dataset and accuracies of 93.81% and 90.24% (14 and 28 gestures, respectively) on the SHREC-2017 dataset, a significant improvement over previous multimodal-fusion CRNNs.
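The abstract does not spell out the exact quantization scheme, so the following is only a minimal sketch of the idea: assuming a single-channel NumPy depth frame and uniform binning over the hand's observed depth range, mapping raw depth to a few discrete levels stretches the hand's narrow depth range across the full intensity scale. The function name, the `num_levels` parameter, and the zero-as-invalid convention below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of uniform depth quantization; not the paper's exact method.
def quantize_depth(depth: np.ndarray, num_levels: int = 8) -> np.ndarray:
    """Map a raw depth frame onto a few discrete intensity levels.

    Stretching the hand's own (narrow) depth range across the full 8-bit
    scale before binning is what makes nearby structures, e.g. a finger
    in front of the palm, land in visibly different bins.
    """
    valid = depth > 0  # depth sensors typically emit 0 where no reading exists
    if not valid.any():
        return np.zeros(depth.shape, dtype=np.uint8)

    d_min = float(depth[valid].min())
    d_max = float(depth[valid].max())
    span = max(d_max - d_min, 1e-6)

    # Normalize valid pixels to [0, 1] over the observed depth range.
    norm = np.zeros(depth.shape, dtype=np.float32)
    norm[valid] = (depth[valid].astype(np.float32) - d_min) / span

    # Round each pixel to the nearest of num_levels bins, then rescale to 8-bit.
    levels = np.round(norm * (num_levels - 1))
    return (levels * (255.0 / (num_levels - 1))).astype(np.uint8)

# Usage on a hypothetical 16-bit depth frame:
# quantized = quantize_depth(depth_frame, num_levels=8)
```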



Availability of data and materials

The associated code is available at this GitHub repository: https://github.com/ID56/Multimodal-Fusion-CRNN.


Author information

Corresponding author

Correspondence to Hasan Mahmud.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mahmud, H., Morshed, M.M. & Hasan, M.K. Quantized depth image and skeleton-based multimodal dynamic hand gesture recognition. Vis Comput 40, 11–25 (2024). https://doi.org/10.1007/s00371-022-02762-1
