Abstract
An existing approach to dynamic hand gesture recognition is to use multimodal-fusion CRNNs (convolutional recurrent neural networks) on depth images and the corresponding 2D hand skeleton coordinates. However, an underlying problem with this method is that raw depth images have very low contrast in the hand ROI (region of interest). They do not highlight details that are important for fine-grained hand gesture recognition, such as finger orientation, the overlap between the fingers and the palm, or the overlap between multiple fingers. To address this issue, we propose generating quantized depth images as an alternative input modality to raw depth images. Quantization creates sharp relative contrasts between key parts of the hand, which improves gesture recognition performance. In addition, we explore ways to tackle the high-variance problem in previously researched multimodal-fusion CRNN architectures. We obtained accuracies of 90.82% and 89.21% (14 and 28 gestures, respectively) on the DHG-14/28 dataset and accuracies of 93.81% and 90.24% (14 and 28 gestures, respectively) on the SHREC-2017 dataset, which is a significant improvement over previous multimodal-fusion CRNNs.
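To make the idea of a quantized depth image concrete, the following is a minimal sketch of one possible quantization scheme, not necessarily the procedure used in the article. It assumes that a depth value of 0 marks background or missing pixels, that the input is already cropped to the hand ROI, and that uniform binning over the hand's own depth range is acceptable. The function name `quantize_depth` and the parameter `num_levels` are hypothetical and chosen only for illustration.

```python
import numpy as np


def quantize_depth(depth, num_levels=8):
    """Map valid depth pixels onto `num_levels` evenly spaced gray levels.

    Assumption (not from the article): a value of 0 means "no depth" and
    stays 0 in the output; all other pixels are binned uniformly between
    the nearest and farthest valid depth in the crop.
    """
    depth = np.asarray(depth, dtype=np.float64)
    valid = depth > 0
    out = np.zeros(depth.shape, dtype=np.uint8)
    if not valid.any():
        return out  # nothing to quantize

    lo, hi = depth[valid].min(), depth[valid].max()
    if hi == lo:
        out[valid] = 255  # flat surface: a single level
        return out

    # Normalise valid pixels to [0, 1] over the hand's own depth range.
    norm = (depth[valid] - lo) / (hi - lo)
    # Bin index in 0 .. num_levels - 1 (the farthest pixel joins the last bin).
    bins = np.minimum((norm * num_levels).astype(np.int64), num_levels - 1)
    # Spread the bins over 1 .. 255 so that no valid pixel collides with the
    # background value 0.
    out[valid] = np.round((bins + 1) * 255.0 / num_levels).astype(np.uint8)
    return out


if __name__ == "__main__":
    # Toy 2x3 "hand crop" in millimetres; the 0 is a background pixel.
    crop = np.array([[0, 500, 510], [520, 530, 540]])
    print(quantize_depth(crop, num_levels=4))
```

Because neighbouring depth values that fall into different bins are pushed onto clearly separated gray levels, small depth differences (for example, a finger lying in front of the palm) become visible intensity steps, which is the contrast effect the abstract describes.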









Availability of data and materials
The associated code is available at this GitHub repository: https://github.com/ID56/Multimodal-Fusion-CRNN.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Mahmud, H., Morshed, M.M. & Hasan, M.K. Quantized depth image and skeleton-based multimodal dynamic hand gesture recognition. Vis Comput 40, 11–25 (2024). https://doi.org/10.1007/s00371-022-02762-1
DOI: https://doi.org/10.1007/s00371-022-02762-1