Abstract
Hand gesture recognition is a research hotspot in human-machine interaction, and it has progressed considerably with the development of deep neural networks. However, deep learning based recognition methods rely heavily on large-scale labeled datasets, which are very hard to build in practical applications. To achieve good performance under the strict constraint of few training samples, this paper studies one-shot learning gesture recognition and presents a joint training method that combines a 3D ResNet with a memory module. In our scheme, feature extraction by the 3D ResNet and the memory module's capacity to remember rare events are optimized jointly, using an effective optimal-decision strategy and two related performance indices. To implement one-shot learning gesture recognition, the memory module stores the features extracted by the well-trained 3D ResNet, and classification is performed by the nearest neighbor algorithm with a cosine similarity measure. In real-world human-machine interaction, the ability to handle negative samples plays a significant role, so a mechanism based on a cosine similarity threshold is built to classify valid gestures and reject negative samples. To validate and evaluate the proposed method, a dedicated hand gesture dataset containing 3045 gesture videos is built, and a series of experimental results on our collected dataset and on public datasets demonstrate the feasibility and effectiveness of the method.
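The decision rule described in the abstract (nearest neighbor over stored features with cosine similarity, plus a similarity threshold for rejection) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional feature vectors, the gesture labels, and the threshold value of 0.8 are all assumptions chosen for illustration, and in the paper the features would come from the trained 3D ResNet.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_or_reject(query, memory, threshold=0.8):
    """Nearest-neighbor lookup with rejection.

    memory is a list of (feature_vector, label) pairs, one stored example per
    gesture class in the one-shot setting. Returns (label, similarity), where
    label is None when the best similarity falls below the threshold.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    best_label, best_sim = None, -1.0
    for key, label in memory:
        sim = cosine_similarity(query, key)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < threshold:
        return None, best_sim  # treated as a negative sample and rejected
    return best_label, best_sim


# Toy memory with made-up features and labels.
memory = [
    ([1.0, 0.0, 0.0], "swipe_left"),
    ([0.0, 1.0, 0.0], "swipe_right"),
]

accepted = classify_or_reject([0.9, 0.1, 0.0], memory)  # close to "swipe_left"
rejected = classify_or_reject([0.0, 0.0, 1.0], memory)  # unlike anything stored
```

Raising the threshold rejects more negative samples at the cost of also rejecting more valid gestures, so in practice it would be tuned on held-out data.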
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant No. 61731001) and SONY.
About this article
Cite this article
Li, L., Qin, S., Lu, Z. et al. One-shot learning gesture recognition based on joint training of 3D ResNet and memory module. Multimed Tools Appl 79, 6727–6757 (2020). https://doi.org/10.1007/s11042-019-08429-9
DOI: https://doi.org/10.1007/s11042-019-08429-9