Abstract
According to extreme value theory, the unseen novel classes in open-set recognition can be viewed as extreme values of the training classes. Following this idea, we introduce margin and coverage distributions to model the training classes. We propose a novel visual-semantic embedding framework, extreme vocabulary learning (EVoL), which embeds visual features into the semantic space in a probabilistic way. Notably, we exploit the vast open vocabulary in the semantic space to further constrain the margin and coverage of the training classes. The learned embedding can be used directly to solve supervised learning, zero-shot learning, and open-set recognition simultaneously. Experiments on two benchmark datasets demonstrate the effectiveness of the proposed framework over conventional approaches.
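To make the extreme-value intuition concrete, the following is a minimal sketch, not the paper's EVoL model: each known class is summarized by its center in an embedding space, a Weibull distribution is fitted to that class's largest training distances (its "margin" tail), and a query whose distance falls beyond every class's modeled tail is rejected as unseen. The toy classes, the nearest-center distance, the 20-sample tail size, and the rejection threshold are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(0)

# Toy "training" features for two known classes in a 2-D embedding space.
train = {
    "cat": rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2)),
    "dog": rng.normal(loc=[3.0, 3.0], scale=0.3, size=(100, 2)),
}
centers = {c: x.mean(axis=0) for c, x in train.items()}

# EVT step: per class, fit a Weibull to the largest training distances,
# i.e., model the extreme values of the within-class distance distribution.
tails = {}
for c, x in train.items():
    d = np.linalg.norm(x - centers[c], axis=1)
    tails[c] = weibull_min.fit(np.sort(d)[-20:], floc=0)

def inclusion_prob(query, c):
    """Probability that `query`'s distance is NOT beyond class c's tail."""
    d = np.linalg.norm(query - centers[c])
    shape, loc, scale = tails[c]
    return 1.0 - weibull_min.cdf(d, shape, loc=loc, scale=scale)

def predict(query, threshold=0.01):
    """Return the best-matching known class, or 'unseen' if every
    class's extreme-value model rejects the query."""
    probs = {c: inclusion_prob(query, c) for c in centers}
    best = max(probs, key=probs.get)
    return best if probs[best] > threshold else "unseen"
```

A point near a training class is accepted (e.g., `predict(np.array([0.1, -0.1]))` returns `"cat"`), while a point far from all class tails is rejected as `"unseen"`.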
Author information
Hanze Dong is an undergraduate student majoring in mathematics and applied mathematics (data science track) at the School of Data Science, Fudan University, China. He works in the Shanghai Key Lab of Intelligent Information Processing under the supervision of Professor Yanwei Fu. His current research interests include machine learning theory and its applications.
Zhenfeng Sun is a DEng student in the School of Computer Science, Fudan University, China. He received his master's degree from the Computer Science Department of Tongji University, China, in 2003. He has many years of experience in the video industry. In 2005, he was responsible for deploying Alcatel-Lucent's leading triple-play solution in China. In 2013, he joined the world's largest IPTV operator, BESTV Company, where he led the terminal department and participated in a national key research project. He is now the co-founder of OTVClOUD, a company invested in by Yunfeng Fund (set up by Jack Ma, founder of Alibaba). His research interests are video retrieval, recognition, and innovative video applications. He holds several patents.
Yanwei Fu received the PhD degree from Queen Mary University of London, UK in 2014, and the MEng degree from the Department of Computer Science and Technology, Nanjing University, China in 2011. He held a post-doctoral position at Disney Research, Pittsburgh, PA, USA, from 2015 to 2016. He is currently a tenure-track professor at Fudan University, China. His research interests are image and video understanding, and life-long learning.
Shi Zhong received his PhD degree in 2018 from the School of Computer Science, Fudan University, China. His research interests mainly include computer vision and machine learning.
Zhengjun Zhang is a full professor of Statistics in the Department of Statistics at the University of Wisconsin-Madison, USA. He received his PhD degrees in Management Engineering and Statistics from Beihang University and the University of North Carolina at Chapel Hill, respectively. Dr. Zhang's main research areas include big data structure and inference, particularly extreme value analysis for interdependent critical risk variables in finance, climate, and medical sciences, and stochastic optimization in large and complex systems. His selected journal publications include the Annals of Statistics, Journal of the Royal Statistical Society, Series B, Journal of the American Statistical Association, Journal of Econometrics, Journal of Banking and Finance, Extremes, and Automatica.
Yu-Gang Jiang is a professor of Computer Science at Fudan University and Director of the Fudan-Jilian Joint Research Center on Intelligent Video Technology, China. He is interested in all aspects of extracting high-level information from big video data, such as video event recognition, object/scene recognition, and large-scale visual search. His work has led to many awards, including the inaugural ACM China Rising Star Award, the 2015 ACM SIGMM Rising Star Award, and the research award for outstanding young researchers from NSF China. He is currently an associate editor of ACM TOMM, Machine Vision and Applications (MVA), and Neurocomputing. He holds a PhD in Computer Science from City University of Hong Kong and spent three years working at Columbia University before joining Fudan in 2011.
Cite this article
Dong, H., Sun, Z., Fu, Y. et al. Extreme vocabulary learning. Front. Comput. Sci. 14, 146315 (2020). https://doi.org/10.1007/s11704-019-8249-3