Abstract
The multimodal approach is becoming an increasingly attractive and common method in multimedia information retrieval and description. It often yields better content recognition results than unimodal methods alone but, depending on the data used, this is not always the case. Most current multimodal media content classification methods still depend on unimodal recognition results. For both uni- and multimodal approaches, it is important to choose the best features and classification models. In addition, in the case of unimodal models, the final multimodal recognitions still need to be produced with an appropriate late fusion technique. In this article, we study several multi- and unimodal recognition methods, features for them, and techniques for combining them, in the application setting of concept detection in image–text data. We consider both single- and multi-label recognition tasks. As image features, we use GoogLeNet deep convolutional neural network (DCNN) activation features and semantic concept or classeme vectors. As text features, we use simple binary vectors for tags and word2vec embedding vectors. The Multimodal Deep Boltzmann Machine (DBM) is used as the multimodal model, and the Support Vector Machine (SVM) with both linear and non-linear radial basis function (RBF) kernels as the unimodal one. The experiments are performed on the MIRFLICKR-1M and NUS-WIDE datasets. The results show that the two models perform equally well in the single-label recognition task of the former dataset, while the Multimodal DBM produces clearly better results in the multi-label task of the latter. Compared with the results in the literature, we exceed the state of the art on both datasets, mostly due to the use of DCNN features and the semantic concept vectors based on them.
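As a minimal illustrative sketch (not the paper's actual fusion implementation), late fusion of per-concept scores produced by separate unimodal image and text classifiers can be as simple as a weighted mean of the two score vectors, with the weight chosen on a validation set:

```python
import numpy as np

def late_fusion(image_scores, text_scores, w=0.5):
    """Weighted arithmetic-mean late fusion of unimodal detector scores.

    image_scores, text_scores: per-concept confidence scores from the
    two unimodal classifiers; w is the weight given to the image modality.
    """
    image_scores = np.asarray(image_scores, dtype=float)
    text_scores = np.asarray(text_scores, dtype=float)
    return w * image_scores + (1.0 - w) * text_scores

# Example: two concepts scored by both modalities, equal weighting.
fused = late_fusion([0.9, 0.2], [0.7, 0.4], w=0.5)
# fused is [0.8, 0.3]
```

More elaborate fusion rules (e.g. geometric mean or a learned combiner) follow the same pattern of operating only on the unimodal outputs.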
Acknowledgment
This work has been funded by grant 251170 of the Academy of Finland and by the Data to Intelligence (D2I) DIGILE SHOK program. The calculations were performed using computer resources of the Aalto University School of Science “Science-IT” project.
Cite this article
Ishikawa, S., Laaksonen, J. Uni- and multimodal methods for single- and multi-label recognition. Multimed Tools Appl 76, 22405–22423 (2017). https://doi.org/10.1007/s11042-017-4733-7