
Uni- and multimodal methods for single- and multi-label recognition

Multimedia Tools and Applications

Abstract

The multimodal approach is becoming an increasingly attractive and common method in multimedia information retrieval and description. It often yields better content recognition results than unimodal methods alone, although, depending on the data used, this is not always the case. Most current multimodal media content classification methods still depend on unimodal recognition results. For both uni- and multimodal approaches, it is important to choose the best features and classification models. In addition, when unimodal models are used, the final multimodal recognition results still need to be produced with an appropriate late fusion technique. In this article, we study several uni- and multimodal recognition methods, features for them, and their combination techniques, in the application setting of concept detection in image–text data. We consider both single- and multi-label recognition tasks. As image features, we use GoogLeNet deep convolutional neural network (DCNN) activation features and semantic concept or classeme vectors. As text features, we use simple binary vectors for tags and word2vec embedding vectors. The Multimodal Deep Boltzmann Machine (DBM) is used as the multimodal model, and the Support Vector Machine (SVM), with both linear and non-linear radial basis function (RBF) kernels, as the unimodal one. The experiments are performed on the MIRFLICKR-1M and NUS-WIDE datasets. The results show that the two models perform equally well in the single-label recognition task on the former dataset, while the Multimodal DBM produces clearly better results in the multi-label task on the latter. Compared with results in the literature, we exceed the state of the art on both datasets, mostly due to the use of DCNN features and the semantic concept vectors based on them.
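To make the late-fusion setup concrete, below is a minimal sketch of score-level fusion of two unimodal SVMs, of the kind discussed above. It assumes scikit-learn and uses random stand-in matrices in place of real GoogLeNet activations and word2vec tag embeddings; it is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# Stand-in features (hypothetical shapes): in the paper's setup these
# would be 1024-d GoogLeNet activations and 300-d word2vec embeddings.
rng = np.random.default_rng(0)
n = 200
X_img = rng.normal(size=(n, 1024))    # image modality
X_txt = rng.normal(size=(n, 300))     # text modality
y = rng.integers(0, 2, size=n)        # one binary concept label

# Unimodal classifiers: a linear SVM for the high-dimensional image
# features, an RBF-kernel SVM for the denser text features.
img_clf = LinearSVC(C=1.0).fit(X_img, y)
txt_clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_txt, y)

# Late fusion: average the unimodal decision values and threshold.
# (In practice the scores would be computed on held-out data.)
fused = 0.5 * (img_clf.decision_function(X_img)
               + txt_clf.decision_function(X_txt))
pred = (fused > 0).astype(int)
```

For multi-label concept detection, one such fused detector can be trained per concept and all of them run independently on each image–text pair; the multimodal DBM differs in that it learns a joint representation of both modalities instead of fusing separate unimodal scores.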



Notes

  1. http://mulan.sourceforge.net


Acknowledgment

This work has been funded by Academy of Finland grant 251170 and the Data to Intelligence (D2I) DIGILE SHOK program. The calculations were performed using computer resources of the Aalto University School of Science “Science-IT” project.

Author information

Corresponding author

Correspondence to Satoru Ishikawa.


About this article


Cite this article

Ishikawa, S., Laaksonen, J. Uni- and multimodal methods for single- and multi-label recognition. Multimed Tools Appl 76, 22405–22423 (2017). https://doi.org/10.1007/s11042-017-4733-7

