Abstract
An increasing number of different multimedia information, including text, voice, video and image, are used to describe the same semantic concept together on the Internet. This paper presents a new method to more efficiently cross-modal multimedia retrieval. Using image and text as an example, we learn the deep learning features of images by convolution neural networks, and learn the text features by a latent Dirichlet allocation model. Then map the two features spaces into a shared presentation space by a probability model in order that they are isomorphic. At last, we adopt centered correlation to measure the distance between them. The experimental results in the Wikipedia dataset show that our approach can achieve the state-of-the-art results.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Yang, Y., Xu, D., Nie, F., Luo, J., Zhuang, Y.: Ranking with local regression and global alignment for cross media retrieval. In: International Conference on Multimedia, pp. 175–184 (2009)
Srivastava, N., Salakhutdinov, R.R.: Multimodal learning with deep Boltzmann machines. In: Neural Information Processing Systems, pp. 2222–2230 (2012)
Lu, X., Wu, F., Tang, S.: A low rank structural large margin method for cross-modal ranking. In: Research and Development in Information Retrieval, pp. 433–442 (2013)
Lu, X., Wu, F., Tang, S., Zhang, Z., He, X., Zhuang, Y.: Cross-media semantic representation via bi-directional learning to rank. In: International Conference on Multimedia, pp. 877–886 (2013)
Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: International Conference on Machine Learning, pp. 282–289 (2001)
Xu, X.S., Jiang, Y., Peng, L., Xue, X., Zhou, Z.H.: Ensemble approach based on conditional random field for multi-label image and video annotation. In: International Conference on Multimedia, pp. 1377–1380 (2011)
Zhang, Y., Li, G., Chu, L., Wang, S., Zhang, W., Huang, Q.: Cross-media topic detection: a multi-modality fusion framework. In: International Conference on IEEE, pp. 1–6 (2013)
Li, L., Jiang, S., Huang, Q.: Learning image vicept description via mixed-norm regularization for large scale semantic image search. In: Computer Vision and Pattern Recognition, pp. 825–832 (2011)
Rasiwasia, N., Costa Pereira, J., Coviello, E., Doyle, G., Lanckriet, G.R., Levy, R., Vasconcelos, N.: A new approach to cross-modal multimedia retrieval. In: International Conference on Multimedia, pp. 251–260 (2010)
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Neural Information Processing Systems, pp. 1097–1105 (2012)
Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Computer Vision and Pattern Recognition Workshops, pp. 512–519 (2014)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Conference on Uncertainty in Artificial Intelligence, pp. 487–494 (2004)
Ramage, D., Hall, D., Nallapati, R., Manning, C.D.: Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Conference on Empirical Methods in Natural Language Processing, pp. 248–256 (2009)
Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and author community. In: Annual International Conference on Machine Learning, pp. 665–672 (2009)
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: International Conference on Machine Learning, pp. 689–696 (2011)
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: International Conference on Multimedia, pp. 675–678 (2014)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: International Conference on Machine Learning, pp. 807–814 (2010)
Li, J., Luo, W., Yang, J., Yuan, X.: Why Does The Unsupervised Pretraining Encourages Moderate-Sparseness. arXiv Preprint arXiv:1312.5813 (2013)
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv Preprint arXiv:1207.0580 (2012)
Wang, W., Ooi, B.C., Yang, X., Zhang, D., Zhuang, Y.: Effective multi-modal retrieval based on stacked auto-encoders. Proc. VLDB Endowment 7(8), 649–660 (2014)
Wu, F., Jiang, X., Li, X., Tang, S., Lu, W., Zhang, Z., Zhuang, Y.: Cross-modal learning to rank via latent joint representation. Image Process. 24(5), 1497–1509 (2015)
Ling, L., Zhai, X., Peng, Y.: Tri-space and ranking based heterogeneous similarity measure for cross-media retrieval. In: Pattern Recognition International Conference on IEEE, pp. 230–233 (2012)
Acknowledgement
This work was supported by the Grant of the National Science Foundation of China (No. 61175121, 61502183), the Grant of the National Science Foundation of Fujian Province (No. 2013J06014), the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (No. ZQN-YX108), the Scientific Research Funds of Huaqiao University (No. 600005-Z15Y0016), and Subsidized Project for Cultivating Postgraduates’ Innovative Ability in Scientific Research of Huaqiao University (Nos. 1400214009, 1400214003).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Zou, H., Du, JX., Zhai, CM., Wang, J. (2016). Deep Learning and Shared Representation Space Learning Based Cross-Modal Multimedia Retrieval. In: Huang, DS., Jo, KH. (eds) Intelligent Computing Theories and Application. ICIC 2016. Lecture Notes in Computer Science(), vol 9772. Springer, Cham. https://doi.org/10.1007/978-3-319-42294-7_28
Download citation
DOI: https://doi.org/10.1007/978-3-319-42294-7_28
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42293-0
Online ISBN: 978-3-319-42294-7
eBook Packages: Computer ScienceComputer Science (R0)