Abstract
In this paper, we propose a novel approach for introducing semantic relations into the bag-of-words framework. We use latent semantic models, such as latent semantic analysis (LSA) and probabilistic latent semantic analysis (pLSA), to define semantically rich features and embed the visual features into a semantic space. The semantic features used in the LSA technique are derived from the low-rank approximation of the word–image occurrence matrix obtained by singular value decomposition. Similarly, in the pLSA approach, the topic-specific distributions of words can be considered dimensions of a concept space. In the proposed space, the distances between words represent semantic distances, which are used to construct a discriminative and semantically meaningful vocabulary. Since position information significantly improves scene recognition accuracy, we also bring position information into the proposed semantic vocabulary frameworks. We have tested our approach on the 15-Scene and 67-MIT Indoor datasets and have achieved very promising results.
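The LSA step described above can be illustrated with a minimal sketch. It assumes a small made-up word–image count matrix, a rank of 2, and cosine distance as the word-to-word metric; none of these choices is taken from the paper, and the function names `lsa_word_embedding` and `cosine_distances` are hypothetical. The sketch only shows how a truncated SVD places visual words in a low-dimensional concept space where co-occurring words end up close together.

```python
import numpy as np

def lsa_word_embedding(counts, rank):
    """Embed visual words into a `rank`-dimensional concept space.

    counts: (n_words, n_images) word-image occurrence matrix.
    Returns an (n_words, rank) array whose row i is the concept-space
    vector of word i, taken from the truncated SVD counts ~ U_k S_k V_k^T.
    """
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :rank] * s[:rank]

def cosine_distances(vectors):
    """Pairwise cosine distances (1 - cosine similarity) between rows.

    Assumes no row is the zero vector.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return 1.0 - unit @ unit.T

# Toy data (assumption): 4 visual words observed in 5 images.
# Words 0 and 1 occur in the same images; words 2 and 3 occur in others.
counts = np.array([
    [3, 2, 0, 0, 1],
    [6, 4, 0, 0, 2],
    [0, 0, 4, 3, 0],
    [0, 1, 5, 4, 0],
], dtype=float)

embedding = lsa_word_embedding(counts, rank=2)
dist = cosine_distances(embedding)

# Words with similar occurrence patterns are close in the concept space,
# so they are candidates for merging into one semantic visual word.
print(np.round(dist, 3))
```

In a vocabulary-construction setting, a distance matrix of this kind could be handed to any clustering routine to group visual words; which clustering method the authors use is not stated in the abstract.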













Notes
Polysemy is the existence of words that convey different concepts in different images. For instance, in the text domain, the word "table" can be interpreted either as a piece of furniture or as an arrangement of data.
About this article
Cite this article
Farahzadeh, E., Cham, TJ. & Sluzek, A. Scene recognition by semantic visual words. SIViP 9, 1935–1944 (2015). https://doi.org/10.1007/s11760-014-0687-7