Abstract
Large content networks such as the World Wide Web contain huge amounts of information that could be integrated, because their components fit within common concepts and/or are connected through hidden, implicit relationships. One attempt at such integration is the program called the “Web of Data,” an evolution of the Semantic Web. It targets semi-structured information sources such as Wikipedia, turns them into fully structured Web-based databases such as DBpedia, and then integrates them with other public databases such as Geonames. The vast majority of the information residing on the Web, however, is still entirely unstructured. This is the starting point for our approach, which aims to integrate unstructured information sources. For this purpose, we use techniques from Probabilistic Topic Modeling to cluster Web pages into concepts (topics), which are then related through higher-level concept networks; we also make implicit semantic relationships emerge between individual Web pages. The approach has been tested through a number of case studies, which are described here. While the applied focus of this research is knowledge integration in the specific and relevant case of the WWW, the wider aim is to provide an integration framework that is generally applicable to all complex content networks in which information propagates from multiple sources.
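The idea of grouping pages by topic and surfacing implicit relationships between pages can be illustrated with a minimal sketch. It is not the method of the article, whose details the abstract does not give. It assumes that a topic model such as LDA has already produced a topic distribution for each page. The page names, the numbers, the choice of Hellinger distance, and the `max_distance` threshold are all hypothetical.

```python
import math
from itertools import combinations

# Hypothetical per-page topic distributions, of the kind a topic model
# such as LDA might output. Names and values are made up for illustration.
page_topics = {
    "page_a": [0.80, 0.15, 0.05],
    "page_b": [0.70, 0.20, 0.10],
    "page_c": [0.05, 0.10, 0.85],
    "page_d": [0.10, 0.80, 0.10],
}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0 for identical distributions and 1 for disjoint ones.
    """
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def dominant_topic(dist):
    """Index of the topic with the highest probability."""
    return max(range(len(dist)), key=lambda k: dist[k])


def cluster_by_topic(pages):
    """Group page names by their dominant topic (a hard clustering)."""
    clusters = {}
    for name, dist in pages.items():
        clusters.setdefault(dominant_topic(dist), []).append(name)
    return clusters


def implicit_links(pages, max_distance=0.2):
    """Link pairs of pages whose topic mixtures are close.

    max_distance is an assumed, illustrative threshold.
    """
    links = []
    for (a, p), (b, q) in combinations(sorted(pages.items()), 2):
        d = hellinger(p, q)
        if d <= max_distance:
            links.append((a, b, d))
    return links


clusters = cluster_by_topic(page_topics)
links = implicit_links(page_topics)
```

With these made-up values, `page_a` and `page_b` fall into the same cluster and are the only pair linked, even though no explicit hyperlink between them is assumed. A soft clustering that keeps the full distribution per page would be an equally plausible design.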
Cite this article
Rossetti, M., Pareschi, R., Stella, F. et al. Integrating Concepts and Knowledge in Large Content Networks. New Gener. Comput. 32, 309–330 (2014). https://doi.org/10.1007/s00354-014-0407-4
DOI: https://doi.org/10.1007/s00354-014-0407-4