A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures

Franzoni, Valentina; Milani, Alfredo

doi:10.1007/978-3-319-42092-9_34

Valentina Franzoni^22,24 &
Alfredo Milani^23,24

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9790))

Included in the following conference series:

International Conference on Computational Science and Its Applications

1574 Accesses

Abstract

The Internet explosion and the massive diffusion of mobile devices lead to the creation of a worldwide collaborative system, daily used by millions of users through search engines and application interfaces. New paradigms permit to calculate the similarity of terms using only the statistical information returned by a query, or from additional features; also old algorithms and measures have been applied to new domains and scopes, to efficiently find words clusters from the Web. The problem of evaluating such techniques and algorithms in new domains emerges, and highlights a still open field of experimentation.

In this paper, preliminary tests have been held on different semantic proximity measures (average confidence, NGD, PMI, \(\chi ^{2}\), PMING Distance), and different clustering algorithms among the most used in literature have been compared (e.g. k-means, Expectation-Maximization, spectral clustering) for evaluating such measures. The suitability of the considered measures and methods to calculate the semantic proximity was verified at the state-of-art, and problems were identified, comparing the results of measurements to a ground truth provided by models of contextualized knowledge, clustering and human perception of semantic relations, which data are already studied in literature.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: JPY 3498; Price includes VAT (Japan)

eBook: JPY 5719; Price includes VAT (Japan)

Softcover Book: JPY 7149; Price includes VAT (Japan)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Measurement of Semantic Similarity: A Concept Hierarchy Based Approach

An Analysis of Semantic Similarity Measures for Information Retrieval

Web Similarity in Sets of Search Terms Using Database Queries

Article 12 May 2020

Notes

1.
The local normalization of a measure \(\eta \) in a evaluation context W consists in evaluating max \(\eta \) of the values \(\eta (w_{i})\) for \(w_{i}\in W\) and substituting these values with \(\eta (w_{i})/\eta \).
2.
The complementation to 1 of the locally normalized PMI consists in substituting each locally normalized value with \((1-\eta (w_{i})/\eta )\) and eventually forcing null values in the diagonal of the adjacency matrix.
3.
The Kullback-Leibler divergence, also known as information gain or information divergence [24], measures the difference between two probability distribution P and Q, considering the expected value of the extra bits requested to encode examples from P when a code based on Q is used (i.e. represents the “true" distribution of the observations, or a theoretical distribution which correctly approximates P).
4.
The Laplacian matrix, or Kirchhoff matrix [28], is used to calculate the spanning tree number, i.e. the number of the trees which can be constructed starting from an indirect connected graph, formed by every vertex of some of (or all) the edges; in other words, a selection of the edges of a graph which expand on each vertex such as every vertex is inside the tree, without cycles. In spectral graphs, the Laplacian approximates the most sparse cut of the graph, through the second eigenvalue.

References

Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An On-line Lexical Database (1993)
Google Scholar
Budanitsky, A., Hirst, G.: Semantic distance in wordnet: an experimental, application-oriented evaluation of five measures. In: Proceedings of Workshop on WordNet and Other Lexical Resources, p. 641. North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, USA (2001)
Google Scholar
Franzoni, V., Milani, A.: PMING distance: a collaborative semantic proximity measure. WI-IAT 2, 442–449 (2012). IEEE/WIC/ACM
Google Scholar
Franzoni, V., Milani, A.: Heuristic semantic walk for concept chaining in collaborative networks. Int. J. Web Inf. Syst. 10(1), 85–103 (2014). doi:10.1108/IJWIS-11-2013-0031
Article Google Scholar
Franzoni, V., Milani, A.: Heuristic semantic walk. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 643–656. Springer, Heidelberg (2013)
Chapter Google Scholar
Leung, C.H.C., Li, Y., Milani, A., Franzoni, V.: Collective evolutionary concept distance based query expansion for effective web document retrieval. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 657–672. Springer, Heidelberg (2013)
Chapter Google Scholar
Cilibrasi, R., Vitanyi, P.: The google similarity distance. ArXiv.org (2004)
Google Scholar
Franzoni, V., Milani, A.: Semantic Context extraction from collaborative networks. In: IEEE International Conference on Computer Supported Cooperative Work in Design CSCWD, 2015, Italy (2015)
Google Scholar
Franzoni, V., Milani, A.: Context extraction by multi-path traces in semantic networks, CEUR-WS. In: Proceedings of RR2015 Doctoral Consortium, Berlin (2015)
Google Scholar
Franzoni, V., Leung, C.H.C., Li, Y., Mengoni, P., Milani, A.: Set similarity measures for images based on collective knowledge. In: Gervasi, O., Murgante, B., Misra, S., Gavrilova, M.L., Rocha, A.M.A.C., Torre, C., Taniar, D., Apduhan, B.O. (eds.) ICCSA 2015. LNCS, vol. 9155, pp. 408–417. Springer, Heidelberg (2015)
Chapter Google Scholar
Chiancone, A., Niyogi, R., et al.: Improving link ranking quality by quasi-common neighbourhood. In: 2015 IEEE CPS International Conference on Computational Science and Its Applications (2015)
Google Scholar
Chiancone, A., Madotto, A., et al.: Multistrain bacterial model for link prediction. In: 2015 Proceedings of 11th International Conference on Natural Computation IEEE ICNC. CFP15CNC-CDR, Observation of strains. Infect Dis Ther. 3(1), 35–43 (2011). ISBN: 978-1-4673-7678-5
Google Scholar
http://www.google.com
http://www.bing.com
http://yahoo.com
http://www.youtube.com
https://www.flickr.com
Franzoni, V., Milani, A., Pallottelli, S.: Multi-path traces in semantic graphs for latent knowledge elicitation. In: 2015 IEEE ICNC Proceedings of 11th International Conference on Natural Computation, CFP15CNC-CDR (2015). ISBN: 978-1-4673-7678-5
Google Scholar
Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography. In: Proceedings of ACL, vol. 27 (1989)
Google Scholar
Turney, P.D.: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Flach, P.A., Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, p. 491. Springer, Heidelberg (2001)
Chapter Google Scholar
Newman, M.E.J.: Fast algorithm for detecting community structure in networks. University of Michigan, MI (2003)
Google Scholar
Matsuo, Y., Sakaki, T., Uchiyama, K., Ishizuka, M.: Graph-based word clustering using a web search engine. University of Tokio (2006)
Google Scholar
Dellaert, F.: The Expectation-Maximization Algorithm. Elsevier, New York (2002)
Google Scholar
Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1), 145–151 (1991)
Article MathSciNet MATH Google Scholar
Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. J. Roy. Stat. Soc. 28, 100–108 (1979)
MATH Google Scholar
Joyce, J.M.: Kullback-Leibler Divergence: International Encyclopedia of Statistical Science. Springer, Heidelberg (2011)
Book Google Scholar
Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD (2004)
Google Scholar
Pozrikidis, C.: Node degree distribution in spanning trees. J. Phys. A: Math. Theoret. 49(12) (2016)
Google Scholar
Bastianelli, E., Croce, D., Basili, R., Nardi, D.: Using semantic models for robust natural language human robot interaction. In: Gavanelli, M., et al. (eds.) AI*IA 2015. LNCS, vol. 9336, pp. 343–356. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24309-2_26. http://dx.doi.org/10.1007/978-3-319-24309-2
Chapter Google Scholar
Wu, L., Hua, X.S., Yu, N., Ma, W.Y., Li, S.: Flickr Distance. In: Proceedings of Microsoft Research Asia (2008)
Google Scholar
Manning, D., Schutze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, London (2002)
MATH Google Scholar
Franzoni, V., Poggioni, V., Zollo, F.: Automated book classification according to the emotional tags of the social network Zazie. In: CEUR-WS on ESSEM, AI*IA, vol. 1096, pp. 83–94 (2013)
Google Scholar
Franzoni, V., Leung, C.H.C., Li, Y., Milani, A., Pallottelli, S.: Context-based image semantic similarity. In: 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, pp. 1280–1284 (2015). doi:10.1109/FSKD.2015.7382127
Chiancone, A., Franzoni, V., Li, Y., Markov, K., Milani, A.: Leveraging zero tail in neighbourhood based link prediction. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3, pp. 135–139 (2015). doi:10.1109/WI-IAT.2015.129
Franzoni, V., Milani, A.: A Pheromone-like model for semantic context extraction from collaborative networks. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Singapore, pp. 540–547 (2015). doi:10.1109/WI-IAT.2015.21
Di Iorio, A., Schaerf, M.: Identification semantics for an organization establishing a digital library system. In: Risse, T., Predoiu, L., Nürnberger, A., Ross, S. (eds.) Proceedings of the 4th International Workshop on Semantic Digital Archives (SDA 2014) Co-located with the International Digital Libraries Conference (DL 2014), vol. 1306, London, UK, September 12, 2014, pp. 16–27. CEUR-WS.org (2014). http://ceur-ws.org/Vol-1306

Download references

Author information

Authors and Affiliations

Department of Computer, Control, and Management Engineering, La Sapienza University of Rome, Rome, Italy
Valentina Franzoni
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Alfredo Milani
Department of Mathematics and Computer Science, University of Perugia, Perugia, Italy
Valentina Franzoni & Alfredo Milani

Authors

Valentina Franzoni
View author publications
You can also search for this author in PubMed Google Scholar
Alfredo Milani
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Valentina Franzoni .

Editor information

Editors and Affiliations

University of Perugia, Perugia, Italy
Osvaldo Gervasi
University of Basilicata, Potenza, Italy
Beniamino Murgante
Covenant University, Ota, Nigeria
Sanjay Misra
University of Minho, Braga, Portugal
Ana Maria A.C. Rocha
Polytechnic University, Bari, Italy
Carmelo M. Torre
Monash University, Clayton, Victoria, Australia
David Taniar
Kyushu Sangyo University, Fukuoka, Japan
Bernady O. Apduhan
Saint Petersburg State University, Saint Petersburg, Russia
Elena Stankova
Beijing University of Posts and Telecommunications, Beijing, China
Shangguang Wang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Franzoni, V., Milani, A. (2016). A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2016. ICCSA 2016. Lecture Notes in Computer Science(), vol 9790. Springer, Cham. https://doi.org/10.1007/978-3-319-42092-9_34

Download citation

DOI: https://doi.org/10.1007/978-3-319-42092-9_34
Published: 01 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42091-2
Online ISBN: 978-3-319-42092-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics