Abstract
The Internet explosion and the massive diffusion of mobile devices lead to the creation of a worldwide collaborative system, daily used by millions of users through search engines and application interfaces. New paradigms permit to calculate the similarity of terms using only the statistical information returned by a query, or from additional features; also old algorithms and measures have been applied to new domains and scopes, to efficiently find words clusters from the Web. The problem of evaluating such techniques and algorithms in new domains emerges, and highlights a still open field of experimentation.
In this paper, preliminary tests have been held on different semantic proximity measures (average confidence, NGD, PMI, \(\chi ^{2}\), PMING Distance), and different clustering algorithms among the most used in literature have been compared (e.g. k-means, Expectation-Maximization, spectral clustering) for evaluating such measures. The suitability of the considered measures and methods to calculate the semantic proximity was verified at the state-of-art, and problems were identified, comparing the results of measurements to a ground truth provided by models of contextualized knowledge, clustering and human perception of semantic relations, which data are already studied in literature.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
The local normalization of a measure \(\eta \) in a evaluation context W consists in evaluating max \(\eta \) of the values \(\eta (w_{i})\) for \(w_{i}\in W\) and substituting these values with \(\eta (w_{i})/\eta \).
- 2.
The complementation to 1 of the locally normalized PMI consists in substituting each locally normalized value with \((1-\eta (w_{i})/\eta )\) and eventually forcing null values in the diagonal of the adjacency matrix.
- 3.
The Kullback-Leibler divergence, also known as information gain or information divergence [24], measures the difference between two probability distribution P and Q, considering the expected value of the extra bits requested to encode examples from P when a code based on Q is used (i.e. represents the “true" distribution of the observations, or a theoretical distribution which correctly approximates P).
- 4.
The Laplacian matrix, or Kirchhoff matrix [28], is used to calculate the spanning tree number, i.e. the number of the trees which can be constructed starting from an indirect connected graph, formed by every vertex of some of (or all) the edges; in other words, a selection of the edges of a graph which expand on each vertex such as every vertex is inside the tree, without cycles. In spectral graphs, the Laplacian approximates the most sparse cut of the graph, through the second eigenvalue.
References
Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An On-line Lexical Database (1993)
Budanitsky, A., Hirst, G.: Semantic distance in wordnet: an experimental, application-oriented evaluation of five measures. In: Proceedings of Workshop on WordNet and Other Lexical Resources, p. 641. North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, USA (2001)
Franzoni, V., Milani, A.: PMING distance: a collaborative semantic proximity measure. WI-IAT 2, 442–449 (2012). IEEE/WIC/ACM
Franzoni, V., Milani, A.: Heuristic semantic walk for concept chaining in collaborative networks. Int. J. Web Inf. Syst. 10(1), 85–103 (2014). doi:10.1108/IJWIS-11-2013-0031
Franzoni, V., Milani, A.: Heuristic semantic walk. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 643–656. Springer, Heidelberg (2013)
Leung, C.H.C., Li, Y., Milani, A., Franzoni, V.: Collective evolutionary concept distance based query expansion for effective web document retrieval. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 657–672. Springer, Heidelberg (2013)
Cilibrasi, R., Vitanyi, P.: The google similarity distance. ArXiv.org (2004)
Franzoni, V., Milani, A.: Semantic Context extraction from collaborative networks. In: IEEE International Conference on Computer Supported Cooperative Work in Design CSCWD, 2015, Italy (2015)
Franzoni, V., Milani, A.: Context extraction by multi-path traces in semantic networks, CEUR-WS. In: Proceedings of RR2015 Doctoral Consortium, Berlin (2015)
Franzoni, V., Leung, C.H.C., Li, Y., Mengoni, P., Milani, A.: Set similarity measures for images based on collective knowledge. In: Gervasi, O., Murgante, B., Misra, S., Gavrilova, M.L., Rocha, A.M.A.C., Torre, C., Taniar, D., Apduhan, B.O. (eds.) ICCSA 2015. LNCS, vol. 9155, pp. 408–417. Springer, Heidelberg (2015)
Chiancone, A., Niyogi, R., et al.: Improving link ranking quality by quasi-common neighbourhood. In: 2015 IEEE CPS International Conference on Computational Science and Its Applications (2015)
Chiancone, A., Madotto, A., et al.: Multistrain bacterial model for link prediction. In: 2015 Proceedings of 11th International Conference on Natural Computation IEEE ICNC. CFP15CNC-CDR, Observation of strains. Infect Dis Ther. 3(1), 35–43 (2011). ISBN: 978-1-4673-7678-5
Franzoni, V., Milani, A., Pallottelli, S.: Multi-path traces in semantic graphs for latent knowledge elicitation. In: 2015 IEEE ICNC Proceedings of 11th International Conference on Natural Computation, CFP15CNC-CDR (2015). ISBN: 978-1-4673-7678-5
Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography. In: Proceedings of ACL, vol. 27 (1989)
Turney, P.D.: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Flach, P.A., Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, p. 491. Springer, Heidelberg (2001)
Newman, M.E.J.: Fast algorithm for detecting community structure in networks. University of Michigan, MI (2003)
Matsuo, Y., Sakaki, T., Uchiyama, K., Ishizuka, M.: Graph-based word clustering using a web search engine. University of Tokio (2006)
Dellaert, F.: The Expectation-Maximization Algorithm. Elsevier, New York (2002)
Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1), 145–151 (1991)
Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. J. Roy. Stat. Soc. 28, 100–108 (1979)
Joyce, J.M.: Kullback-Leibler Divergence: International Encyclopedia of Statistical Science. Springer, Heidelberg (2011)
Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD (2004)
Pozrikidis, C.: Node degree distribution in spanning trees. J. Phys. A: Math. Theoret. 49(12) (2016)
Bastianelli, E., Croce, D., Basili, R., Nardi, D.: Using semantic models for robust natural language human robot interaction. In: Gavanelli, M., et al. (eds.) AI*IA 2015. LNCS, vol. 9336, pp. 343–356. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24309-2_26. http://dx.doi.org/10.1007/978-3-319-24309-2
Wu, L., Hua, X.S., Yu, N., Ma, W.Y., Li, S.: Flickr Distance. In: Proceedings of Microsoft Research Asia (2008)
Manning, D., Schutze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, London (2002)
Franzoni, V., Poggioni, V., Zollo, F.: Automated book classification according to the emotional tags of the social network Zazie. In: CEUR-WS on ESSEM, AI*IA, vol. 1096, pp. 83–94 (2013)
Franzoni, V., Leung, C.H.C., Li, Y., Milani, A., Pallottelli, S.: Context-based image semantic similarity. In: 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, pp. 1280–1284 (2015). doi:10.1109/FSKD.2015.7382127
Chiancone, A., Franzoni, V., Li, Y., Markov, K., Milani, A.: Leveraging zero tail in neighbourhood based link prediction. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3, pp. 135–139 (2015). doi:10.1109/WI-IAT.2015.129
Franzoni, V., Milani, A.: A Pheromone-like model for semantic context extraction from collaborative networks. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Singapore, pp. 540–547 (2015). doi:10.1109/WI-IAT.2015.21
Di Iorio, A., Schaerf, M.: Identification semantics for an organization establishing a digital library system. In: Risse, T., Predoiu, L., Nürnberger, A., Ross, S. (eds.) Proceedings of the 4th International Workshop on Semantic Digital Archives (SDA 2014) Co-located with the International Digital Libraries Conference (DL 2014), vol. 1306, London, UK, September 12, 2014, pp. 16–27. CEUR-WS.org (2014). http://ceur-ws.org/Vol-1306
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Franzoni, V., Milani, A. (2016). A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2016. ICCSA 2016. Lecture Notes in Computer Science(), vol 9790. Springer, Cham. https://doi.org/10.1007/978-3-319-42092-9_34
Download citation
DOI: https://doi.org/10.1007/978-3-319-42092-9_34
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42091-2
Online ISBN: 978-3-319-42092-9
eBook Packages: Computer ScienceComputer Science (R0)