A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures | SpringerLink
Skip to main content

A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures

  • Conference paper
  • First Online:
Computational Science and Its Applications – ICCSA 2016 (ICCSA 2016)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9790))

Included in the following conference series:

  • 1574 Accesses

Abstract

The Internet explosion and the massive diffusion of mobile devices lead to the creation of a worldwide collaborative system, daily used by millions of users through search engines and application interfaces. New paradigms permit to calculate the similarity of terms using only the statistical information returned by a query, or from additional features; also old algorithms and measures have been applied to new domains and scopes, to efficiently find words clusters from the Web. The problem of evaluating such techniques and algorithms in new domains emerges, and highlights a still open field of experimentation.

In this paper, preliminary tests have been held on different semantic proximity measures (average confidence, NGD, PMI, \(\chi ^{2}\), PMING Distance), and different clustering algorithms among the most used in literature have been compared (e.g. k-means, Expectation-Maximization, spectral clustering) for evaluating such measures. The suitability of the considered measures and methods to calculate the semantic proximity was verified at the state-of-art, and problems were identified, comparing the results of measurements to a ground truth provided by models of contextualized knowledge, clustering and human perception of semantic relations, which data are already studied in literature.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 5719
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 7149
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    The local normalization of a measure \(\eta \) in a evaluation context W consists in evaluating max \(\eta \) of the values \(\eta (w_{i})\) for \(w_{i}\in W\) and substituting these values with \(\eta (w_{i})/\eta \).

  2. 2.

    The complementation to 1 of the locally normalized PMI consists in substituting each locally normalized value with \((1-\eta (w_{i})/\eta )\) and eventually forcing null values in the diagonal of the adjacency matrix.

  3. 3.

    The Kullback-Leibler divergence, also known as information gain or information divergence [24], measures the difference between two probability distribution P and Q, considering the expected value of the extra bits requested to encode examples from P when a code based on Q is used (i.e. represents the “true" distribution of the observations, or a theoretical distribution which correctly approximates P).

  4. 4.

    The Laplacian matrix, or Kirchhoff matrix [28], is used to calculate the spanning tree number, i.e. the number of the trees which can be constructed starting from an indirect connected graph, formed by every vertex of some of (or all) the edges; in other words, a selection of the edges of a graph which expand on each vertex such as every vertex is inside the tree, without cycles. In spectral graphs, the Laplacian approximates the most sparse cut of the graph, through the second eigenvalue.

References

  1. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An On-line Lexical Database (1993)

    Google Scholar 

  2. Budanitsky, A., Hirst, G.: Semantic distance in wordnet: an experimental, application-oriented evaluation of five measures. In: Proceedings of Workshop on WordNet and Other Lexical Resources, p. 641. North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, USA (2001)

    Google Scholar 

  3. Franzoni, V., Milani, A.: PMING distance: a collaborative semantic proximity measure. WI-IAT 2, 442–449 (2012). IEEE/WIC/ACM

    Google Scholar 

  4. Franzoni, V., Milani, A.: Heuristic semantic walk for concept chaining in collaborative networks. Int. J. Web Inf. Syst. 10(1), 85–103 (2014). doi:10.1108/IJWIS-11-2013-0031

    Article  Google Scholar 

  5. Franzoni, V., Milani, A.: Heuristic semantic walk. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 643–656. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  6. Leung, C.H.C., Li, Y., Milani, A., Franzoni, V.: Collective evolutionary concept distance based query expansion for effective web document retrieval. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 657–672. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  7. Cilibrasi, R., Vitanyi, P.: The google similarity distance. ArXiv.org (2004)

    Google Scholar 

  8. Franzoni, V., Milani, A.: Semantic Context extraction from collaborative networks. In: IEEE International Conference on Computer Supported Cooperative Work in Design CSCWD, 2015, Italy (2015)

    Google Scholar 

  9. Franzoni, V., Milani, A.: Context extraction by multi-path traces in semantic networks, CEUR-WS. In: Proceedings of RR2015 Doctoral Consortium, Berlin (2015)

    Google Scholar 

  10. Franzoni, V., Leung, C.H.C., Li, Y., Mengoni, P., Milani, A.: Set similarity measures for images based on collective knowledge. In: Gervasi, O., Murgante, B., Misra, S., Gavrilova, M.L., Rocha, A.M.A.C., Torre, C., Taniar, D., Apduhan, B.O. (eds.) ICCSA 2015. LNCS, vol. 9155, pp. 408–417. Springer, Heidelberg (2015)

    Chapter  Google Scholar 

  11. Chiancone, A., Niyogi, R., et al.: Improving link ranking quality by quasi-common neighbourhood. In: 2015 IEEE CPS International Conference on Computational Science and Its Applications (2015)

    Google Scholar 

  12. Chiancone, A., Madotto, A., et al.: Multistrain bacterial model for link prediction. In: 2015 Proceedings of 11th International Conference on Natural Computation IEEE ICNC. CFP15CNC-CDR, Observation of strains. Infect Dis Ther. 3(1), 35–43 (2011). ISBN: 978-1-4673-7678-5

    Google Scholar 

  13. http://www.google.com

  14. http://www.bing.com

  15. http://yahoo.com

  16. http://www.youtube.com

  17. https://www.flickr.com

  18. Franzoni, V., Milani, A., Pallottelli, S.: Multi-path traces in semantic graphs for latent knowledge elicitation. In: 2015 IEEE ICNC Proceedings of 11th International Conference on Natural Computation, CFP15CNC-CDR (2015). ISBN: 978-1-4673-7678-5

    Google Scholar 

  19. Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography. In: Proceedings of ACL, vol. 27 (1989)

    Google Scholar 

  20. Turney, P.D.: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Flach, P.A., Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, p. 491. Springer, Heidelberg (2001)

    Chapter  Google Scholar 

  21. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. University of Michigan, MI (2003)

    Google Scholar 

  22. Matsuo, Y., Sakaki, T., Uchiyama, K., Ishizuka, M.: Graph-based word clustering using a web search engine. University of Tokio (2006)

    Google Scholar 

  23. Dellaert, F.: The Expectation-Maximization Algorithm. Elsevier, New York (2002)

    Google Scholar 

  24. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1), 145–151 (1991)

    Article  MathSciNet  MATH  Google Scholar 

  25. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. J. Roy. Stat. Soc. 28, 100–108 (1979)

    MATH  Google Scholar 

  26. Joyce, J.M.: Kullback-Leibler Divergence: International Encyclopedia of Statistical Science. Springer, Heidelberg (2011)

    Book  Google Scholar 

  27. Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD (2004)

    Google Scholar 

  28. Pozrikidis, C.: Node degree distribution in spanning trees. J. Phys. A: Math. Theoret. 49(12) (2016)

    Google Scholar 

  29. Bastianelli, E., Croce, D., Basili, R., Nardi, D.: Using semantic models for robust natural language human robot interaction. In: Gavanelli, M., et al. (eds.) AI*IA 2015. LNCS, vol. 9336, pp. 343–356. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24309-2_26. http://dx.doi.org/10.1007/978-3-319-24309-2

    Chapter  Google Scholar 

  30. Wu, L., Hua, X.S., Yu, N., Ma, W.Y., Li, S.: Flickr Distance. In: Proceedings of Microsoft Research Asia (2008)

    Google Scholar 

  31. Manning, D., Schutze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, London (2002)

    MATH  Google Scholar 

  32. Franzoni, V., Poggioni, V., Zollo, F.: Automated book classification according to the emotional tags of the social network Zazie. In: CEUR-WS on ESSEM, AI*IA, vol. 1096, pp. 83–94 (2013)

    Google Scholar 

  33. Franzoni, V., Leung, C.H.C., Li, Y., Milani, A., Pallottelli, S.: Context-based image semantic similarity. In: 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, pp. 1280–1284 (2015). doi:10.1109/FSKD.2015.7382127

  34. Chiancone, A., Franzoni, V., Li, Y., Markov, K., Milani, A.: Leveraging zero tail in neighbourhood based link prediction. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3, pp. 135–139 (2015). doi:10.1109/WI-IAT.2015.129

  35. Franzoni, V., Milani, A.: A Pheromone-like model for semantic context extraction from collaborative networks. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Singapore, pp. 540–547 (2015). doi:10.1109/WI-IAT.2015.21

  36. Di Iorio, A., Schaerf, M.: Identification semantics for an organization establishing a digital library system. In: Risse, T., Predoiu, L., Nürnberger, A., Ross, S. (eds.) Proceedings of the 4th International Workshop on Semantic Digital Archives (SDA 2014) Co-located with the International Digital Libraries Conference (DL 2014), vol. 1306, London, UK, September 12, 2014, pp. 16–27. CEUR-WS.org (2014). http://ceur-ws.org/Vol-1306

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Valentina Franzoni .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Franzoni, V., Milani, A. (2016). A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2016. ICCSA 2016. Lecture Notes in Computer Science(), vol 9790. Springer, Cham. https://doi.org/10.1007/978-3-319-42092-9_34

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-42092-9_34

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-42091-2

  • Online ISBN: 978-3-319-42092-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics